on logging serial consoles.

So every now and again a customer will complain of a crashing domain. Occasionally, it is an early sign of a hardware problem that I need to deal with, so I don't want to just ignore it.

Now, the problem is that, as with a physical server, once the domain has rebooted, most of the information about why it crashed is gone. (What little is left is in /var/log on the guest, and as a general rule we don't like mucking around in the guest. That's your business, not ours.)

Now, on a physical server, we solve this by using a logging serial console. (I recommend opengear if you have the money, and a used cyclades if you don't. The 'buddy system' (making one server the console server for the next, and that server the console server for the first) usually requires adding usb serial dongles, but is cheaper still for installations with only a few servers. I personally like the IOgear brand usb -> serial dongles Fry's has.)

For Xen domains, I can turn on debug logging in xenconsoled, which will log the console for all domains (one file for each domain); then I can use those logs to troubleshoot the problem. The thing is, apparently some people have privacy concerns with this, so I haven't done it yet.

Now, personally, I don't think serial consoles are that sensitive. I mean, it's common to leave terminals in data centers where passersby can see the output. The logs would let me see what program is crashing, which may be sensitive, and depending on how you have the thing configured, I could see when people log in and log out.

So, I have several options.

  1. I could leave it as is, and continue to go back and forth and guess whenever someone asks me why something crashed after a reboot
  2. I can log all consoles and delete the data once a week or once a month
  3. I can apply a patch to log some people's consoles and not others, and let the user decide

Obviously, option 2 makes my life a /whole lot/ easier. Option 3 is better than option 1, but it still means maintaining an out-of-tree xenconsoled (or pushing the patch upstream).

7 Comments

what about this? how about I log the consoles for three days? I'll set it up with logrotate or something so that it rotates it every night and deletes it on the third night? how's that sound?

this makes my life easier as I don't have to manage some opt-out system... and most of the time, if you don't complain in 3 days, well, you aren't going to get help anyhow.
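For what it's worth, the three-day cleanup could be as small as the sketch below. This is only an illustration, not what actually runs here: the function name `purge_old_logs` and the idea of pointing it at a per-domain console log directory are assumptions, and logrotate would do the same job.

```python
import time
from pathlib import Path

# Retention window in days; three is the figure proposed above.
RETENTION_DAYS = 3


def purge_old_logs(log_dir, max_age_days=RETENTION_DAYS, now=None):
    """Delete regular files in log_dir last modified more than max_age_days ago.

    log_dir is a hypothetical directory holding one console log per domain.
    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    for path in Path(log_dir).iterdir():
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Run nightly from cron, this only removes files that have not been written to for three days, so it assumes something else rotates the live log every night. A log that a daemon still holds open keeps its disk space until the daemon closes or reopens it.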

What if you sanitized the logs before saving them? Basically, take out any user-specific data like usernames, paths, etc. This might make it harder for you to pinpoint exactly what the problem is, but it would also eliminate some privacy concerns and still give you a better idea of what the problem might be. For example, if nginx was the problem, you might not know what file caused it, but at least you'd have narrowed it down to the web server on a particular VM.

the problem with regexing out all identifiable information is that it'd be quite difficult. I'd pretty much need a regex for every common log line type, and even then I'd certainly miss some. At that point it'd be easier for me to set up an opt-in or opt-out mechanism.
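To make that concrete, here is a minimal sketch of what regex-based scrubbing might look like. The three patterns and the sample log lines are made up for illustration; they are not taken from any real console log.

```python
import re

# Illustrative patterns only: real console output comes in far more
# formats than these three rules cover.
PATTERNS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),  # IPv4-looking strings
    (re.compile(r"/home/[^/\s]+"), "/home/<user>"),        # home directory owner
    (re.compile(r"(for user )\S+"), r"\1<user>"),          # PAM-style session lines
]


def sanitize(line):
    """Apply each redaction pattern in turn and return the scrubbed line."""
    for pattern, replacement in PATTERNS:
        line = pattern.sub(replacement, line)
    return line
```

A line like `session opened for user alice from 10.0.0.5` comes out scrubbed, but something as simple as `login: carol` passes through untouched because no rule matches it, which is exactly the maintenance problem.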

I guess the issue is one of trust; at a minimum our data is on your physical servers so we needn't kid ourselves that we have a good security / privacy standpoint from the get go.

I think your customers have a reasonable amount of control over what COULD get spat out onto the console. If your customer's app is so critical that the console output of a crashed kernel or whatever is too sensitive, then they should be running their own server in a rack to which only they have the keys :)

I think I agree with Hilt86 on this one - if you trust someone to run your virtual server, surely you trust them to keep your logs - after all, they have write access to your whole filesystem. I would prefer it if the logs were purged regularly - your three day suggestion is fine, but anything up to a week would work for me.

I agree that the 3 day rotation is reasonable as long as you mention it in the privacy policy. It also might be nice to add a wiki page on general ways to limit what is written to the console.

I concur with John. Three day rotation (except in event of crash or support request, in which case the log is copied elsewhere for examination) + updated privacy policy. Add to that an email blast to all existing customers notifying of privacy policy changes. The wiki page is the icing on the cake.

About this Entry

This page contains a single entry by luke published on June 15, 2010 1:40 PM.
