Follow-up about girdle and mcgrigor downtime

Two events make a trend. Both girdle and mcgrigor had domains that were constantly restarting, and both hit out-of-memory (OOM) conditions at some point. None of the other host servers had domains restarting constantly, and we otherwise haven't seen OOMs in several years, so I think preventing excessive domain reboots should avoid the problem.

The xen daemon xend will kill domains that are up for about 10 seconds or less, but does not do anything about domains that stay up longer than this. xl (the preferred xen toolstack for xen 4) does not do any throttling whatsoever.

In xen it is possible to set the action taken when a domain crashes, so we could say to destroy a domain instead of restarting it if it crashes. But we can't make this change for running domains, and even if we could, it wouldn't catch someone maliciously rebooting their server over and over in an attempt to trigger an error.

Yesterday I deployed a daemon that keeps track of whether a domain is restarted too often and destroys it if it is. It also sends us a notification. I hope this will keep similar problems from happening in the future.
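To give a rough idea of the approach, here is a minimal sketch of the restart-counting part, not the daemon we actually deployed. The class name, the limit of 5 restarts, and the 10 minute window are all made up for illustration; how restarts get detected and how the domain gets destroyed are left out.

```python
import time
from collections import defaultdict, deque


class RestartTracker:
    """Flags domains that restart too often within a sliding time window.

    Hypothetical sketch: max_restarts and window_seconds are illustrative
    values, not the thresholds used in production.
    """

    def __init__(self, max_restarts=5, window_seconds=600):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        # domain name -> timestamps of recent restarts, oldest first
        self._restarts = defaultdict(deque)

    def record_restart(self, domain, now=None):
        """Record one restart; return True if the domain is over the limit."""
        now = time.time() if now is None else now
        stamps = self._restarts[domain]
        stamps.append(now)
        # Forget restarts that have fallen out of the window.
        while stamps and now - stamps[0] > self.window_seconds:
            stamps.popleft()
        return len(stamps) > self.max_restarts
```

When `record_restart` returns True, the caller would destroy the domain (for example with `xl destroy`) and send the notification. Counting within a sliding window rather than only looking at uptime also covers domains that stay up longer than 10 seconds between reboots.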

As to what exactly triggered the OOM: there are three standard daemons related to xen in CentOS5: xend, xenstored, and xenconsoled. My guess is that on girdle the OOM was not caused by a memory leak in xend, because xend was killed and the OOM condition was still there. xenconsoled has not been restarted and does not have excessive memory usage. Under certain conditions, however, many backups can be made of the xenstored database 'tdb', and these are kept in tmpfs, so I think this is the most likely source of the OOM.
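One way to check the tdb theory on a host is to add up how much space the database and its copies take. The sketch below is only an illustration: the function name is made up, and both the directory to scan and the `tdb*` filename pattern are assumptions that should be checked against the actual host before relying on the numbers.

```python
import glob
import os


def tdb_usage(directory, pattern="tdb*"):
    """Return (file_count, total_bytes) for files in directory matching pattern.

    The pattern "tdb*" is an assumed naming scheme for the database and
    its backup copies; adjust it to whatever the host actually contains.
    """
    paths = [
        p for p in glob.glob(os.path.join(directory, pattern))
        if os.path.isfile(p)
    ]
    return len(paths), sum(os.path.getsize(p) for p in paths)
```

Since the files live in tmpfs, every byte counted here is memory that cannot be reclaimed by killing a process, which would fit with the OOM condition persisting after xend was killed.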

We should probably exclude xenstored, which can't be restarted, from the OOM killer by using OOM_DISABLE. This is preferable to disabling the OOM killer entirely.
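For reference, on kernels of this vintage OOM_DISABLE is the value -17 written to `/proc/<pid>/oom_adj`. Here is a minimal sketch of doing that, assuming root privileges and a kernel that still has `oom_adj` (newer kernels use `oom_score_adj`, where -1000 has the same effect). The function name and the `proc_root` parameter are made up for illustration.

```python
import os

OOM_DISABLE = -17  # oom_adj value that exempts a process from the OOM killer


def disable_oom_kill(pid, proc_root="/proc"):
    """Write OOM_DISABLE to <proc_root>/<pid>/oom_adj and return the path.

    proc_root is a parameter only so the function can be exercised against
    a scratch directory; on a real host it stays at /proc.
    """
    path = os.path.join(proc_root, str(pid), "oom_adj")
    with open(path, "w") as f:
        f.write(str(OOM_DISABLE))
    return path
```

The setting belongs to the process, so it has to be reapplied whenever xenstored is started, for example from its init script.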

About this Entry

This page contains a single entry by srn published on June 13, 2015 10:34 AM.
