June 2015 Archives

Leap second behavior

Guests running older Linux kernels, such as the one shipped with CentOS 5, may be using the host server (the dom0) time. If

/sbin/sysctl xen.independent_wallclock

returns 'xen.independent_wallclock = 0', you are using the time we set. To have the guest keep its own clock instead, run '/sbin/sysctl -w xen.independent_wallclock=1'.
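The same check can be scripted. This is a minimal sketch, assuming the sysctl is exposed at /proc/sys/xen/independent_wallclock as on older Xen guest kernels; the function names are made up for illustration.

```python
WALLCLOCK_PATH = "/proc/sys/xen/independent_wallclock"


def read_wallclock(path=WALLCLOCK_PATH):
    """Return the raw setting, or None if the kernel does not expose it."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return None


def describe_wallclock(raw):
    """Map the raw setting to where the guest gets its time from."""
    if raw is None:
        # Newer kernels have no such setting; the guest should run ntp.
        return "unknown"
    return "dom0" if raw.strip() == "0" else "independent"


if __name__ == "__main__":
    print(describe_wallclock(read_wallclock()))
```

A result of "dom0" means the guest follows the host clock, so whatever the host does at the leap second applies to it too.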

On the CentOS 5 host servers, which are the majority of them, a leap second will be inserted. This is what occurred in 2012 and there were no issues on our end.

On the CentOS 6 host servers, the time is being slewed instead of a leap second being inserted per "Workaround 2" on https://access.redhat.com/articles/199563. These servers are as follows:


You can check your current dom0 by looking at the result of 'host <guest>.dom0.xen.prgmr.com'.

We plan to upgrade to the latest version of Blesta this Friday the 26th. I expect it will be done by midnight.

cattle/girdle down

UPDATE 6:50PM PDT: It looks like everyone is back up. The 17 affected customers will receive credit for the downtime.
UPDATE 6:35PM PDT: rippleweb says they are in the process of changing out their power hardware and expect us to be back up in about half an hour. After that, it will still be at least a little while before our machines boot and everyone comes back up.
UPDATE 5:18PM PDT: rippleweb support has responded and said they're looking into it.
Network outage. I've contacted rippleweb support.

Incidentally, anyone who would like to be moved from this facility can just contact us, but we won't force you to move (at least not right now).

Two events make a trend. Both girdle and mcgrigor had domains constantly restarting and both encountered out-of-memory (OOM) conditions at some point. None of the other host servers had domains restarting constantly and we haven't seen OOMs in several years otherwise, so I think preventing excessive domain reboots should avoid the problem.

The xen daemon xend will kill domains that are up for about 10 seconds or less but does not do anything about domains that are up longer than this. xl (the preferred xen toolstack for xen 4) does not do any throttling whatsoever.

In xen it is possible to set the action for what happens if a domain crashes, so we could say to destroy a domain instead of restarting it if it crashes. But we can't make the changes for running domains, and even if we could, this wouldn't catch someone maliciously rebooting their server over and over in an attempt to trigger an error.

Yesterday I deployed a daemon that keeps track of whether a domain is restarted too often and destroys it if it is. It also sends us a notification. I hope this will keep similar problems from happening in the future.
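For illustration only, the core of such a check could be a sliding-window counter like the sketch below. This is not the daemon we deployed: the class name and the thresholds are hypothetical, and actually destroying the domain and sending the notification are left to the caller.

```python
import time
from collections import defaultdict, deque


class RestartTracker:
    """Flag domains that start more than max_restarts times per window."""

    def __init__(self, max_restarts=5, window_seconds=600):
        # Both thresholds are illustrative assumptions, not production values.
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._starts = defaultdict(deque)

    def record_start(self, domain, now=None):
        """Record one start of `domain`; return True if it restarts too often."""
        now = time.time() if now is None else now
        starts = self._starts[domain]
        starts.append(now)
        # Forget starts that have fallen out of the window.
        while starts and now - starts[0] > self.window_seconds:
            starts.popleft()
        return len(starts) > self.max_restarts


if __name__ == "__main__":
    tracker = RestartTracker(max_restarts=3, window_seconds=60)
    for t in (0, 10, 20, 30):
        print(t, tracker.record_start("example-domain", now=t))
```

Because this counts starts rather than crashes, it covers both a domain stuck in a crash loop and someone deliberately rebooting over and over.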

As to what exactly triggered the OOM: there are three standard daemons related to xen in CentOS 5: xend, xenstored, and xenconsoled. My guess is that on girdle the OOM was not a memory leak in xend, because xend was killed and the OOM condition was still there. xenconsoled has not been restarted and does not have excessive memory usage. Under certain conditions, many backups of the xenstored database 'tdb' can be made, and these are kept in tmpfs, so I think this is the most likely source of the OOM.

We should probably exclude xenstored, which can't be restarted, from the OOM killer by using OOM_DISABLE per http://linux-mm.org/OOM_Killer. This is preferable to disabling the OOM killer entirely.
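As a rough sketch of what that would involve: on kernels of this vintage, OOM_DISABLE is the value -17 written to /proc/<pid>/oom_adj, as described on the linux-mm page. The helper name and the `proc_root` parameter below are hypothetical, and the pid of xenstored has to be looked up separately.

```python
OOM_DISABLE = -17  # oom_adj value that exempts a process from the OOM killer


def exclude_from_oom_killer(pid, proc_root="/proc"):
    """Write OOM_DISABLE to the process's oom_adj file (needs root)."""
    path = "%s/%d/oom_adj" % (proc_root, pid)
    with open(path, "w") as f:
        f.write(str(OOM_DISABLE))
    return path
```

The setting is per process and does not survive a restart of that process, so it would have to be reapplied after the host boots.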

dish is down, we're working on it

UPDATE: As of about 7:58 dish is back up. We don't have a root cause. We will be prioritizing moving the 14 VMs on this physical server to a different host.
As of 6:58 PDT, I don't have any more information.

TL;DR summary: if you have a VM on white your system clock will be off by several hours under one of two conditions:

1. The VM rebooted since April 15th and doesn't set the time on boot with ntpdate or ntpd -q
2. The VM has /proc/sys/xen/independent_wallclock set to 0.

Modern kernels don't have independent wallclock any more and you're supposed to run ntp; see http://wiki.xenproject.org/wiki/Xen_FAQ_DomU.

I believe I have a sane way to fix it but it will take a few days to converge. Read on for details.

If you are affected due to independent_wallclock please reach out to us; we will contact the 3 people who have rebooted.

More details:

One problem with white was that while ntpd was running, it was not syncing with upstream servers. I thought ntpd would quit if it wasn't syncing. Unfortunately that was not the case. This is the relevant part of the output from 'ntpstat' when that happens:

synchronised to local net at stratum 11

The offending lines in ntp.conf were:

server     # local clock
fudge stratum 10 

You would think this would only apply if upstream servers couldn't be reached. There are servers with an identical ntp.conf and firewall rules and they are synchronized, which should rule out connectivity issues related to those.

My best guess from looking at the logs is that at boot it tried to connect before networking was up and at some point gave up on contacting anything upstream, but I haven't spent much time researching it. At boot, the time was off by about 3 minutes according to our logs, which is less than the 1000 second cutoff that ntpd is supposed to allow before giving up.

ntpd not actually syncing is a problem, but it wasn't the only problem. The drift was larger than normal. Using adjtimex -p I saw a "tick" value of 10025 for the number of microseconds between clock ticks, while on almost all the other servers it is 10000. Assuming the counter is 100Hz, that is an extra 25 microseconds on each of 100 ticks, so 2.5ms was added to the system clock per real second (about 216 seconds per day). This is on the same order as the error that we saw.
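The arithmetic can be checked with a few lines. This is a sketch assuming a 100Hz counter and a nominal tick of 10000 microseconds; the 60 day span in the example is only an illustrative figure, not a measured one.

```python
def drift_seconds_per_day(tick_us, nominal_tick_us=10000, hz=100):
    """Seconds gained per real day by a clock whose tick is tick_us."""
    extra_us_per_second = (tick_us - nominal_tick_us) * hz
    # 86400 real seconds per day, then convert microseconds to seconds.
    return extra_us_per_second * 86400 / 1000000


if __name__ == "__main__":
    per_day = drift_seconds_per_day(10025)
    print("seconds gained per day:", per_day)
    # Illustrative only: accumulated error after an assumed 60 days.
    print("hours gained in 60 days:", per_day * 60 / 3600)
```

A bit over two hundred seconds a day adds up to hours over a couple of months, which matches the size of the offset described here.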

I don't know if ntpd set the tick value to be way off or if it was loaded from /etc/adjtime, but I think if ntpd had actually been syncing it wouldn't have been as big of a problem.

But the domUs, unless they're set to use time from the dom0, maintain their own clock which is not affected by this tick value. I know this because I had a canary server on white with about the right time and this server was not using ntp, had no independent_wallclock parameter, and had not been rebooted since April 15th.

Unfortunately the time is too far off for ntpd to converge right now. So the way I am fixing this (since many programs become fussy if time goes backwards) is to set the tick value using 'adjtimex -t' to be slower than real-time until the system time is within what ntpd will handle. I tried changing the tick value on a test server and letting it run for a while and it seems to work correctly. Once the time is reasonably sane I will delete the drift file, run 'hwclock --systohc' and restart ntpd without '-q' and without a local time source allowed.
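To get a feel for how long this takes, here is a small sketch that estimates the convergence time for a given tick value. The offset and the tick in the example are hypothetical; the 1000 second target is the ntpd cutoff mentioned earlier.

```python
def hours_to_converge(offset_seconds, tick_us, within_seconds=0,
                      nominal_tick_us=10000):
    """Hours until a clock that is ahead by offset_seconds is within
    within_seconds of real time while running with a slow tick."""
    if tick_us >= nominal_tick_us:
        raise ValueError("tick must be below nominal for the clock to run slow")
    # Seconds the system clock falls back per real second.
    slowdown = (nominal_tick_us - tick_us) / nominal_tick_us
    remaining = max(offset_seconds - within_seconds, 0)
    return remaining / slowdown / 3600


if __name__ == "__main__":
    # Hypothetical: clock 3 hours ahead, tick slowed by 5 percent.
    print(hours_to_converge(3 * 3600, 9500, within_seconds=1000))
```

A smaller slowdown is gentler on anything timing short intervals inside the dom0 but takes proportionally longer, which is why the fix takes days rather than minutes.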

Corrective action going forward is to manage the CentOS 5 ntp.conf files along with the CentOS 6 ones and, when switching chassis, to delete /var/lib/ntp/drift and make a dummy /etc/adjtime file.

~20m network outage on girdle

Some of the antispoof rules were cleared on girdle, which resulted in all traffic being blocked. After logging in via the console I manually fixed up the antispoof rules to allow traffic again. I will post again when the root cause has been addressed.

We had heavy packet loss for about fifteen minutes to wilson... and to nowhere else. Tcpdump showed almost no packets going in or out, very consistent with a physical layer networking issue, but netstat -i is clean, and dmesg indicates no problems...

It is a mystery. It's been good for ten minutes now. If it happens again, it will wake us; and we've got enough spare hardware to swap it out if it is, in fact, a physical layer problem.

About this Archive

This page is an archive of entries from June 2015 listed from newest to oldest.
