Time is really bad on white but will be fixed

TL;DR: if you have a VM on white, your system clock will be off by several hours under either of two conditions:

1. The VM rebooted since April 15th and doesn't set the time on boot with ntpdate or ntpd -q
2. The VM has /proc/sys/xen/independent_wallclock set to 0.
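A quick way to check the second condition from inside a guest (a minimal sketch; the /proc path is the standard Xen one, but the fallback message wording is mine):

```python
# Check whether this domU has the independent_wallclock knob and,
# if so, whether it is 0 (i.e. inheriting the dom0's bad clock).
import os

path = "/proc/sys/xen/independent_wallclock"
if os.path.exists(path):
    with open(path) as f:
        value = f.read().strip()
    # "0" means the guest follows the dom0 clock on white
    print("independent_wallclock =", value)
else:
    print("no independent_wallclock; modern kernel, run ntpd instead")
```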

Modern kernels don't have independent_wallclock any more and you're supposed to run ntp instead; see http://wiki.xenproject.org/wiki/Xen_FAQ_DomU.

I believe I have a sane way to fix it but it will take a few days to converge.  Read on for details.

If you are affected due to independent_wallclock, please reach out to us; we will contact the 3 people who have rebooted.

More details:

One problem with white was that although ntpd was running, it was not syncing with the upstream servers. I had assumed that ntpd would quit if it couldn't sync; unfortunately, that is not the case. This is part of the output from 'ntpstat' when that happens:

synchronised to local net at stratum 11

The offending lines in ntp.conf were:

server  127.127.1.0     # local clock
fudge   127.127.1.0 stratum 10 

You would think this fallback would only apply if the upstream servers couldn't be reached. But there are other servers with an identical ntp.conf and identical firewall rules that are synchronized, which should rule out connectivity issues related to those.
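For reference, removing the local-clock fallback looks roughly like this (a sketch only; the pool server names are illustrative, not our actual configuration):

```
# /etc/ntp.conf sketch: upstream servers only, no local-clock fallback.
driftfile /var/lib/ntp/drift
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
# The lines below are the ones being removed -- with them present,
# ntpd can "sync" to itself at a high stratum and stop chasing upstream:
#server  127.127.1.0     # local clock
#fudge   127.127.1.0 stratum 10
```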

My best guess from looking at the logs is that ntpd tried to contact the upstream servers at boot, before networking was up, and at some point gave up on them, but I haven't spent much time researching it. At boot the time was off by about 3 minutes according to our logs, well under the 1000-second panic cutoff that is supposed to make ntpd give up.

ntpd not actually syncing is a problem, but it wasn't the only one. The drift was larger than normal. 'adjtimex -p' showed a "tick" value of 10025 for the number of microseconds between clock ticks, while on almost all of our other servers it is 10000. Assuming a 100 Hz tick, that adds 2.5 ms to the system clock per real second, which is the same order of error that we saw.
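The back-of-the-envelope arithmetic, assuming a 100 Hz tick (HZ=100, so 10000 us per tick is nominal) and counting from April 15th:

```python
# How much time a tick of 10025 us gains over a nominal 10000 us tick.
nominal_tick_us = 10000   # what adjtimex -p shows on healthy hosts
bad_tick_us = 10025       # what white reported
hz = 100                  # assumed ticks per real second

extra_us_per_s = (bad_tick_us - nominal_tick_us) * hz
print(extra_us_per_s / 1000, "ms gained per real second")   # 2.5

gain_per_day_s = extra_us_per_s * 86400 / 1e6
print(gain_per_day_s, "seconds gained per day")             # 216.0

# Over the ~8 weeks since April 15th, that is indeed several hours:
print(round(gain_per_day_s * 56 / 3600, 1), "hours")        # 3.4
```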

I don't know whether ntpd set the tick value that far off or whether it was loaded from /etc/adjtime, but if ntpd had actually been syncing I don't think it would have been as big a problem.

But the domUs maintain their own clock, which is not affected by this tick value, unless they're set to use time from the dom0. I know this because a canary server on white had about the right time even though it was not using ntp, had no independent_wallclock parameter, and had not been rebooted since April 15th.

Unfortunately the time is now too far off for ntpd to converge on its own. Since many programs become fussy if time goes backwards, the way I am fixing this is to use 'adjtimex -t' to set the tick value slower than real time until the system time is within what ntpd will handle. I tried changing the tick value on a test server, let it run for a while, and it seems to work correctly. Once the time is reasonably sane I will delete the drift file, run 'hwclock --systohc', and restart ntpd without '-q' and without a local time source allowed.
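The convergence estimate works out like this (a sketch under stated assumptions: the clock being ~3 hours fast and a slow tick value of 9500 are illustrative numbers, not measurements; the actual usable tick range depends on the kernel):

```python
# How long walking the clock back with a slow tick takes.
hz = 100
nominal_tick_us = 10000
slow_tick_us = 9500          # hypothetical 'adjtimex -t' value
error_s = 3 * 3600           # assumed offset: 3 hours fast

# Seconds shed per real second while running slow:
correction_per_s = (nominal_tick_us - slow_tick_us) * hz / 1e6
print(correction_per_s, "seconds shed per real second")  # 0.05

days = error_s / (correction_per_s * 86400)
print(days, "days to converge")                          # 2.5
```

which is why this fix takes a few days rather than being instantaneous.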

Corrective action going forward is to manage the CentOS 5 ntp.conf files along with the CentOS 6 ones, and to delete /var/lib/ntp/drift and create a dummy /etc/adjtime file when switching chassis.


About this Entry

This page contains a single entry by srn published on June 10, 2015 5:43 PM.
