As you may already be aware, the Xen project recently made several Xen Security Advisories public. As a public service provider, we believe we must be fully patched against any critical vulnerability by the time it is publicly released. Specifically, we will perform upgrades for information leak and privilege escalation vulnerabilities, but not for denial-of-service vulnerabilities, since at present applying those fixes would itself disrupt service.
This particular round of reboots was primarily due to XSA-155. XSA-155 can lead to privilege escalation, which is one of the vulnerability classes we perform upgrades for.
We barely met our goal of being fully patched by the time of public release - the last round of VMs came up about 5 minutes before the embargo deadline. We aren't sure whether this counts as terrible or perfect planning, since we gave as much notice as we could before doing the reboots.
Here is a (perhaps too) detailed account of what happened:
Friday, December 4th: The first notice for XSA-155 was received this morning. I reviewed our current set of servers and came up with an upgrade schedule, structured around the following goals:
- Customers would receive at least one week’s notice that they would be rebooted.
- There would be one day of relatively few reboots in advance of the bulk of the work, so that we could work out updates to our ansible playbooks and procedures while only one server was down at a time. This day would fall outside business hours for most people (i.e. Friday night).
- These test upgrades would cover at least one server running each operating system we had at the time, specifically CentOS 5 and CentOS 6.
- The upgrades we believed could be the most problematic were to be scheduled for the weekend.
- The majority of reboots would occur on the weekend, since most of our customers are in the US. Accordingly, most of the updates happened on Saturday and Sunday, starting at 20:00 PST and originally scheduled to end at 2:00.
- Any upgrades that had to occur during the week would happen at night for the majority of our customers - these were originally scheduled for 21:00 and 23:00.
Of course this changed, as I will get into later.
The above schedule was completed the same day. All initial notification emails were sent by 20:00 -0800 PST, for reboots starting on December 11th.
We found out later that some people did not see this email. We don't know whether the email was simply missed or went to spam. Please whitelist prgmr.com and add any extra addresses you want notified as technical contacts - see our wiki page on managing contacts.
Also on Friday, an email was sent to CentOS requesting any pending package updates so we could start from the latest versions.
Saturday 5th and Sunday 6th: Spent trying to get the next version of the distributions finished (we were not able to in time).
Monday 7th: Sanity day
Tuesday 8th: We were initially warned that there were circumstances under which the patches wouldn't be effective, specifically if a compiler had gcc bug 58145. Since there was no CVE for the bug and no good way to know which versions were affected, I went back to the original test case for that bug and ran it against the CentOS 6 and CentOS 5 compilers. I am not terribly familiar with x86 assembly, so it took me a while to go through each line and understand what was going on, but I confirmed that the compiler bug was not present in either the CentOS 6 or CentOS 5 compiler.
Wednesday 9th: Reviewed which XSAs were applied to CentOS 6 to try to confirm there would not be any regressions. I worked backwards and mostly trusted the changelog. There was one XSA not applied (142), but it does not impact us since it is for file-backed storage. I sent an email to the CentOS-virt SIG informing the group. For the XSA review, I went back as far as XSA 60, but at some point they had done a complete import from vanilla Xen and I decided that was probably far enough.
I applied the patches to the latest Xen4CentOS package and rebuilt. It was first loaded on an HVM dom0, during which we found that oxenstored did not work properly with xl console, so a fix for that got added to our ansible roles. It was next loaded on our test box.
Thursday 10th: I reviewed the XSA history for CentOS 5 and discovered that Red Hat had not patched some CVEs that we had performed reboots for back in February and March of this year - most notably CVE-2015-2151. Because of this, we had regressions on two machines that were updated to the latest CentOS 5. I also looked at backporting the patches for XSA-155 and found that the code differed so much that I had no confidence in my ability to patch it correctly.
At this point I didn't see any option but to upgrade from CentOS 5 to CentOS 6 (we've already been running CentOS 6 in production for a year), which was a significant change in strategy. However, I had done some work on this back in December 2014 and it had mostly been working, so it was not a stretch to change plans this late.
Doing the CentOS 5 to CentOS 6 change required booting into rescue mode. This was practical for all of our machines except the two in Sacramento that were already scheduled to be decommissioned. We decided to take a server with CentOS 6 already installed (at least that was the plan) up there, and then move the drives from the old machines to the new one. The first of the Sacramento machines had initially been scheduled for Friday night, but we weren't going to be ready, so I rescheduled it from Friday to Saturday night. Since Luke was going to have to slog up to Sacramento and I was probably going to need to help with the migration, I rescheduled two more updates from Saturday night to Tuesday and Wednesday night at 21:00 -0800 PST respectively. These notices were all sent with at least 24 hours notice. We started memory testing on the machine slated for Sacramento.
Friday 11th: There was not enough time to test, but since Friday is probably the best time for downtime, I didn't want to reschedule to a different night. So I sent an email to the CentOS 5 customers for that night letting them know I didn't know exactly when the maintenance would start or end. This happened about 7.5 hours before the window was scheduled to start.
I worked through changes to our playbook for updating CentOS 6 on our test machine. Since there had been other reboots for XSAs with CentOS 6 quite recently, I did not do a full run. I ran through some of the procedure for the CentOS 5 to CentOS 6 upgrade with an HVM dom0, but was not able to complete this before the scheduled time for the CentOS 6 upgrade. The CentOS 6 upgrade went well. After this I booted a decommissioned CentOS 5 host and went through the full upgrade procedure. The upgrade for the production CentOS 5 server started about 3 hours late (~ 1:00 -0800 PST Saturday) and ended about 2 hours late (~ 2:00).
Saturday 12th: I got most of the way through installing CentOS 6 on the new Sacramento server, but was not able to finish by the time Luke needed to catch his train.
Saturday night did not go as well. Here are some of the problems encountered:
- Difficulty breaking into the BIOS to netboot, due to the lack of function keys on a Chromebook.
- Changes in the names of network devices between CentOS 5 and CentOS 6. On subsequent nights this was avoided because we added a template for /etc/udev/rules.d/70-persistent-net.rules to the ansible roles.
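For reference, a templated 70-persistent-net.rules on CentOS 6 pins interface names to MAC addresses, roughly like this (the MAC addresses here are placeholders; the real values would be filled in per host, e.g. from ansible inventory):

```
# /etc/udev/rules.d/70-persistent-net.rules (sketch)
# Pin each NIC to a stable name by MAC so eth0/eth1 don't swap
# between the CentOS 5 and CentOS 6 kernels.
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:11:22:33:44:55", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:11:22:33:44:56", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"
```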
- The chroot script did not unmount things in the correct order on all hosts, and it was overly aggressive about cleaning up after itself, so the work for at least a couple of servers was lost and had to be redone. The script was fixed that night.
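The ordering fix boils down to unmounting the deepest paths first, so a parent is never busy while its children are still mounted. A minimal sketch (the /mnt/sysimage layout is an illustrative assumption, not our exact script; because a parent mount point is a string prefix of its children, a reverse lexicographic sort yields children before parents):

```shell
# Tear down chroot mounts deepest-first.
for m in $(printf '%s\n' /mnt/sysimage /mnt/sysimage/dev \
                         /mnt/sysimage/proc /mnt/sysimage/sys | sort -r); do
    echo "umount $m"    # replace echo with the real umount
done
```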
- The procedure was not as well documented as it should have been, and Friday night wasn't used as a training run, so people missed some steps and misunderstood others.
- Some people were running a more modern version of ansible that implemented the yum module differently, such that a repository was not getting installed before use.
- On one server, the serial console text turned black during the boot process so we initially thought a server wasn’t booting when it was.
- On one server, there was a domain that should have been canceled but wasn't, so the last domain to boot couldn't because there wasn't enough memory, and we didn't notice until someone complained. We had had checks for things like this before, and they were added back for subsequent nights.
- Our ansible playbooks had some hacks for older servers that were originally conditional on CentOS 5 but should not have been any more, and we didn't discover this until customers reported the next day that they couldn't access the management console. It was fixed as soon as we found out about it.
- A number of VMs failed to boot for problems we did not have time to look into, but we sent emails about it. More on this later.
On average, these upgrades ran an hour over the scheduled maintenance window. For the Sacramento servers, we had to finish installing CentOS 6 and also write and test the playbooks to migrate the customer data. The actual downtime for the Sacramento customers was probably less than half an hour, but it occurred about an hour and a half after the end of the scheduled maintenance window.
Sunday 13th: We got paged and spent some time on network issues caused by another HE.net customer being DoSed. This is significant only insofar as it interfered with other work. There were new problems due to:
- ssh taking a while to die, such that we had to start manually killing it.
- A bug in an ansible playbook had been fixed a while ago, but the buggy changes it had originally made weren't reverted, so some Xen configuration files were invalid.
- Different BIOS versions having different navigation methods.
- Unusual RAID configurations, due to us adding disks later, combined with drive letter changes meant assembling the RAID by hand was difficult, and --assemble --scan didn't always quite work.
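For illustration, assuming Linux md RAID managed with mdadm, manual assembly looks roughly like this (device and array names are placeholders; the real member partitions had to be identified by examining each one, since drive letters had shifted):

```
# Check which partitions carry md superblocks and which array each
# belongs to (the array UUIDs in the output tie members together).
mdadm --examine /dev/sda1 /dev/sdb1 /dev/sdc1

# Assemble the array explicitly from the members found above,
# instead of relying on --assemble --scan to guess.
mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1
```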
- On some servers we encountered kernel messages related to mptsas and the swiotlb being full under heavy disk load. One server was down an additional two and a half hours beyond what was originally scheduled while we tried to make the problem go away. Research eventually suggested it is a performance problem only, at which point we decided to let it be. One host server that was upgraded early occasionally throws this error, but we picked a larger value for subsequent hosts, and on those we aren't seeing problems under normal operation. We intend to investigate this in more detail later, since if and when we start using SSDs it will probably become a limiting factor.
- Save/restore of VM memory (equivalent to hibernation) by xendomains wasn't disabled as it should have been. On one server that had to be rebooted to increase the swiotlb value, xendomains started saving the VM memories, which ended up filling the disk. I tried to arrange for the saved machines to be resumed after the reboot, but by then the change we had originally intended (disabling save/restore) was in place, so those domains weren't restored. This meant those 5 or 6 VMs were uncleanly shut down. Disabling save/restore was fixed in our ansible playbooks for subsequent runs.
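On CentOS, disabling xendomains save/restore amounts to blanking the save directory in its sysconfig file, roughly as below (exact variable defaults vary by Xen packaging, so treat this as a sketch):

```
# /etc/sysconfig/xendomains (excerpt)
# An empty XENDOMAINS_SAVE stops VM memory being saved to disk on
# shutdown; XENDOMAINS_RESTORE=false skips restoring any old images.
XENDOMAINS_SAVE=""
XENDOMAINS_RESTORE=false
```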
- One server that had been switched between chassis didn't have the right MAC address in its DHCP entry. This is one thing that hasn't been templated.
- One server had its ip6tables rules written out when it only had an IPv4 address, so IPv6 wasn't working for it after reboot, though it worked for the VMs. We removed the AAAA records for it instead of trying to fix the ip6tables rules.
Except for the server that ran 2.5 hours over, all of the servers ran one hour over the originally scheduled window.
Monday 14th: We got more concrete reports about VMs failing to boot under Xen 4. The worst case was a person running Fedora 17, for which there is no upgrade path. We ended up giving them a clean install of CentOS 7 and attaching their old block device as an extra disk.
We had previously moved at least a couple of hundred VMs from Xen 3 to Xen 4 so this was a surprise to us. Unfortunately we didn’t see a way to stay on Xen 3 and fix the XSAs. Possibly we should have left one box as CentOS 5 with the understanding it was going to be vulnerable, but we didn’t have time to set up a new server as we didn’t want to use an existing server with other customer data on it.
We wanted to delay the schedule to give people more time to upgrade the kernels, but this would have made it even more difficult to be patched by the time the embargo was lifted.
We wrote wiki pages on accessing your disk from the rescue image and on guest kernel changes for Xen 4, and sent an email to the remaining people who were going to be upgraded to Xen 4. It was very frustrating, because people would write us asking if their kernel versions would crash, and we couldn't really answer because we didn't (and don't) know what the underlying problem was - plus there are too many kernel versions out there. This is the downside to running your own kernel: we can't test every version. In the future we will consider an option for us to manage people's kernels instead of using pvgrub(2).
The upgrades for Monday night all completed within their window.
Problems this night:
- One server had its upgrades completed within the maintenance window, but I fat-fingered my password without noticing, and the emails about the server coming back up weren't sent.
- One server had the module line for the initramfs missing from grub.conf (a known problem with CentOS), and since we haven't gotten around to adding a playbook entry to fix it, we've been fixing it by hand when it happens. But there are differences in the partition layouts between servers, and the wrong version was being used, so the initramfs was not being found, which delayed completion of the upgrade.
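For context, a Xen dom0 stanza in a CentOS 6 grub.conf (GRUB legacy) puts the hypervisor on the kernel line and loads the dom0 kernel and initramfs as module lines; it is the second module line that goes missing. Version strings, the root device, and dom0 memory below are placeholders:

```
title CentOS 6 with Xen
    root (hd0,0)
    kernel /xen.gz dom0_mem=1024M
    module /vmlinuz-2.6.32-573.el6.x86_64 ro root=/dev/md1 console=hvc0
    module /initramfs-2.6.32-573.el6.x86_64.img
```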
- One server had a root partition much smaller than normal, and there wasn't enough room to install a new kernel onto it, so space had to be cleared first.
- One server, after being rebooted into the CentOS 6 rescue, was throwing warnings about some LVM physical volumes not being found. One partition that had originally been in the RAID was also not marked as a RAID component after the reboot. From the CentOS 5 root file system, the physical volumes were identified correctly. Since continuing with CentOS 6 seemed risky, we rebooted into the original CentOS 5 and used our move playbook to move VMs individually from this server to another with enough room. I considered other methods, but the move playbook is well tested and I knew it would work, even if it was slower than a more direct data transfer would have been. People on this server had up to an extra 8 hours of downtime, though it was the middle of the night for most of the customers and we tried to prioritize moves based on time zone. We have already given them a month's credit.
- Due to this one server being extremely problematic and requiring attention from multiple people, another server was rescheduled from Tuesday to Wednesday night with less than 24 hours notice.
Two of the servers were completed on time, one was an hour late, and as mentioned one was up to 8 hours late.
- When a drive was replaced on a CentOS 5 host during the maintenance window, but prior to the host being upgraded, a null pointer bug in the kernel caused it to spontaneously reboot, so all those people's VMs were shut down uncleanly. We hope the same problem does not exist in CentOS 6.
- The PXE boot menu generally worked with a serial console number of '0' even if the Linux serial console number was something other than 0. On one host this was not the case, and it took a number of reboots to isolate and fix the problem, since none of the other hosts (even ones hypothetically of the same configuration) had the same issue.
- Someone forgot to start the VMs after completing the upgrade, so there was about an additional hour and a half of unnecessary downtime for most people on that host server. We have "canaries" that should be in our monitoring, but nobody has gotten to it. It should be easy to generate the configuration files for them.
- One host was going OK during the upgrade but spontaneously shut down during the rescue image portion. I went and looked at the chassis, and it had a blinking red health light. I dug a little, but it seemed best to move the drives to a new chassis. After ironing out a few issues with netboot, the rest went smoothly, but the upgrade ran over the maintenance window by about 1.5 hours due to the hardware failure.
- The final upgrade went smoothly - it took about an hour and 15 minutes - and completed about 5 minutes before the embargo lifted. This was the one rescheduled from Tuesday night. I made the window 4 hours instead of 2 because I wanted to do it as soon as the previously scheduled upgrades had completed.
The public facing server for our own infrastructure was upgraded in about half an hour.
We intend to give a proportional credit (i.e. 1/744 of a month's payment for 1 hour of downtime, 744 being the number of hours in a 31-day month) for all of the maintenance outside the originally scheduled window, with the exception of the server where we already gave a month's credit. This should be completed within the next week.
We need to investigate the mptsas/swiotlb kernel issue more.
We need to automate changing people's public keys. This has been a TODO for a long time, but usually support load is light enough that we are able to respond in a reasonable amount of time. During this event it was very urgent for some people, and we were extremely short staffed, so sometimes it did not happen in time and people suffered. But there was a hard deadline, so we had to prioritize the needs of the majority of our customers. I highly recommend people check out GPG smart cards or possibly YubiKeys, so that their public key is not tied to a particular computer.
We should add the already allocated VM canaries to our monitoring.
We intend to write a follow up post discussing the different strategies for decreasing downtime. It won’t be something we can implement in the short term for all our hosts. However, now that all of our servers are running the same operating system, and one that is more modern, this should help us a lot with maintenance and long term goals.