Yesterday the Xen Project publicly released a couple of Xen Security Advisories. The most critical one for us was CVE-2016-6258/XSA-182, titled “x86: Privilege escalation in PV guests”. All systems hosting customer-administered systems were patched for this vulnerability at least 24 hours before the embargo lifted.
We sent downtime notices the day after first learning of this vulnerability. Notices were sent at least a week in advance of any downtime, and customers had the option of rescheduling the downtime to be earlier.
Overall these reboots went relatively well, despite our founder Luke being unable to work due to a recent hospital visit. None of the downtimes lasted longer than originally planned, but in a few cases we started late and so finished up to about half an hour past the scheduled end time.
That being said, there were still a few problems:
- We have one oddball server with two HBA (SCSI expansion) adapters rather than one. The server would not boot until the drives were rearranged within the chassis - either different bays or a different drive order fixed it. It did not look like a GRUB problem.
- One server ended up with an extremely slow disk after the reboot. Before temporarily kicking one of the drives, I tried to add an external bitmap and accidentally deadlocked the server by putting the bitmap on top of the RAID device it was for. The same command would have worked on our other servers, but not on this one, because its root partition is on a different device. I don’t think there’s a prevention strategy here other than adhering more closely to KISS.
- A playbook that shut down services was accidentally run on a server that had already been upgraded, leading to about 10 extra minutes of downtime for around 20 services. It may be possible to prevent this in the future with environment variables, but it depends on people’s individual workflows.
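As a rough illustration of the environment-variable idea from the last item, here is a minimal Python sketch of a guard that a shutdown script could call before touching a host. It is not what we run today: the variable name `MAINTENANCE_TARGET`, the function `confirm_target`, and the exception `WrongTargetError` are all made up for this example.

```python
import os


class WrongTargetError(RuntimeError):
    """Raised when the operator has not confirmed this host as the target."""


def confirm_target(host, environ=None, var="MAINTENANCE_TARGET"):
    """Return True only if the environment variable names exactly this host.

    `var` is a hypothetical variable name chosen for this sketch; the
    operator would export it in their shell before starting work on a host.
    """
    env = os.environ if environ is None else environ
    expected = env.get(var)
    if not expected:
        # Nothing was exported: refuse rather than guess.
        raise WrongTargetError(f"{var} is not set; refusing to touch {host}")
    if expected != host:
        # The shell is still pointed at a different (e.g. already upgraded) host.
        raise WrongTargetError(f"{var}={expected!r} does not match {host!r}")
    return True
```

The guard fails closed: if the variable is missing or names another host, nothing is shut down. Whether this helps in practice still depends on each person remembering to set the variable per host, which is the workflow caveat mentioned above.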
We also put additional hardware into production and gave a RAM upgrade to around half of the VMs. We don’t have concrete plans for upgrading the rest of the VMs yet but hope to do so within the next few months.