Yesterday the Xen Project publicly released eight Xen Security Advisories. Most of these were found by the Xen Security Team or key contributors during a pre-release security assessment. All of them were patched on all of our running systems before the embargo lifted.

  • XSA-191/CVE-2016-9386 An in-guest privilege escalation. We had some vulnerable systems.
  • XSA-192/CVE-2016-9382 A theoretical in-guest privilege escalation or denial of service. No customer systems were vulnerable to the privilege escalation, because we no longer run customers on AMD hardware.
  • XSA-193/CVE-2016-9385 A system-wide denial of service. We had some vulnerable systems.
  • XSA-194/CVE-2016-9384 An information leak of host memory. We were not affected because we don’t run this version of Xen.
  • XSA-195/CVE-2016-9383 Unprivileged processes within 64-bit systems could modify arbitrary memory. This was the primary reason we patched our systems. All of our systems were affected.
  • XSA-196/CVE-2016-9387,CVE-2016-9388 An in-guest denial of service. No customer systems were affected.
  • XSA-197/CVE-2016-9381 Arbitrary code execution within QEMU. We mitigate this issue by running QEMU within a device model stub domain (see the config sketch after this list).
  • XSA-198/CVE-2016-9379,CVE-2016-9380 Leak and removal of arbitrary files on the host file system. We are not vulnerable because we do not use pygrub.

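For those running their own Xen hosts: a device model stub domain runs the QEMU device model for an HVM guest in its own small domain instead of in dom0, so code execution inside QEMU does not directly translate into control of the host. With the xl toolstack this is a single setting in the guest configuration. The fragment below is only a minimal sketch: the name and memory values are placeholders, and depending on your Xen version the stub domain may also require the traditional QEMU device model.

    # Minimal sketch of an HVM guest config; only the last line is the
    # relevant setting (documented in xl.cfg). Name and memory are placeholders.
    name    = "example-hvm-guest"
    builder = "hvm"
    memory  = 1024
    # Run the QEMU device model in its own stub domain rather than in dom0.
    device_model_stubdomain_override = 1
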
We hope that the live patching functionality (similar to ksplice for Linux) will be production-ready in the upcoming 4.8 release. Apparently all of the above XSAs could have been applied as live patches, which would have avoided these reboots.

All downtime notices were sent at least one week in advance. Of all the reboots, one window ran 2 minutes over; every other server completed within its scheduled maintenance window. After we added some automation following the first night, the average time to complete, measured from the start of the window, was 50 minutes, and the average actual downtime was 35 minutes.

We have some ideas for how to decrease the amount of downtime without a large infrastructure investment. The simplest method is to be less cautious about how soon we take down services once we begin the upgrade process.

We had only one request for the downtime window to be moved. We were able to avoid downtime completely for this customer because we had already brought them up on networked storage (which lets us move their services to another host without a shutdown) due to a different XSA. At this time we don’t have the spare hardware to do this for all customers, but there’s a pretty good chance we can do it on request for any customer using our latest management system.

Overall these upgrades went smoothly, but there were a few problems:

  • Some 32-bit domains wouldn’t come up after reboot even though technically there was enough RAM. We’ve come up with a workaround to keep this from happening again. The full problem description and our workaround will be in a separate blog post.

  • Shutting down services was taking far too long. We traced this to a backlog in xenstored caused by shutting down a large number of services in parallel. Our fix was twofold: wait briefly between shutting down each individual service (see the sketch after this list), and cap the size of /var/lib/xenstored to limit how large the backlog could get. This took us from waiting over 40 minutes and giving up (in two cases) to shutting everything down within about 10 minutes.

  • One server hadn’t been interactively rebooted since installation, and it took us a while to realize that its serial console settings were incorrect.

  • On one machine, four services were left down for an extended period because a post-update procedure was not followed. We addressed this by moving the majority of the post-update process into an Ansible playbook to make it easier to follow.

  • The clock was way off on one machine because the post-update procedure was not followed. This was also fixed by the same Ansible playbook.
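
As a footnote to the xenstored backlog item above, the “wait between shutdowns” half of the fix amounts to something like the sketch below. This is only an illustration of the idea, assuming the xl toolstack; the helper names and the five-second pause are ours for this example, not the exact values or tooling we use in production.

    #!/usr/bin/env python
    """Sketch: stagger guest shutdowns so xenstored is not flooded with
    teardown traffic from many domains at once."""
    import subprocess
    import time

    PAUSE_SECONDS = 5  # illustrative value; tune for your hardware

    def running_guests():
        """Return the names of running guests, skipping the header row and dom0."""
        output = subprocess.check_output(["xl", "list"]).decode()
        names = [line.split()[0] for line in output.splitlines()[1:]]
        return [name for name in names if name != "Domain-0"]

    def staggered_shutdown():
        for name in running_guests():
            # "xl shutdown" only requests a clean shutdown; the pause spaces
            # out the resulting xenstore activity instead of queueing it all at once.
            subprocess.call(["xl", "shutdown", name])
            time.sleep(PAUSE_SECONDS)

    if __name__ == "__main__":
        staggered_shutdown()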