We are patched against, or otherwise not vulnerable to, the following advisories released in the last few months:
- Xen Security Advisories 296 through 311 (the latest was disclosed on 2019-12-11)
- CVE-2018-12207 Improper invalidation for page table updates
- CVE-2019-14607 Unexpected Page Fault In Virtualized Environment
All privilege escalations were patched within 4 days of public disclosure. Normally we patch privilege escalations more quickly. In this case we judged 4 days to be an acceptable delay, both because an exploit would be difficult in practice and because the delay allowed us to reduce our future exposure to certain types of bugs, as we describe below.
Almost 2 months ago we were notified of several Xen security advisories:
- XSA-296 VCPUOP_initialise DoS
- XSA-298 missing descriptor table limit checking in x86 PV emulation
- XSA-299 Issues with restartable PV type change operations
XSA-299 was not easily live-patched, so we decided to live-migrate virtual machines to patched host servers to address these issues.
On Saturday Oct 26 06:10:13 UTC 2019, one of these servers crashed with a Xen backtrace. We decided to continue moving virtual machines that night, but without using live migration: judging by how often various functionality in Xen is used, we thought the bug was most likely related to live migration.
Within the next 24 hours we attempted to reproduce the bug on our test systems. We had a hypothesis that it was related to NetBSD. Our only reason for this was XSA-280, which we had found as soon as a NetBSD 5.2 host was booted. NetBSD uses virtual memory differently than Linux does, so it exercises parts of the Xen code that Linux does not. NetBSD is also a less common operating system than Linux.
We found a bug with NetBSD guests that crashed the host, and reported it to the Xen security team. We very quickly got back a patch that fixed the immediate issue, which resulted in version 3 of XSA-299.
However, further testing led to at least two additional distinct crashes. Our ability to reproduce them resulted in XSA-309, as at least one of the crashes had been publicly reported but never reproduced. While we again received proposed fixes extremely quickly, we did not have time to test them adequately in advance of public disclosure.
Because of the two additional types of crashes we had found, we decided to delay patching the remainder of our systems until after the embargo had lifted for the above XSAs. This allowed us to disable the feature that lets NetBSD run directly on Xen. Disabling this feature is not allowed under embargo because it can be detected by a virtual machine. Our NetBSD guests are now running in a “shim”, which is a virtualized version of Xen.
During the last night we were live migrating guests, we had another crash. We then realized that the original bug we had found was not exactly the same, though it happened during the same operation. There was nothing in the backtrace tying it directly to live migration; it was a bug that happened when the virtual machine was shut down.
It took us a long time to reproduce this bug. Starting and shutting down virtual machines tens of thousands of times did not do it. Live migrating any old virtual machine hundreds of times did not do it. What did it was migrating a 32-bit Linux virtual machine with more than 4GiB RAM about 60 or more times. The live migration failed for whatever reason, and something about shutting down the virtual machine before it had finished migrating triggered the bug. Once we were able to reproduce it reliably, we found that some of the patches we had received in response to the second round of NetBSD testing actually fixed this bug. Eventually this resulted in XSA-310.
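The repeat-until-failure approach described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual test harness: `first_failure` and `flaky_migration` are hypothetical names, and the simulated failure on attempt 60 is a stand-in for a real step such as one live migration of a test guest.

```python
from typing import Callable, Optional


def first_failure(operation: Callable[[int], bool], max_attempts: int) -> Optional[int]:
    """Run `operation` up to `max_attempts` times.

    Returns the 1-based attempt number on which the operation first
    reports failure, or None if every attempt succeeds.
    """
    for attempt in range(1, max_attempts + 1):
        if not operation(attempt):
            return attempt
    return None


def flaky_migration(attempt: int) -> bool:
    # Hypothetical stand-in: a real harness would perform one live migration
    # here and return False if it failed. The failure on attempt 60 is
    # simulated purely for illustration.
    return attempt < 60


print(first_failure(flaky_migration, 100))  # 60
```

Stopping at the first failure rather than continuing the loop leaves the system in the state that triggered the problem, which is usually what you want when the goal is to capture a backtrace.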
Once we were confident both that we could reliably reproduce the bug and that we had a fix for it, we generated a livepatch for it along with a few other Denial of Service (DoS) fixes.
We tested reboots with these patches tens of thousands of times and performed a couple of thousand live migrations. We also tested the NetBSD-related fixes with a few thousand virtual machine reboots. We are now confident that we should see no further crashes from these bugs.