Fri, 05 May 2017 06:32:00 -0700 - Alan Post
XSA-213 fixes a 64-bit PV guest breakout via pagetable use-after-mode-change. XSA-214 fixes a page transfer that allows a PV guest to elevate privileges. XSA-215 fixes a privilege escalation via failsafe callback. The Xen Project Blog describes this set of security patches, and references a Live Patching feature in Xen 4.7. Given the increased frequency of security vulnerabilities in Xen, we’ll be evaluating live patching. Like the prior security patches we applied, these bugs were discovered by Jann Horn from Google Project Zero.
Our average downtime this maintenance cycle was 35 minutes. The longest host server took 67 minutes, while the shortest took 15 minutes. This was less total downtime than previous maintenance windows, reflecting incremental improvements in our patching process. We also used this maintenance window to consolidate two servers with HDDs onto a newer one with SSDs. This is the same server we consolidated onto in our last maintenance window.
Mon, 17 Apr 2017 19:00:00 -0700 - Sarah Newman
We’ve added the OpenBSD 6.1 installer with boot.conf edited to use the serial console. It was possible to use the OpenBSD installer before, but extra steps were needed to get it onto a VPS and to set the installer to use the serial console. Please let us know if you encounter any issues.
Mon, 10 Apr 2017 08:00:00 -0700 - Alan Post
From Thursday March 30th through Monday April 3rd we performed unavoidable system maintenance to patch CVE-2017-7228 / XSA-212. XSA-212 fixes an insufficient check on XENMEM_exchange input which permitted PV guest kernels to write to hypervisor memory outside of the provided input/output arrays.
This bug was discovered by Jann Horn, and is explained in detail at Project Zero: Pandavirtualization: Exploiting the Xen hypervisor.
Given that XSA-212 meant unavoidable downtime, we used the opportunity to perform some cleanup work resulting from our February 7th PDU failure. As we stated in that retrospective, we purchased and brought on-site spare and replacement power equipment. On the Saturday of our maintenance window we replaced our half-broken PDU with a fully working spare. We also pulled the remaining dead power equipment.
Additionally, we consolidated two servers with HDDs onto a new one with SSDs. This new server has enough additional capacity to consolidate two other servers, which we’ll perform in a future maintenance window.
Mon, 03 Apr 2017 09:18:20 -0700 - Alan Post
On Tuesday April 4th at 05:00 UTC we’ll perform scheduled maintenance on our internal systems. The website (including the blog and billing), management console, and the support ticket tracking system will be unavailable for 20 minutes.
We’ll provide real-time updates in #prgmr on Freenode.
Tue, 14 Mar 2017 10:34:29 -0700 - Alan Post
UPDATE: we can definitely say the problem was due to a lack of firmware updates and that this exact problem should be resolved going forward. We have asked our hardware vendor for clarification on our respective responsibilities regarding firmware updates and suggested to Broadcom that they publish their firmware release notes online in an indexed web page in addition to PDF.
Another problem not originally mentioned was that we were not able to send email notifications due to a missing validation in our email scripts. The email notifications sent at the time were invalid due to an extra newline in the email headers. That has been fixed.
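As an illustration of the kind of validation described above, here is a minimal Python sketch that refuses to build a header block when any header value contains a carriage return or line feed. It is not our actual notification script: the function names `validate_header_value` and `build_header_block` and the sample subject line are hypothetical, chosen only for this example.

```python
def validate_header_value(name, value):
    """Return value unchanged, or raise if it contains a CR or LF.

    A stray newline inside a header value ends the header early and
    makes the resulting message invalid. (Hypothetical helper.)
    """
    if "\r" in value or "\n" in value:
        raise ValueError(f"header {name!r} contains a newline: {value!r}")
    return value


def build_header_block(headers):
    """Join (name, value) pairs into a CRLF-terminated header block.

    The block ends with one blank line, which separates the headers
    from the message body.
    """
    lines = [
        f"{name}: {validate_header_value(name, value)}"
        for name, value in headers
    ]
    return "\r\n".join(lines) + "\r\n\r\n"


if __name__ == "__main__":
    # A clean subject passes through.
    print(repr(build_header_block([("Subject", "SMART error on disk")])))

    # A subject with a trailing newline is rejected instead of being sent.
    try:
        build_header_block([("Subject", "SMART error on disk\n")])
    except ValueError as err:
        print("rejected:", err)
```

Raising an error here, rather than silently stripping the newline, makes the failure visible to whoever calls the script, so that a broken notification does not go unnoticed.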
On Wednesday March 8th at 20:11 -0800, we received an email from one of our dom0s stating that one of its disks had a SMART error. This kind of error is usually routine. We use RAID on all of our servers, and a hard drive failure indicates a maintenance task that needs to be performed, but is not by itself an emergency.
This event was unusual, however, in being on our newest machine. A hard drive failure here would indicate what is sometimes referred to as infant mortality. We logged in to the machine immediately. It became apparent that we had a problem much more serious than a failed hard drive. Five minutes later, by 20:17, four drives had failed out of our RAID array, which did not leave enough drives to operate the array.
We had caught this problem early enough that we were already on the way to the data center when the first page came in. Sarah had noticed that all four of the drives kicked from RAID were going through a Broadcom/LSI 9300-4i PCI SAS/SATA expander, rather than being direct-attached, and that no drives other than the ones on this adapter had failed.
There was a Xen warning at the time of the failure; however, after investigating, we believe this is a side effect and not the cause:
(XEN) Xen WARN at msi.c:510
(XEN) ----[ Xen-4.6.3-5.el6 x86_64 debug=n Not tainted ]----
[Wed Mar 8 20:03:49 2017](XEN) CPU:1
(XEN) RIP:e008:[<ffff82d08016b532>] unmask_msi_irq+0x22/0x30
(XEN) RFLAGS: 0000000000010046 CONTEXT: hypervisor
...
(XEN) Xen call trace:
(XEN)[<ffff82d08016b532>] unmask_msi_irq+0x22/0x30
(XEN)[<ffff82d08016ce39>] ack_maskable_msi_irq+0x9/0x40
(XEN)[<ffff82d08016e81e>] do_IRQ+0xbe/0x690
(XEN)[<ffff82d08022ea7f>] common_interrupt+0x5f/0x70
(XEN)[<ffff82d0801ad856>] mwait_idle+0x276/0x380
(XEN)[<ffff82d0801055fa>] do_vcpu_op+0x17a/0x670
(XEN)[<ffff82d08012b4c0>] __do_softirq+0x70/0xa0
(XEN)[<ffff82d0801623ae>] idle_loop+0x1e/0x50
The exact errors from the card are as follows:
mpt3sas0: fault_state(0x0d03)!
mpt3sas0: sending diag reset !!
mpt3sas0: diag reset: SUCCESS
mpt3sas0: _base_event_notification: timeout
mpt3sas0: _base_fault_reset_work: hard reset: failed
Recovering from this failure required force-reassembling the RAID. Having to perform this operation extended our downtime: we spent an unusual amount of time investigating and checking our work as we went. One potential issue we already know about is that we had a miscommunication with our hardware vendor over who was responsible for firmware updates, and the firmware was not up to date despite the hardware being newly purchased. We’re in contact with our hardware vendor and Broadcom about this failure. While we have not used this model of card before, we have used Broadcom for years. If we are unable to get a satisfactory answer from Broadcom on what the problem is, we’ll likely switch away from using the 9300-4i.