UPDATE: we can definitely say the problem was due to a lack of firmware updates and that this exact problem should be resolved going forward. We have asked our hardware vendor for clarification on our respective responsibilities regarding firmware updates and suggested to Broadcom that they publish their firmware release notes online in an indexed web page in addition to PDF.
Another problem, not originally mentioned, was that we were not able to send valid email notifications because of a missing validation in our email scripts. The email notifications sent at the time were invalid due to an extra newline in the email headers. That has been fixed.
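As an illustration of the kind of validation that was missing, here is a minimal sketch in Python. It is not our actual script: the function names clean_header and build_alert, the addresses, and the choice of Python are all assumptions made for the example. The idea is to strip stray whitespace (including a trailing newline, which is easy to pick up from command output) from each header value, and to refuse any value that still contains an embedded newline.

```python
from email.message import EmailMessage


def clean_header(value: str) -> str:
    """Strip surrounding whitespace and reject embedded CR/LF.

    Hypothetical helper for illustration; not the original script.
    """
    stripped = value.strip()
    if "\r" in stripped or "\n" in stripped:
        raise ValueError(f"header value contains an embedded newline: {value!r}")
    return stripped


def build_alert(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Build a notification message with validated header values."""
    msg = EmailMessage()
    msg["From"] = clean_header(sender)
    msg["To"] = clean_header(recipient)
    msg["Subject"] = clean_header(subject)
    msg.set_content(body)
    return msg


if __name__ == "__main__":
    # A trailing newline, e.g. from shell command output, is stripped.
    alert = build_alert(
        "alerts@example.com",
        "ops@example.com",
        "SMART error on dom0\n",
        "One of the disks reported a SMART error.",
    )
    print(alert["Subject"])
```

Raising on an embedded newline, rather than silently removing it, means a malformed notification fails loudly where the message is built instead of being dropped somewhere along the delivery path.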
On Wednesday, March 8th at 20:11 -0800, we received an email from one of our dom0s stating that one of its disks had a SMART error. This kind of error is usually routine. We use RAID on all of our servers, and a hard drive failure indicates a maintenance task that needs to be performed, but is not by itself an emergency.
This event was unusual, however, in being on our newest machine. A hard drive failure here would indicate what is sometimes referred to as infant mortality. We logged in to the machine immediately. It became apparent that we had a problem much more serious than a failed hard drive. Five minutes later, by 20:17, four drives had failed out of our RAID array, which did not leave enough drives to operate the array.
We had caught this problem early enough that we were already on the way to the data center when the first page came in. Sarah had noticed that all four of the drives kicked from RAID were going through a Broadcom/LSI 9300-4i PCI SAS/SATA expander, rather than being direct-attached, and that no drives other than the ones on this adapter had failed.
There was a Xen warning at the time of the failure; however, after investigating, we believe this is a side effect and not the cause:
(XEN) Xen WARN at msi.c:510
(XEN) ----[ Xen-4.6.3-5.el6 x86_64 debug=n Not tainted ]----
[Wed Mar 8 20:03:49 2017](XEN) CPU:1
(XEN) RIP:e008:[<ffff82d08016b532>] unmask_msi_irq+0x22/0x30
(XEN) RFLAGS: 0000000000010046 CONTEXT: hypervisor
...
(XEN) Xen call trace:
(XEN)[<ffff82d08016b532>] unmask_msi_irq+0x22/0x30
(XEN)[<ffff82d08016ce39>] ack_maskable_msi_irq+0x9/0x40
(XEN)[<ffff82d08016e81e>] do_IRQ+0xbe/0x690
(XEN)[<ffff82d08022ea7f>] common_interrupt+0x5f/0x70
(XEN)[<ffff82d0801ad856>] mwait_idle+0x276/0x380
(XEN)[<ffff82d0801055fa>] do_vcpu_op+0x17a/0x670
(XEN)[<ffff82d08012b4c0>] __do_softirq+0x70/0xa0
(XEN)[<ffff82d0801623ae>] idle_loop+0x1e/0x50
The exact errors from the card are as follows:
mpt3sas0: fault_state(0x0d03)!
mpt3sas0: sending diag reset !!
mpt3sas0: diag reset: SUCCESS
mpt3sas0: _base_event_notification: timeout
mpt3sas0: _base_fault_reset_work: hard reset: failed
Recovering from this failure required force-reassembling the RAID. Having to perform this operation extended our downtime: we spent an unusual amount of time investigating and checking our work as we went. One potential issue we already know about is that we had a miscommunication with our hardware vendor over who was responsible for firmware updates, and the firmware was not up to date despite the hardware being newly purchased. We’re in contact with our hardware vendor and Broadcom about this failure. While we have not used this model of card before, we have used Broadcom for years. If we are unable to get a satisfactory answer from Broadcom on what the problem is, we’ll likely switch away from using the 9300-4i.