On Tuesday February 07, 2017 at 17:02 -0800 we experienced a Power Distribution Unit (PDU) failure, leading to loss of power in one of our racks. We do not know the root cause as we have not yet found any bad equipment other than some blown fuses, though there is one piece of equipment we haven’t been able to test yet and we have not opened up the old PDU to inspect the circuit boards for overheating.
We replaced the failed PDU and started bringing customers up at 19:48 and finished by 21:00. Complimentary colocation users were brought up Thursday 09 February, as we needed time to vet their equipment.
Lessons learned from this outage:
- While our blog remained up during the outage, the staging server we used to publish updates was down. It took until 18:55 to publish an update on the blog, by hand-editing the HTML to simulate a new post. We posted to our Twitter account, @prgmrcom, at 19:19. The only timely updates were provided on our Freenode IRC channel, #prgmr. Every operator needs to be able to post updates to our official status channels, particularly those that reside outside of prgmr.com.
- We need to have spare PDU(s) on-site. Our spare PDU was the only backup equipment not stored at the data center. We have purchased the equipment necessary to mount a spare PDU in our rack.
- Spare equipment needs to be preconfigured as much as practical and should be tested at least once a year. We had no procedure for testing our spare PDU, which requires 208v power.
- We need to train ourselves to discuss more clearly what we are doing in an emergency situation and receive positive confirmation from the other person in the data center when performing an operation together. We should have a written checklist/procedure in place for any equipment failure that results in a loss of service.
- Ideally we should provide options to customers who want a greater level of redundancy.
srn’s full report (with some edits from adp):
The power outage generated multiple alerts, and logging in to Nagios indicated a major outage such as a PDU failure. My initial instinct was to drive to the data center, which was about 20 minutes away from where I was. While I did this I asked Paul to cover support and to post to the blog. He was not able to post to the blog because the git server, which is part of that process, was hosted on a server that was down. It did not occur to me to email customers, which is usually the first thing we do when a single server is down.
Once we determined it looked like a power failure we needed to contact CoreSite (the company we lease our data center racks from). Their number was not in my phone book, so Alan had to look it up. I also had to detour to our office to pick up our spare PDU, which added at least another 20 minutes.
Once I arrived at the data center it did not take long to determine that the PDU was out. I also smelled burning electronics (apparently fuses don’t smell). The electrician from CoreSite determined that the breakers were on and that there was power to the plug. While we had a spreadsheet of what was plugged into the PDU, I decided to verify the labels on the cables, which took another 10 minutes. We unplugged that PDU and plugged in the spare PDU for the first time. We didn’t attempt to fix the old PDU because the controller on it was broken and we could not remotely make power readings, nor did it display its power usage.
At 17:31 we got our first support request, in this case from a colo customer with equipment in that rack. A minute later a VPS customer wrote about their VM. One customer wrote not only to ask about their VM, but to report they could not access the billing system. It’s regrettably common to find you cannot remember credentials to access a website during an emergency!
While the spare PDU was being racked up I posted to Twitter. We determined that the screw holes in the rack didn’t match up with the new PDU, so we had to jury rig mounting with velcro and zip ties. Next we tried plugging in the switch and found that half of the outlets didn’t work and that popping the breakers on the PDU had no effect. We hoped that once we could access the controller on the spare PDU we could get those outlets working, but we didn’t have a serial cable compatible with that PDU.
At 18:55 we simulated a blog post by hand-editing the HTML, bypassing the down publishing pipeline. At 19:19 we posted notice of the downtime to Twitter.
At 18:58 Alan reported on #prgmr that we were “carefully jamming cables in to our spare [PDU].” At 19:13 we started powering on a subset of servers and tried to bring up the customers on them, but the network was down. When we logged into the switch, everything looked fine, so we called Luke. He suggested unplugging all the cables that went into the core switch (the ones running LACP) and plugging them back in one at a time. Once we were able to access the cables, which took a bit of time, this worked. We brought up about 4 or 5 host servers starting at 19:48. Unfortunately the playbook we had for bringing up customers assumed that they had been brought down cleanly and it took me several tries to fix that playbook.
I somewhat panicked at this point because I realized I had done things in the wrong order - we should have powered on everything and then brought up customers, but I was very focused on customers being down and wanted to bring the services back up ASAP. Fortunately this did not cause issues later, but we decided at this point to power the remainder of the servers before bringing any other services back up.
To be able to plug everything in we needed a working serial cable for the PDU, as it had been left in a bad state, with critical plugs powered off and no working login credentials. We tried to build a cable but failed because we weren’t looking at the right documentation. I determined that we could plug in the remaining paying customers if we didn’t plug in redundant power supplies, which was not ideal but seemed like the best thing to do at the time. We were able to do this, except that one of the colocation customers put the bank into overload, so we had to switch them to a different rack. (Later in the week, on Wednesday, we realized that doing this could have, but did not, put that rack in overload. Still, too close a call.) By 21:00 all the customer services should have been up.
I felt very strongly about getting credits out in a timely fashion and so this is the next thing we did, except this playbook had not been updated for our new cluster management software and it took me a couple of hours to notice. So customers in our new management system did not get credits until much later. There was also another server that was listed in some places, but not all places, and so it didn’t get credits until even later in the night.
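One way to catch a server that is listed in some places but not in others is to cross-check every inventory against the union of all of them. The following is only a minimal sketch in Python: the source names (`billing`, `cluster_manager`, `credit_playbook`) and host names are hypothetical placeholders for illustration, not the actual systems or hosts involved.

```python
def missing_by_source(sources):
    """For each inventory source, list hosts known elsewhere but absent from it.

    `sources` maps a source name to the set of host names it knows about.
    Sources that already list every known host are omitted from the result.
    """
    all_hosts = set().union(*sources.values())
    gaps = {}
    for name, hosts in sources.items():
        missing = all_hosts - hosts
        if missing:
            gaps[name] = sorted(missing)
    return gaps


# Hypothetical inventories; every name here is invented for the example.
sources = {
    "billing": {"host1", "host2"},
    "cluster_manager": {"host1", "host2", "host3"},
    "credit_playbook": {"host1"},
}

gaps = missing_by_source(sources)
for name, missing in sorted(gaps.items()):
    print(f"{name} is missing: {', '.join(missing)}")
```

Running a check like this before issuing credits would flag any host that one list knows about and another does not, so the gap is found before customers notice it.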
Once that was complete, at home I went through our standard post-boot procedure to check for VPSes that did not come back cleanly. I emailed 11 customers, then I put together and sent the email received on Tuesday night. I also camped on support for a while and didn’t get to bed until probably 5:30 am. Feedback trickled in from impacted customers and is incorporated in the lessons learned above.
On Thursday 09 Feb we attended a previously planned meeting with our CoreSite account representative. We asked about on-site storage (for our spare PDU), A/B power, and getting a temporary 208v plug from an adjacent rack. We learned in this conversation that power data was now available from CoreSite, which was not the case when we originally signed our contract. Using an uninterruptible power supply (UPS) and an electric meter (a.k.a. a Kill A Watt) we verified that we had enough power to plug our complimentary colo users in. In surveying the racks adjacent to the one with the PDU failure we determined that a colocation server we had plugged in Tuesday might be putting one of the banks on that PDU over its rated amperage. We moved that server to a different bank on the same PDU. The server had redundant power supplies so this operation did not require powering the server off. Finally, we plugged in the redundant power supplies for all of our servers.
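The bank check described above comes down to adding up the measured draw of everything on a bank and comparing it with the bank's rating. Here is a rough sketch in Python. All of the numbers and device names are made up for illustration, and the 80% planning threshold is an assumption (a common rule of thumb for continuous loads), not a figure from the PDU documentation or the CoreSite contract.

```python
def check_bank(readings_amps, rated_amps, planning_fraction=0.8):
    """Compare the total measured draw on a PDU bank against its rating.

    readings_amps: device name -> measured current in amps.
    rated_amps: the bank's rated amperage.
    planning_fraction: assumed safety margin for continuous load.
    """
    total = sum(readings_amps.values())
    limit = rated_amps * planning_fraction
    return {
        "total_amps": total,
        "planning_limit_amps": limit,
        "over_planning_limit": total > limit,
        "over_rating": total > rated_amps,
    }


# Hypothetical readings for one bank; not real measurements.
readings = {"host-a": 3.5, "host-b": 4.0, "colo-x": 6.0}
result = check_bank(readings, rated_amps=16)

if result["over_rating"]:
    print("Bank is over its rated amperage: move a device now.")
elif result["over_planning_limit"]:
    print("Bank is within rating but over the planning limit.")
else:
    print("Bank has headroom.")
```

Doing this arithmetic before moving a server onto a bank, rather than after, is what would turn a "too close a call" into a planned change.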
On Friday 10 Feb I went in to a local value-added reseller, UNIXSurplus, who agreed to give us access to a workbench with 208v power. I purchased and fully tested and configured a new spare PDU. Purchasing similar equipment off eBay would not have allowed us to test it before attempting to put it in production. I also used the opportunity to verify whether the old PDU would power on after having fuses replaced (it did) and to learn more about the PDU we had just put into production, which UNIXSurplus had in stock. We were too tired to do more work Friday night.
On Saturday 11 Feb, we gained access to the serial console on the replacement PDU using the procedure tested on Friday. We had hoped this would allow us to power on the unpowered bank. Alas, we confirmed instead that the bank is indeed broken. This means we are only able to use half of our power in that rack, which is not ideal long term. All of our VPS servers in that rack are either on an automatic transfer switch (which has two inputs) or have redundant power supplies, so in theory we can replace that PDU without disrupting our VPS customers. We also bought mounting equipment to try to mount the spare PDU inside the rack door.
Future action items not otherwise captured:
- Buy more PDUs.
- Come up with a plan for replacing all of our current spine PDUs, as they are all broken in some way.
- Open up the old PDU to see if the electronics are bad, and if we can’t see anything do the same for the remaining equipment we have not yet tested from that rack.