Over the course of January and February we moved our equipment from one data center to another, both in Santa Clara, CA. We made several incremental improvements to our infrastructure along the way: as many as we thought we could reasonably manage in the time we had without creating serious schedule risk. We started about 2 weeks late compared to our original target and finished about 1.5 weeks late.
First I will describe the improvements we made, and then the sequence of events of the move.
Prgmr.com’s founder Luke Crawford said he tried to keep things simple enough that he could debug them at 3am. We still agree with that overall, but we also would like to minimize how much debugging has to happen at 3am.
At the previous data center, the networking was as simple as possible, which is to say that there was no redundancy. During this move, we added what we believe to be the simplest possible redundancy that would not require changing how all of our virtual machine networking works:
- There are now redundant routers with independent BGP sessions and failover for gateway IPs.
- By the end of the week there will be multiple upstream providers each coming into a different switch. Each router has a direct link to each of those switches.
- Each virtual machine host and router has a bonded link in an active-backup configuration, so that if one switch goes down the connection should automatically fail over to the other switch.
- The switches have redundant links to each other such that it should be possible for us to shut down any one switch without losing connectivity for our virtual machine hosts.
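To illustrate the host side of this, an active-backup bond split across two switches might be configured with a Debian-style ifupdown stanza roughly like the one below. This is a generic sketch, not our actual configuration; the interface names, addresses, and timer value are placeholders:

```
# /etc/network/interfaces sketch: active-backup bond across two switches.
# eth0 uplinks to one switch, eth1 to the other; only one carries traffic
# at a time, and the kernel fails over if the active link loses carrier.
auto bond0
iface bond0 inet static
    bond-slaves eth0 eth1
    bond-mode active-backup
    bond-primary eth0
    bond-miimon 100        # poll link state every 100 ms
    address 192.0.2.10/24
    gateway 192.0.2.1
```

Because active-backup bonding needs no configuration on the switches themselves (unlike LACP), it works even when the two uplinks land on independent switches.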
We also made the following improvements:
- We added a storage network for our virtual machine hosts that also uses active-backup link bonding. At the previous data center, we had a storage network but in a non-redundant configuration.
- Internally, everything except the management interfaces is now physically 10Gbps.
We made minimal changes to our upstream networking because there was a limit to how many changes we felt able to handle at once.
At the former data center, we had only a single power feed into each rack, because redundant power is generally very expensive compared to the increase in uptime it buys. But that meant that if a PDU developed problems, it could not be swapped out without powering down the entire rack.
We previously had vertical Avocent PDUs because they were supposed to have better logging capabilities. But every single one of them ended up with a broken management interface, such that we could no longer remotely read power usage or control individual outlets. We switched to horizontal PDUs to compensate, but the resulting rack layout was extremely poor and difficult to work with.
At the last data center our equipment initially had only single power supplies, though at some point we started buying servers with redundant power supplies.
At the current data center we opted for redundant PDUs (not Avocents) plugged into independent power feeds. Right now we have only vertical PDUs. These are mounted on opposite sides of the rack near the back, so we should be able to unplug and replace one without powering off the rack. Vertical PDUs also make it much easier than horizontal ones to visually verify that power cables are plugged in all the way.
All of the equipment required for our virtual machine hosts to operate, including the network equipment, has redundant power supplies. Each power supply is plugged into a different PDU.
Before Q4 2016 we were still hand-provisioning orders. In Q4 2016 we started provisioning using a cluster management system called Ganeti. As part of this data center move we decommissioned the last virtual machine hosts that were not managed by Ganeti. Decommissioning all the legacy systems has removed some operational burden and should free up more time to work on other things.
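With Ganeti, bringing up a guest becomes a single cluster-level command instead of hand configuration on a particular host. A hypothetical example (the instance name, disk template, sizes, and OS definition here are illustrative, not our actual provisioning parameters):

```
# Create a new guest; Ganeti allocates it to a suitable node in the cluster.
# -t plain = local LVM-backed storage; -o names the OS definition to install.
gnt-instance add -t plain -o debootstrap+default \
    -s 20G -B memory=1024M \
    guest1.example.com
```

The cluster tracks where each instance lives, so later operations (migration, failover, decommissioning a node) are also single commands rather than manual work.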
All paying customers are now running on local RAID 10 arrays of SSDs (unless they have made other arrangements) and on Ivy Bridge or newer Intel CPUs. As noted above, these servers also have a redundant link for public networking and a redundant link for storage/backend networking, which will give us more flexibility going forward in adding new features.
In December it took a few weeks simply to determine what network and power equipment we wanted and to purchase it. We bought at least one spare for all of our equipment at the same time we purchased the ones intended to go into production. We also had to plan out the racks adequately in advance to be able to buy the right lengths of network and power cables. Locally, only RJ45 network cables and basic power cables are easy to find, so we couldn’t just run to Fry’s if we didn’t have the right things.
For transit, our second transit provider at the former data center didn’t offer service at the new data center, and we were on a month-to-month contract, so we decided to look for a different transit provider.
Having gathered transit quotes, we were very concerned about being able to start moving at the time we wanted to, which was at the beginning of January. Most of the transit providers were quoting us a minimum of 30 days to bring up service.
Fortunately HE.net, whom we are under long-term contract with, was able to turn up transit at the new data center within a couple of days. We were very impressed with how fast HE.net was able to take care of this. Additionally, although we didn’t end up needing it, they allowed us to advertise networks smaller than /24 in case we needed this for traffic management during the data center move.
For splitting our network between data centers, we set up an extra router at the original data center so that we could configure tunneling on it without modifying our primary router. At the new data center, both routers had tunnels for routed traffic and one router had tunnels for bridged traffic. For unencrypted traffic we used L2TPv3, and for encrypted traffic we used either OpenVPN (slow, but no MTU issues) or a GRE bridge over IPsec (faster, but we had to be careful about MTU).
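For reference, a static unencrypted L2TPv3 pseudowire of this kind can be built with iproute2 roughly as follows. The tunnel IDs and addresses are made up for illustration, and the same commands run on the far end with local/remote swapped:

```
# Static L2TPv3 tunnel over IP between the two sites. The session creates
# an l2tpeth0 device, which is then attached to the bridge that carries
# virtual machine traffic, extending the L2 segment across sites.
ip l2tp add tunnel tunnel_id 10 peer_tunnel_id 10 \
    encap ip local 198.51.100.1 remote 203.0.113.1
ip l2tp add session tunnel_id 10 session_id 1 peer_session_id 1
ip link set l2tpeth0 up
ip link set l2tpeth0 master br0   # join the VM bridge
```

Because the tunnel adds encapsulation overhead, full-size Ethernet frames no longer fit in a standard 1500-byte path MTU; this is the same class of MTU care the GRE-over-IPsec option required.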
Equipment purchases and planning took more time than we expected, and so did the new network configuration, which cut into the time we had budgeted for moving VPSes. We had originally thought we might migrate each VPS individually between data centers, but migrating individual VPSes would take longer than moving whole host servers. By the time we had everything ready at the new data center, there wasn’t enough time left to migrate each VPS individually. There was a limit to the amount of complexity we could handle during the move, so we decided that most existing systems would have their drives moved.
We still individually migrated VPSes off one host server because we wanted to change its storage layout, and consolidating off of our legacy host servers also happened by copying individual VPSes. For that consolidation we used lvsync to copy a snapshot of the data in advance, then shut down each VPS only long enough to copy the changes made since the snapshot.
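Setting lvsync’s exact invocation aside, the underlying pattern is: snapshot the logical volume, copy the bulk of the data while the guest keeps running, then shut the guest down briefly and transfer only the blocks that changed since the snapshot. A rough sketch of the idea with plain LVM tools (volume group, LV names, and hostname are hypothetical):

```
# Phase 1: copy bulk data from a snapshot while the VPS stays up.
lvcreate --snapshot --name guest1-snap --size 5G /dev/vg0/guest1
dd if=/dev/vg0/guest1-snap bs=4M | ssh newhost 'dd of=/dev/vg0/guest1 bs=4M'
# Phase 2: shut the guest down, then let lvsync compare the live LV
# against what was already copied and send only the changed blocks,
# keeping the downtime window short.
```

The downtime is proportional to how much data changed since the snapshot rather than to the total disk size, which is what makes individual migrations tolerable at all.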
For moving the drives of host servers, we would set up a chassis at the new data center. Before shutting down the server at the old data center, we would run an Ansible playbook that configured the new bonded network interfaces and udev rules for the target hardware. After the drives were moved to the new chassis, the system was booted off the network so that there were no potential issues from trying to boot using a different hard drive layout. The GRUB bootloader can be extremely finicky and in the past, before switching to network booting, we sometimes had servers down for several additional hours while we tried to resolve issues with it.
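The pre-move playbook step might look roughly like the following tasks. The module names are standard Ansible, but the template and file names are hypothetical:

```
# Sketch of pre-move host preparation (illustrative only).
- name: Install bonded network interface configuration for the target chassis
  template:
    src: interfaces-bonded.j2
    dest: /etc/network/interfaces

- name: Pin NIC names to the new chassis's MAC addresses via udev
  template:
    src: 70-persistent-net.rules.j2
    dest: /etc/udev/rules.d/70-persistent-net.rules
```

Running this before shutdown means the drives come up in the new chassis with the right interface names and bonding already in place, instead of needing console work at 3am.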
After the old chassis was vacated, we would standardize the internal layout and upgrade the add-on card firmware, BIOS, and BMC firmware before racking the chassis at the new data center. At some point we experimented with disabling the BMC by changing a jumper on the motherboard. But we found out that disabling the BMC also disabled the on-board serial port, and we require serial console access for our systems.
Overall this went smoothly. There was one server with drive issues when we first booted it, so we moved its drives to a different chassis that had already been prepped for that night. The cause of the disk issue turned out to be that we had replaced a card after performing the chassis firmware upgrade rather than before, so one of the storage host bus adapters was running a very old firmware version. That was easily fixed.
While our equipment was split between the two sites, there was a lot of extra traffic: packets arriving at the original data center had to be forwarded to the new data center, and responses were sent back to the original data center before being routed out to the general internet. To keep traffic destined for the outside world from hairpinning through the original data center, we duplicated the gateway IPs at both data centers and blocked ARP queries for those IPs between the sites. This mostly worked. Because gratuitous ARPs sent when the new gateway IPs came online weren’t initially blocked, some traffic that could have stayed within one site was instead flowing through a gateway at the other site, which mostly saturated our network. But we were able to fix it without major impact.
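The ARP blocking can be done with ebtables on the bridge port that faces the inter-site tunnel. Something like the following rules (the port name and gateway address are placeholders, and the second rule is what would also have caught the gratuitous ARPs) keeps each site resolving the gateway IP to its own local router:

```
# Drop ARP frames concerning the duplicated gateway IP (192.0.2.1 here)
# on the bridge port carrying the inter-site tunnel (l2tpeth0 here).
ebtables -A FORWARD -o l2tpeth0 -p ARP --arp-ip-dst 192.0.2.1 -j DROP
ebtables -A FORWARD -o l2tpeth0 -p ARP --arp-ip-src 192.0.2.1 -j DROP
```

With ARP for that address unable to cross the tunnel, hosts at each site only ever learn the MAC of their local gateway, even though the same IP exists at both sites.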
Turning up BGP at the new data center and turning it down at the old one went without incident. Once we removed the gateway IPs and inbound traffic from the old router, we examined the remaining traffic. We determined that one customer had used a working but non-canonical IPv4 address for the old router as a gateway, and a few other people had used the link-local IPv6 address of the router as their IPv6 gateway instead of the global IP address. We notified those customers and all but one of them appeared to have fixed the issue before we turned off the old router.
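One way to find stragglers like these is to watch for packets still addressed to the old router’s MAC after its gateway IPs have been removed; anything that shows up is a host using the router as a gateway by some address we didn’t migrate. For example (the MAC address is a placeholder):

```
# Show non-ARP frames still being sent to the old router's MAC
# for forwarding, revealing hosts that still use it as a gateway.
tcpdump -n -e -i eth0 'ether dst aa:bb:cc:dd:ee:ff and not arp'
```

The source addresses in that capture identify exactly which customers to notify before the old router is switched off.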
There were three unplanned service interruptions during this process. Two of the incidents, one involving IPv6 router advertisements and another in which we took down our own virtual machines, have already been described on this blog.
There was one more incident, not already documented, in which a single host server was powered off by accident. This server had dual power supplies, both plugged into adjacent outlets on a horizontal PDU using locking power cables. The cables were not plugged in all the way, and someone pressed down on them while removing another set of cables. We found that locking power cables not being fully seated was a pervasive problem at the former data center, because they take a lot more force to seat than ordinary cables and not everyone was aware of this. This problem has been fixed at the current data center.
There are a few lingering issues to address. These include:
- Updating documentation
- Integrating the application of downtime credits directly into the playbooks that bring up hosts
- Updating monitoring to alert on loss of redundancy rather than just total outages
- Bringing up our second transit provider
- Bringing up a redundant out-of-band console
Overall we believe this data center move was a net positive and that we’ll be able to continue with improvements in 2019.