Incident on 2019-02-13
On February 13, 2019 we suffered a service interruption during the transition between data centers. The direct effect on customers was minimal, but the interruption impacted resources such as our billing system and management console. The interruption lasted for 4 hours.
We moved data centers during January and February. Our own equipment, which hosted resources such as the website, the wiki, the blog, the management console, and the billing website, was the last equipment we moved.
While we were split between data centers, we bridged some traffic between the two. We have redundant routers, but to avoid bridging loops the tunnels for the bridges were on only one of the routers.
A DoS attack interrupted the tunnels. Customer traffic resumed when the attack was over, but the tunnels remained down. We tried for half an hour to debug the issue. Since we were scheduled to migrate those services the following night anyway, we decided to do the migration immediately so that the tunnels were no longer necessary. Before powering down any equipment, we ran a playbook to reconfigure the networking on the servers being moved, because we were switching from a single network interface to bonded network interfaces.
After we moved our own servers, we discovered that one of the switch configurations was incomplete. We also learned that the playbook run that reconfigured networking had failed partway through on one of the servers before we moved it. Typically we do a separate run for each server, but this time we ran them together. At least one of the servers succeeded, which initially hid the failure.
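The masking effect described above, where one successful server hides another's failure in a combined run, can be illustrated with a minimal sketch. This is not the tooling we used. The HostResult type, the host names, and the error detail are hypothetical and exist only for illustration. The idea is to treat a combined run as successful only if every host in it succeeded, and to list the failed hosts explicitly.

```python
from dataclasses import dataclass


@dataclass
class HostResult:
    """Outcome of one host in a combined run (hypothetical structure)."""
    host: str
    ok: bool
    detail: str = ""


def failed_hosts(results):
    """Return the names of hosts whose run did not succeed."""
    return [r.host for r in results if not r.ok]


def run_succeeded(results):
    """A combined run counts as a success only if it is non-empty
    and every host in it succeeded."""
    return bool(results) and not failed_hosts(results)


# Hypothetical example data: one host succeeds, one fails partway through.
results = [
    HostResult("billing", ok=True),
    HostResult("console", ok=False, detail="bonded interface not configured"),
]

if not run_succeeded(results):
    print("failed hosts:", ", ".join(failed_hosts(results)))
```

Checking each host separately, rather than glancing at the overall output, is what separate per-server runs normally give us for free.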
Fixing that server’s networking took longer than it should have. At that stage in the move we had not configured a network port that would let us plug in directly to our infrastructure, nor had we gotten ourselves set up on the customer wifi at our new data center, and the guest wifi was very slow. Once that was resolved, we realized we had not finished setting up NTP access for these servers, and we wanted to do that before bringing up our own services.
This was an unusual confluence of circumstances. In particular, we are no longer using tunnels, so this sort of incident cannot happen again.