• Scheduled maintenance for billing system

    Sun, 14 Apr 2019 16:30:00 -0700 - Alan Post

    Update: Our billing system has been upgraded and the maintenance window has closed.

    For up to two hours starting Sat Apr 20 2019 02:00 UTC we will take our billing system offline for a software upgrade.

  • Distributions updated: various rebuilds, new rescue image, and new installable Arch Linux image

    Sun, 14 Apr 2019 11:00:00 -0700 - Chris Brannon

    We made several updates and one addition to our distribution images and netboot installers:

    • We now provide an installable Arch Linux image, so customers can order or reinstall a VPS with Arch Linux.
    • Our rescue image is now based on Debian Stretch, rather than Debian Jessie.
    • The Arch Linux netboot installer was updated to 2019.04.01.
    • The Alpine Linux netboot installer was updated to the 3.9.2 release.
    • The docker variants of Ubuntu Bionic and CentOS 7 were rebuilt for CVE-2019-5736.
    • All of our Debian and Ubuntu images were rebuilt for CVE-2019-3462.
    • Debian Jessie images no longer contain jessie-updates in /etc/apt/sources.list, because the Debian FTP masters pulled it from Debian’s FTP site in March. Security updates are still provided, however.

    These distribution images and netboot installers are available from the management console of any Prgmr.com VPS.

  • Network maintenance window this weekend - no downtime expected

    Tue, 26 Mar 2019 14:00:00 -0700 - Sarah Newman

    This Saturday, 2019-03-30, from 3:00 UTC to 6:00 UTC, and Sunday, 2019-03-31, from 3:00 UTC to 6:00 UTC, we’ll deploy network changes to bring up an additional upstream connection. We don’t expect any downtime related to these changes.

  • 2019 Datacenter Move

    Tue, 26 Mar 2019 12:50:00 -0700 - Sarah Newman

    Over the course of January and February we moved our equipment from one datacenter to another, both in Santa Clara, CA. We made several incremental improvements to our infrastructure - as much as we thought we could reasonably manage in the time we had, without causing a serious schedule risk. We started about 2 weeks late compared to our original target and finished about 1.5 weeks late.

    First I will describe the improvements made and then I will describe the sequence of events for the move.

    Improvements

    Network

    Prgmr.com’s founder Luke Crawford said he tried to keep things simple enough that he could debug them at 3am. We still agree with that overall, but we also would like to minimize how much debugging has to happen at 3am.

    At the previous data center, the networking was as simple as possible, which is to say that there was no redundancy. During this move, we added what we believe to be the simplest possible redundancy that would not require changing how all of our virtual machine networking works:

    • There are now redundant routers with independent BGP sessions and failover for gateway IPs.
    • By the end of the week there will be multiple upstream providers each coming into a different switch. Each router has a direct link to each of those switches.
    • Each virtual machine host and router has a bonded link in an active-backup configuration, so that if one switch goes down the connection should automatically fail over to the other switch.
    • The switches have redundant links to each other such that it should be possible for us to shut down any one switch without losing connectivity for our virtual machine hosts.

    We also made the following improvements:

    • We added a storage network for our virtual machine hosts that also uses active-backup link bonding. At the previous data center, we had a storage network but in a non-redundant configuration.
    • Internally, everything except for the management interfaces is now physically 10Gbps.

    We made minimal changes to our upstream networking because there was a limit to how many changes we felt able to handle at once.

    Power

    At the former data center, we had only a single feed into each rack because redundant power is generally very expensive compared to the increase in uptime. But that meant that if there were any problems with any of the PDUs, it would be impossible to switch out the PDU without powering down the entire rack.

    We previously had vertical Avocent PDUs because they were supposed to have better logging capabilities, but every single one of them ended up with a broken management interface, such that we could no longer remotely read power or control individual outlets. We switched to horizontal PDUs to compensate, but the rack layout ended up being extremely poor and difficult to work with.

    At the last data center our equipment initially had only single power supplies, though at some point we started buying servers with redundant power supplies.

    At the current data center we opted for redundant PDUs (not Avocents) plugged into independent power feeds. Right now we only have vertical PDUs. These PDUs are mounted on opposite sides of the rack near the back, such that we should be able to unplug and replace one without powering off the rack. With the vertical PDUs it is also easier than with horizontal PDUs to visually inspect the power cables and verify they are plugged in all the way.

    All of the equipment required for our virtual machine hosts to operate, even the network equipment, has redundant power supplies. Each power supply is plugged into a different PDU.

    Servers

    Before Q4 2016 we were still hand-provisioning orders. In Q4 2016 we started provisioning using a cluster management system called Ganeti. As part of this data center move we decommissioned the last virtual machine hosts that were not managed by Ganeti. Decommissioning all the legacy systems has removed some operational burden for us and should free up more time to work on other things.

    All paying customers are now running on local RAID10 arrays of SSDs (unless they have made other arrangements) and on Ivy Bridge or newer Intel CPUs. As noted elsewhere, these servers also have a redundant link for public networking and a redundant link for storage/backend networking, which will give us more flexibility going forward in adding new features.

    Move timeline

    In December it took a few weeks simply to determine what network and power equipment we wanted and to purchase it. We bought at least one spare of each piece of equipment at the same time we purchased the ones intended to go into production. We also had to plan out the racks adequately in advance to be able to buy the right lengths of network and power cables. Locally, only RJ45 network cables and basic power cables are easy to find, so we couldn’t just run to Fry’s if we didn’t have the right parts.

    For transit, our second transit provider at the former data center didn’t offer service at the new data center, and we were only on a month-to-month contract, so we decided to look for a different transit provider.

    Having gathered transit quotes, we were very concerned about being able to start moving at the time we wanted to, which was at the beginning of January. Most of the transit providers were quoting us a minimum of 30 days to bring up service.

    Fortunately HE.net, with whom we have a long-term contract, was able to turn up transit at the new data center within a couple of days. We were very impressed with how fast HE.net was able to take care of this. Additionally, although we didn’t end up needing it, they allowed us to advertise networks smaller than /24 in case we needed that for traffic management during the data center move.

    For splitting our network between data centers, we set up an extra router at the original data center so that we could set up tunneling on it without modifying our primary router. At the new data center, both routers had tunnels for routed traffic and one router had tunnels for bridged traffic. For unencrypted traffic we used L2TPv3, and for encrypted traffic we used either OpenVPN (slow, no MTU issues) or a GRE bridge over IPsec (faster, but we had to be careful about MTU).

    Equipment purchases and planning took more time than we expected, and so did the new network configuration, which cut into the time we had budgeted for moving VPSes. We had originally thought we might migrate each VPS individually between data centers, but migrating individual VPSes would take longer than moving whole host servers, and by the time we had everything ready to host at the new data center there wasn’t enough time left to do that. There was a limit to the amount of complexity we could handle during the move, so we decided that most existing systems would have their drives moved.

    We still individually migrated VPSes off of one host server because we wanted to change its storage layout. We also consolidated off of our legacy host servers by copying individual VPSes: we used lvsync to copy a snapshot of the data in advance and then shut down each VPS only long enough to copy the changes made since the snapshot.

    For moving the drives of host servers, we would set up a chassis at the new data center. Before shutting down the server at the old data center, we would run an Ansible playbook that configured the new bonded network interfaces and udev rules for the target hardware. After the drives were moved to the new chassis, the system was booted off the network so that there were no potential issues from trying to boot using a different hard drive layout. The GRUB bootloader can be extremely finicky and in the past, before switching to network booting, we sometimes had servers down for several additional hours while we tried to resolve issues with it.

    After the old chassis was vacated, we would standardize the internal layout and upgrade the add-on card firmware, BIOS, and BMC firmware before racking the chassis at the new data center. At some point we experimented with disabling the BMC by changing a jumper on the motherboard. But we found out that disabling the BMC also disabled the on-board serial port, and we require serial console access for our systems.

    Overall this went smoothly. There was one server that had drive issues when we first booted it, so we decided to switch it to a different chassis that had already been prepped for that night. We found the reason for the drive issues was that we had replaced a card after performing the firmware upgrade on that chassis rather than before, so one of the storage Host Bus Adapters had a very old firmware version. That was easily fixed.

    While our equipment was split between the two sites, there was a lot of extra traffic: packets arriving at the original data center had to be sent on to the new data center, and responses were sent back to the original data center before being routed back out to the general internet. To remove the return traffic destined for the outside world, we duplicated the gateway IPs in both data centers and blocked ARP queries for those IPs between sites. This mostly worked. However, gratuitous ARPs sent when the new gateway IPs came online were not initially blocked, so traffic that could have stayed entirely at one site went through a gateway at the other site, which mostly saturated our network. We were able to fix it without major impact.

    Turning up BGP at the new data center and turning it down at the old one went without incident. Once we removed the gateway IPs and inbound traffic from the old router, we examined the remaining traffic. We determined that one customer had used a working but non-canonical IPv4 address for the old router as a gateway, and a few other people had used the link-local IPv6 address of the router as their IPv6 gateway instead of the global IP address. We notified those customers and all but one of them appeared to have fixed the issue before we turned off the old router.

    There were three unplanned service interruptions during this process. Two incidents, one with IPv6 router advertisements and another with taking down our own virtual machines, have already been described on this blog.

    There was one more incident not already documented, in which a single host server was powered off by accident. This server had dual power supplies, both plugged into adjacent outlets on a horizontal PDU using locking power cables. The cables were not plugged in all the way, and someone pressed down on them while removing another set of cables. We found that locking power cables not being plugged in all the way was a pervasive problem at the former data center, because it takes a lot more force to seat them and not everyone was aware of this. This problem has been fixed at the current data center.

    There are a few lingering issues to address. These include:

    • Updating documentation
    • Integrating the application of downtime credits directly into the playbooks that bring up hosts
    • Updating monitoring to alert on loss of redundancy rather than just total outages
    • Bringing up our second transit provider
    • Bringing up a redundant out-of-band console

    Overall we believe this data center move was a net positive and that we’ll be able to continue with improvements in 2019.

  • Early March 2019 Security Updates

    Wed, 06 Mar 2019 14:00:00 -0800 - Sarah Newman

    On March 5th the Xen Project announced 10 new security advisories. The previous Saturday we had live-patched the ones that applied to us.

    Of these, XSA-293 is probably the most interesting. Rather than a denial of service or privilege escalation between guests, it is an in-guest privilege escalation which applies to 64-bit paravirtualized guests only.

    The x86 architecture has different modes of operation it goes through during the boot process, somewhat similar to a developing organism that goes through stages resembling earlier evolutionary steps. There are 16-bit “real” mode, 32-bit “protected” mode, and 64-bit “long” mode.

    x86 also has the concept of memory segments that were originally used to access more memory than could be directly addressed, but later included information about permissions. Memory segments are selected using memory segment registers.

    The 286 had four 16-bit segment registers: the code segment (CS), data segment (DS), stack segment (SS), and the extra segment (ES) used for string operations. These memory segment registers can be used either implicitly or explicitly.

    The 386 architecture added the FS and GS segment registers, which are used only when explicitly referenced by an instruction. These FS/GS registers are typically used for thread-local storage.
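
    On x86-64 Linux, for example, the C toolchain reaches thread-local variables through the FS segment base: each thread’s copy of a __thread variable is addressed relative to FS, even though the program never names the register itself. Here is a minimal sketch of that; the file name and build command are only illustrative:

        /* tls_demo.c - thread-local storage, which on x86-64 Linux the
         * compiler and C library implement by addressing relative to the
         * FS segment base.  Build with: gcc -O2 -pthread tls_demo.c */
        #include <pthread.h>
        #include <stdio.h>

        static __thread int counter;        /* one copy per thread, reached via %fs */

        static void *worker(void *arg)
        {
            counter += (int)(long)arg;      /* touches only this thread's copy */
            printf("thread %ld: counter = %d at %p\n",
                   (long)arg, counter, (void *)&counter);
            return NULL;
        }

        int main(void)
        {
            pthread_t a, b;
            pthread_create(&a, NULL, worker, (void *)1L);
            pthread_create(&b, NULL, worker, (void *)2L);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            return 0;
        }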

    In real mode, the memory segment register contents are directly used to modify the address of a memory reference before it’s sent to hardware. In protected mode, the register contents are “segment selectors” that index into a table containing segment descriptors that have the permissions and size information. In long mode, the “base address” - the offset applied to a memory reference - is ignored for all the segment registers except for the FS and GS registers.

    While the FS/GS registers were originally 16 bits, on 64-bit processors there are “shadow” FS/GS base registers - fsbase and gsbase - which are 64 bits. They can be accessed using the model-specific register CPU instructions “wrmsr” and “rdmsr”. wrmsr and rdmsr are privileged instructions and can’t be used from user space. On Linux, user programs may instead perform the arch_prctl system call with ARCH_SET_FS or ARCH_SET_GS to set these registers.
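
    As a minimal sketch of that last path (x86-64 Linux only; glibc has no wrapper for arch_prctl, so the call goes through syscall(2); setting the GS base is harmless here because user space on x86-64 Linux normally leaves GS unused, and the value written is just the address of a local variable):

        /* gsbase_demo.c - set and read back gsbase via the kernel.
         * Build with: gcc -O2 gsbase_demo.c */
        #include <asm/prctl.h>      /* ARCH_SET_GS, ARCH_GET_GS */
        #include <sys/syscall.h>    /* SYS_arch_prctl */
        #include <unistd.h>
        #include <stdio.h>

        int main(void)
        {
            unsigned long scratch = 0;                  /* arbitrary user memory */

            if (syscall(SYS_arch_prctl, ARCH_SET_GS, (unsigned long)&scratch) != 0) {
                perror("arch_prctl(ARCH_SET_GS)");
                return 1;
            }

            unsigned long base = 0;
            syscall(SYS_arch_prctl, ARCH_GET_GS, &base); /* kernel reports gsbase */
            printf("gsbase is now 0x%lx\n", base);
            return 0;
        }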

    On newer processors, it’s possible for user space to set these registers directly by using instructions specifically added to read and write the FS/GS registers. This will be faster for the user process than performing a system call.

    The operating system must set a flag, FSGSBASE, in a CPU control register to allow these new instructions. The operating system should not set this bit without also saving and restoring these registers across context switches. Currently, Linux does not do this, though it looks as if support may be merged this year.

    The XSA is that for paravirtualized virtual machines, where Xen is responsible for managing the control registers on behalf of the guest, Xen always left the FSGSBASE bit on. It did not report the bit as set when the guest asked for the control register state, and it ignored guest writes to it. This meant that user-space processes could write to the FS/GS base registers directly, even though the guest kernel running in the virtual machine was probably not saving and restoring that register state across context switches. The fix is to start honoring and managing this FSGSBASE setting for each virtual machine.
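
    To make the consequence concrete, here is a sketch of how an unprivileged guest process could probe whether the CPU will execute one of the new instructions directly. On a machine where FSGSBASE has not been enabled in the control register, RDGSBASE raises an undefined-instruction fault, which Linux delivers as SIGILL; under the pre-fix Xen PV behavior described above the instruction would instead succeed. This only illustrates the probe, not an exploit, and the file name and build flags are assumptions:

        /* fsgsbase_probe.c - check whether RDGSBASE executes from user space.
         * Build with: gcc -O2 -mfsgsbase fsgsbase_probe.c */
        #include <immintrin.h>      /* _readgsbase_u64() */
        #include <setjmp.h>
        #include <signal.h>
        #include <stdio.h>

        static sigjmp_buf env;

        static void on_sigill(int sig)
        {
            (void)sig;
            siglongjmp(env, 1);                 /* the instruction faulted */
        }

        int main(void)
        {
            signal(SIGILL, on_sigill);

            if (sigsetjmp(env, 1) == 0) {
                unsigned long long gsbase = _readgsbase_u64();   /* RDGSBASE */
                printf("RDGSBASE executed directly, gsbase = 0x%llx\n", gsbase);
            } else {
                printf("RDGSBASE raised SIGILL: FSGSBASE is not enabled here\n");
            }
            return 0;
        }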

    This vulnerability does not exist for hardware virtual machine (HVM) systems, as Xen is not responsible in the same way for managing their control register state.