• Early March 2019 Security Updates

    Wed, 06 Mar 2019 14:00:00 -0800 - Sarah Newman

    On March 5th, the Xen Project announced ten new security advisories. The previous Saturday we had live-patched the ones that applied to us.

    Of these, XSA-293 is probably the most interesting. Rather than a denial of service or a privilege escalation between guests, it is an in-guest privilege escalation that applies only to 64-bit paravirtualized guests.

    The x86 architecture has different modes of operation it goes through during the boot process, somewhat similar to a developing organism that goes through stages resembling earlier evolutionary steps. There’s 16-bit “real” mode, 32-bit “protected” mode, and 64-bit “long” mode.

    x86 also has the concept of memory segments that were originally used to access more memory than could be directly addressed, but later included information about permissions. Memory segments are selected using memory segment registers.

    The 8086 had four 16-bit segment registers: the code segment (CS), data segment (DS), stack segment (SS), and the extra segment (ES) used for string operations. These memory segment registers can be used either implicitly or explicitly.

    The 386 architecture added the FS and GS segment registers. The FS and GS segment registers are used only when explicitly referenced by an instruction. These FS/GS registers are typically used for thread-local storage.

    In real mode, the memory segment register contents are directly used to modify the address of a memory reference before it’s sent to hardware. In protected mode, the register contents are “segment selectors” that index into a table containing segment descriptors that have the permissions and size information. In long mode, the “base address” - the offset applied to a memory reference - is ignored for all the segment registers except for the FS and GS registers.
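
    In real mode the translation is simple enough to express directly: the 16-bit segment value is shifted left four bits and added to the 16-bit offset. A minimal sketch in C:

```c
/* Real-mode address translation: physical = (segment << 4) + offset.
 * Segments overlap every 16 bytes, giving a 20-bit (1 MiB) address space. */
unsigned int real_mode_addr(unsigned short segment, unsigned short offset) {
    return ((unsigned int)segment << 4) + (unsigned int)offset;
}
```

    For example, the reset vector F000:FFF0 translates to physical address 0xFFFF0.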

    While the FS/GS registers were originally 16 bits, on 64-bit processors there are “shadow” FS/GS base registers - fsbase and gsbase - which are 64 bits. They can be accessed using the model-specific register instructions “wrmsr” and “rdmsr”. wrmsr and rdmsr are privileged instructions and can’t be used from user space. On Linux, user programs may instead make the arch_prctl system call with ARCH_SET_FS or ARCH_SET_GS to set these registers.

    On newer processors, it’s possible for user space to set these registers directly using instructions added specifically for reading and writing the FS/GS base registers: rdfsbase, wrfsbase, rdgsbase, and wrgsbase. These are faster for the user process than performing a system call.

    The operating system must set the FSGSBASE flag in the CR4 control register to allow these new instructions. The operating system should not set this bit without also saving and restoring these registers across context switches. Currently, Linux does not do this, though it looks as if support may be merged this year.

    The XSA is that for paravirtualized virtual machines, where Xen is responsible for managing the control registers on the guest’s behalf, Xen always left the FSGSBASE bit on. It did not report the bit as on when the guest read the control register state, and it ignored guest writes to it. This meant that user-space processes could write to the FS/GS base registers directly even though the guest kernel running in the virtual machine was probably not saving and restoring that register state across context switches. The fix is to start honoring and managing the FSGSBASE setting for each virtual machine.

    This vulnerability does not exist for hardware virtualized (HVM) systems, as Xen is not responsible for managing their control register state in the same way.

  • February DOS and Service Interruption Post-Mortem

    Tue, 05 Mar 2019 17:20:00 -0800 - Paul Scott

    Incident on 2019-02-13

    On February 13, 2019 we suffered a service interruption during the transition between data centers. The direct effect on customers was minimal, but the interruption impacted resources such as our billing system and management console. The interruption lasted for 4 hours.

    Description

    We moved data centers during January and February. The last equipment we moved was our own, which hosted resources such as the website, the wiki, the blog, the management console, and the billing website.

    While we were split between data centers, we bridged some traffic between the two data centers. We have redundant routers but to avoid bridging loops, the tunnels for the bridges were only on one of the routers.

    A DOS attack interrupted the tunnels. Customer traffic resumed when the DOS was over, but the tunnels remained down. After half an hour of trying to debug the issue, we decided that since those services were scheduled to migrate the following night anyway, we would migrate them immediately so that the tunnels would no longer be necessary. Before powering down any equipment, we ran a playbook to reconfigure the networking on the servers being moved, because we were switching from a single network interface to bonded interfaces.

    After we moved our own servers, we discovered that one of the switch configurations was incomplete. We also learned that the networking reconfiguration had failed partway through on one of the servers before it was moved. Typically we do separate runs for each server, but this time we ran them together; at least one of them succeeded, which initially hid the failure.

    Fixing that server’s networking took longer than it should have. At that stage in the move we had not yet configured a network port that would let us plug in directly to our infrastructure, nor had we set ourselves up on the customer wifi at the new data center, and the guest wifi was very slow. Once that was resolved, we realized we had not finished setting up NTP access for these servers, and we wanted to do that before bringing up our own services.

    Resolution

    This was an unusual confluence of circumstances. In particular we are no longer using tunnels, so this sort of incident cannot happen again.

  • Scheduled maintenance for website, management console, billing, ticket system

    Tue, 12 Feb 2019 20:40:00 -0800 - Alan Post

    UPDATE: this maintenance was performed during an emergency maintenance window on Wednesday, February 13th, during an unrelated network outage. The maintenance scheduled here is no longer necessary and has been canceled.

    For up to three hours on Friday February 15th starting at 04:00 UTC we will take our website, management console, billing, ticketing, email and internal systems offline to move them to a new data center. This is the last of our scheduled maintenance before we’re fully migrated.

  • IPv6 problems and support outage

    Sun, 10 Feb 2019 21:02:00 -0800 - Sarah Newman

    Update 2019-02-11 23:30 UTC: We discovered that for non-standard installs with accept_ra enabled, the router advertisement we sent to invalidate autoconfigured IPv6 addresses also invalidated statically configured addresses identical to the autoconfigured ones. We are emailing everyone whom we believe to be affected based on traffic counters and current address reachability.


    At about 21:40 UTC on 2019-02-10, we accidentally enabled IPv6 stateless autoconfiguration on part of our network. This resulted in some hosts using the wrong IPv6 address, including our mail system, causing email to support to bounce. Our monitoring system didn’t alert us because the desired IPv6 address still worked; it just wasn’t being used for outbound connections.

    At 2:20 UTC, a customer wrote to a technical contact for prgmr.com (me) at an address that doesn’t go through our support system, at which point we discovered the issue. We would have noticed it by the following morning in any case, as daily emails are sent when our internal backups run.

    About an hour after being notified, we had resolved the issue as best we could. Hosts that were originally affected but also had IPv6 enabled should be OK. Systems that did not have an IPv6 address by default, but didn’t disable IPv6 and didn’t disable router advertisements, will still have an IPv6 address for up to two hours afterward. The extra address can be removed manually.

    We looked up which emails to our support system bounced and have written their senders. We have also resent the emails we sent during that period, since the IPv6 address used was not white-listed for sending mail by SPF.

    IPv6 stateless autoconfiguration is a method for hosts to automatically assign themselves an IPv6 address; it is used in place of DHCP for IPv6. Three settings are important here:

    • The autonomous flag, which declares whether the advertised subnet can be used for stateless autoconfiguration at all.
    • The preferred lifetime setting, which is how many seconds the autoconfigured address should be preferred for new connections.
    • The valid lifetime setting, which is how long the address can be used.
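
    These three settings are carried in the Prefix Information option of a router advertisement (RFC 4861, section 4.6.2). Its wire layout, sketched as a C struct:

```c
#include <stdint.h>

/* ICMPv6 Prefix Information option (RFC 4861, section 4.6.2).
 * All multi-byte fields are in network byte order on the wire. */
struct prefix_info {
    uint8_t  type;               /* always 3 */
    uint8_t  length;             /* always 4, in units of 8 octets */
    uint8_t  prefix_len;         /* significant bits of the prefix */
    uint8_t  flags;              /* 0x80 = on-link (L), 0x40 = autonomous (A) */
    uint32_t valid_lifetime;     /* seconds the address remains valid */
    uint32_t preferred_lifetime; /* seconds the address remains preferred */
    uint32_t reserved;
    uint8_t  prefix[16];         /* the advertised IPv6 prefix */
};
```

    The autonomous flag is the A bit in the flags byte; the two lifetimes are 32-bit second counts.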

    We were moving these router advertisements between routers. On the old router, we deleted the autonomous-flag false setting but didn’t delete the router advertisement entirely until about an hour later. During that time the router was advertising that autonomously configuring an IP address was OK, with a very long “preferred lifetime” of 604800 seconds (7 days) and a “valid lifetime” of 30 days, presumably the default settings. Addresses automatically configured using these settings would not go away after the autonomous flag was turned off until those lifetimes expired.

    To mitigate the issue, we temporarily re-enabled the autonomous flag and set the preferred and valid lifetimes for the address to 10 seconds. As expected, the preferred lifetime expired after 10 seconds. But for reasons explained in RFC 4862, the valid lifetime was reset to 2 hours rather than 10 seconds.
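
    The rule in question is in RFC 4862, section 5.5.3: a host will let an unauthenticated router advertisement lengthen an address’s valid lifetime freely, but will not let it shorten the lifetime below two hours, so that a single spoofed advertisement cannot invalidate existing addresses. A sketch of the update rule:

```c
#define TWO_HOURS 7200u   /* seconds */

/* RFC 4862 section 5.5.3(e): compute the new valid lifetime of an
 * autoconfigured address when an unauthenticated router advertisement
 * arrives.  remaining = seconds left on the address now; advertised =
 * valid lifetime carried in the advertisement.  Returns seconds. */
unsigned int new_valid_lifetime(unsigned int remaining, unsigned int advertised) {
    if (advertised > TWO_HOURS || advertised > remaining)
        return advertised;    /* lengthening is always accepted */
    if (remaining <= TWO_HOURS)
        return remaining;     /* short advertised lifetime is ignored */
    return TWO_HOURS;         /* otherwise clamp to two hours */
}
```

    With roughly 30 days remaining and an advertised lifetime of 10 seconds, this clamps to 7200 seconds, which matches the two hours we observed.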

    After the preferred lifetime expires, if there’s a static IPv6 address, the static IPv6 should be used instead. However, systems without another IPv6 address would continue to use this address until the valid lifetime expires.

    To detect this problem in the future, we could set up a canary system and alert if it starts responding on a given IPv6 address. We may also at some point switch to sending IPv6 router advertisements on a per-host basis, in which case this sort of issue would no longer be relevant.

  • Distributions updated: Fedora 29 and Arch Linux netboot installer

    Tue, 13 Nov 2018 10:00:00 -0800 - Chris Brannon

    We made two new additions to our distribution images and netboot installers:

    • Fedora 29
    • Arch Linux 2018.11.01 (netboot installer only, HVM only)

    Several years ago, we tried adding an Arch Linux netboot installer. At the time it started sshd on boot and allowed root login without a password. A test VPS running the installer was compromised and started sending spam. After this incident we removed the Arch netboot installer.

    Arch has since fixed their install media. It does not start sshd by default at boot, and if sshd is started manually while the installer is booted, root login is not allowed until root sets a password.

    We really appreciate Arch Linux for taking these steps. In addition to the netboot installer, a pre-built image that can be automatically installed is also in the works. We’ll post an update when it is done.

    We are not deploying these to our legacy systems, so if you are on a legacy system (one whose management console lacks the “install new OS image” option) please write support@ to have your VPS moved.