Archive of posts with tag 'operations'
Fri, 10 Feb 2017 09:55:00 -0800 - Sarah Newman
Today Xen posted a security advisory regarding an out-of-bounds memory access in QEMU’s emulation of Cirrus Logic video cards. This was originally disclosed publicly on the oss-sec mailing list, meaning there was no embargo period for this advisory. We saw this when it was originally posted to oss-sec. This advisory does not lead to a privilege escalation for us.
Prgmr.com has both paravirtualized (PV) and hardware virtual machine (HVM) guests. Only HVM uses QEMU. When we run QEMU, it runs within a device model stubdomain. A device model stubdomain is another virtual machine running at the same level of privilege as its associated guest.
Another way to avoid this vulnerability is to not provide an emulated VGA card. Xen has a “nographic” option but this only suppresses the output and does not suppress the card itself. We could be using “vga=none”, and will experiment with this option to mitigate future video driver vulnerabilities.
Fri, 20 Jan 2017 01:20:00 -0800 - Sarah Newman
UPDATE 2:12 -0800: Connectivity with HE.net was restored as of about 2:08.
UPDATE 1:50 -0800: Apparently this was due to scheduled maintenance, but we’re not aware of receiving any notifications of such maintenance. We will do whatever is required to get the right email addresses subscribed for HE.net’s notifications, assuming one was sent.
Our connection with HE.net is down. Customers with an IP owned by prgmr.com had a brief interruption of about 3 minutes while this information propagated, but customers with an IP address owned by HE.net are still down. I assume one of their routers went out because the BGP status under lg.he.net shows no connections at all.
All customers have been allocated an IP owned by prgmr.com. If you received an email about this within the last few months, have not switched, and need help switching please write support.
Wed, 23 Nov 2016 17:00:00 -0800 - Sarah Newman
Yesterday Xen did public releases of eight different Xen Security Advisories. Most of these were found by the Xen Security Team or key contributors during a pre-release security assessment. All of these were patched on all our running systems before the embargo lifted.
- XSA-191/CVE-2016-9386 An in-guest privilege escalation. We had some vulnerable systems.
- XSA-192/CVE-2016-9382 A theoretical in-guest privilege escalation or DOS. No customer systems were vulnerable to the privilege escalation, because we don’t run customers on AMD hardware any more.
- XSA-193/CVE-2016-9385 A system-wide denial of service. We had some vulnerable systems.
- XSA-194/CVE-2016-9384 An information leak of host memory. We were not affected because we don’t run this version of Xen.
- XSA-195/CVE-2016-9383 Unprivileged processes within 64-bit systems could modify arbitrary memory. This was the primary reason we patched our systems. All of our systems were affected.
- XSA-196/CVE-2016-9387,CVE-2016-9388 An in-guest denial of service. No customer systems were affected.
- XSA-197/CVE-2016-9381 Arbitrary code execution within QEMU. We mitigate this issue by running QEMU within a device model stub domain.
- XSA-198/CVE-2016-9379,CVE-2016-9380 Leak and removal of arbitrary file on the host file system. We are not vulnerable because we do not use pygrub.
We hope that the live patching functionality (similar to ksplice for Linux) will be production ready in the pending release 4.8. Apparently all the above XSAs could have been patched using this functionality.
All downtime notices were sent at least one week in advance. For all the reboots, one window ran 2 minutes over and all other servers completed within the scheduled maintenance window. After some additional automation made after the first night, the average time to complete from the start of the window was 50 minutes and the actual downtime average was 35 minutes.
We have some ideas for how to decrease the amount of downtime without a large infrastructure investment. The simplest method is to be less cautious about how soon we take down services once we begin the upgrade process.
We had only one request for the downtime window to be moved. We were able to avoid downtime completely for this customer because we had already brought them up on networked storage due to a different XSA. At this time we don’t have the spare hardware to do this for all customers, but there’s a pretty good chance we can do it on request for any customer using our latest management system.
Overall these upgrades went smoothly, but there were a few problems:
Some 32-bit domains wouldn’t come up after reboot even though technically there was enough RAM. We’ve come up with a workaround to keep this from happening again. The full problem description and our workaround will be in a separate blog post.
It was taking far too long to shut down services. We traced this to a backlog in xenstored from shutting down a large number of services in parallel. Our solution was first to wait a little while in between shutting down each individual service and then put a cap on the size of /var/lib/xenstored in order to limit the amount of backlog. This took us from waiting over 40 minutes and giving up (in two cases) to shutting everything down within about 10 minutes.
One server hadn’t been interactively rebooted since installation and it took a while before we realized that the serial console settings were incorrect.
On one machine, four services were left down for an extended amount of time because a post-update procedure was not followed. We solved this by putting the majority of the post-update process within an ansible playbook to make it easier to follow.
Time was way off on one machine because the post-update procedure was not followed. This was also fixed with the same ansible playbook.
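The staggered shutdown described above can be sketched in a few lines of Python. This is only an illustration of the idea of pausing between shutdown requests: the `stop_service` callback, the service names, and the delay value are all hypothetical, and in practice the callback would wrap whatever command actually stops a guest.

```python
import time
from typing import Callable, Iterable, List


def staggered_shutdown(
    services: Iterable[str],
    stop_service: Callable[[str], None],
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Stop services one at a time, pausing between each request.

    `stop_service` is a hypothetical callback supplied by the caller.
    The pause gives the backend time to drain before the next
    shutdown request is queued.
    """
    stopped = []
    for index, name in enumerate(services):
        if index > 0:
            sleep(delay_seconds)  # wait between shutdowns, not before the first
        stop_service(name)
        stopped.append(name)
    return stopped
```

The `sleep` parameter exists only so the pause can be replaced in a dry run or a test.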
Tue, 22 Nov 2016 04:00:00 -0800 - Sarah Newman
We are not vulnerable to any of the recently released Xen Security Advisories (XSAs.) The most critical of these was XSA-195.
Thu, 10 Nov 2016 14:28:00 -0800 - Sarah Newman
The root cause of the downtime on November 6th was a bug in HE.net’s router software. HE.net is one of our transit providers. When the bug occurs, network connectivity is lost in one direction between their routers, resulting in packet loss to some portions of the internet. Technically this was a partial outage, as many west coast routes (including my home ISP) were still working.
HE.net tells us that there will be a software update sometime after November 30th to fix this issue. Our services should not be affected during that update except for any users still on IP addresses not owned by prgmr.com. All customers using these IP addresses have already been notified of the issue and were given instructions on how to switch to prgmr.com IP addresses a couple of weeks ago.
The downtime on November 6th lasted much longer than we would have liked. Here is a breakdown of the timeline and some mitigating steps we’ve taken:
- At 00:52 -0700 I was first paged. For some reason I did not hear any of the four texts or phone calls. My landline was called but there is no phone in the room I was in. I am planning to add a landline to that room.
- At 01:06 -0700 the pages were passed to the secondary person on pager. This person did not have sufficient training to isolate this particular network issue. We have discussed the problem and also updated our pager documentation to handle this particular issue. There was also not a hard time limit on how long to spend diagnosing the issue before escalating to another staff member, which we have now added.
- At 01:33 -0700 I happened to wake up and notice I had been paged. Upon investigating, I figured out that the problem was likely a duplicate of the issue we were having with HE.net previously and decided that the best course of action was to shut down our connection with HE.net. But I had to look up how to do that with our software. This has been added to the pager notes. We’ve also signed up for an additional monitoring service that automatically generates traceroutes from a number of test sites which should cut down some on debugging time.
- 1 hour and 17 minutes after the first page, the problem was fully resolved.
Sun, 06 Nov 2016 01:28:00 -0800 - Sarah Newman
We started experiencing intermittent packet loss around 00:47 -0700 PDT. At 1:05 -0800 PST we brought down our peering session with HE.net and are running on our secondary connection only. We are trying to get our connection with HE.net debugged so that we can turn it back on again.
Fri, 28 Oct 2016 12:00:00 -0700 - Sarah Newman
Our Linux pre-generated images used for new installations, as well as the Linux rescue image, are no longer vulnerable to CVE-2016-5195 (otherwise known as Dirty COW.)
The following new install options have been made generally available:
- Beta pre-generated images for FreeBSD 10.3 and 11.0 are available for HVM systems. They are beta because there are some minor tweaks outstanding and we have not yet added IPv6 support, but they should be usable. FreeBSD can also be installed manually from option “6. set bootloader or rescue mode” of the management console. If you are running an older system and would like to install FreeBSD, please contact support.
- Ubuntu Yakkety 16.10 has been added. There are some known issues running in paravirtualized (PV) mode with multiple VCPUs. If you have an older system with multiple VCPUs and would like to install this, please contact us about migrating to HVM mode.
- Debian Jessie with SysVinit pre-installed instead of systemd has been added. While it’s possible to modify the existing image to add SysVinit, we’ve had reports that modifying /etc/inittab was not automatically handled and serial console access was lost.
- NetBSD 7 has been updated to 7.0.2.
- Fedora 24 is available generally and not just for newer systems.
As usual, the source can be found at github.
For existing installations, if you haven’t updated your kernel yet or applied a mitigation for CVE-2016-5195, we recommend you do so. There were some reports previously that CentOS 5 and CentOS 6 were not vulnerable, but this is only true of one of the two types of exploits. Here are links to more information for distributions we’ve at least partially supported:
Mon, 17 Oct 2016 08:35:00 -0700 - Sarah Newman
UPDATE 10:22 -0700 PDT: Our primary DNS server is back up. This was due to a network outage on another provider outside of our primary infrastructure.
Our primary DNS server is down; the two secondaries are still up. Customers will be unable to set their reverse DNS and paid orders will not be provisioned until this is fixed. We’re working on resolving the issue but don’t have an estimate of when that will be complete.
Fri, 09 Sep 2016 20:35:00 -0700 - Sarah Newman
UPDATE 02:37: Our BGP session with NTT is up.
UPDATE 01:22: Further information from HE.net was that they received a BGP cease message. It hasn’t been specified yet if that was related to our router reboot. We have added additional logging on our end that should show more details in the future. The problem occurring and clearing up did not correlate with our router reboot.
We have set up our side of the NTT link but NTT tells us they have not set up their end. We don’t have an ETA for when that will be in place. It is not clear though that having another upstream would have helped in this particular case.
There was a network outage starting about 20:25 PDT -0700. We don’t know the root cause except that it was likely on HE.net’s end.
This is on us: we had an NTT uplink provisioned but not set up, because we wanted to schedule potential downtime a decent amount in advance, and other sources of downtime, most notably the XSAs, kept using up our budget for the month. Tonight it is going to get set up regardless.
Thu, 08 Sep 2016 08:00:00 -0700 - Sarah Newman
Today Xen did public releases of four different Xen Security Advisories.
- CVE-2016-7154/XSA-188 affects only Xen 4.4. We do not have any customer systems running this version.
- CVE-2016-7094/XSA-187 is for HVM virtualization, which did not apply to any customers at the time, but has been patched. It also did not apply because we use hardware assisted paging (hap) and not shadow page tables.
- CVE-2016-7093/XSA-186 is for HVM virtualization, which did not apply to any customers at the time, and also did not apply to the version we’re running.
- CVE-2016-7092/XSA-185 was the most critical for us. It is a privilege escalation bug that is only exposed to 32-bit paravirtualized (PV) guests. We either patched the host server or moved affected guests off of an unpatched host server to a patched one. 32-bit bootloaders have been removed from host servers that have not been patched.
Guests were already prevented from changing to 32 bit mode on existing systems in July. With Xen, 32-bit PV guests can only boot from the first 168GB of RAM. So on host servers with more than 168GB of RAM, guests switching from 64 bit to 32 bit mode could leave free but unusable RAM.
Unfortunately we did not meet our deadline of not being vulnerable to the XSAs before the embargo lifted. What happened is that two guests running grub2 32-bit were left running on unpatched hosts until about 8 hours after the embargo lifted. I noticed when reviewing the output of xenstore-ls, which lists the live boot parameters of guests. They were moved to patched hosts and given a full month’s credit for the downtime. We have no reason to believe the exploit was exercised by either of these guests during those 8 hours.
There were at least two reasons this initially happened. First, the initial query for 32 vs 64 bit was copied from our old initial provisioning (which defaulted to pv-grub) and wasn’t adjusted to allow for grub2. This was a mistake of inattention. Second, guests not called out as 32 bit were assumed to be 64 bit when there should have been explicit checks for both 32 bit and 64 bit - explicitly calling out both would have left some guests unclassified and exposed the error. I considered doing that, but didn’t because it would have been extra delay and we were already very late sending out downtime notices compared to previous reboots.
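The second point can be illustrated with a small Python sketch. The bootloader labels and the shape of the input are hypothetical, not our actual inventory format; the idea is only that 32-bit and 64-bit are each matched explicitly, so anything unrecognized lands in its own bucket instead of silently counting as 64-bit.

```python
from typing import Dict, List

# Hypothetical bootloader labels; real inventory values may differ.
BOOTLOADERS_32 = {"pv-grub-32", "grub2-32"}
BOOTLOADERS_64 = {"pv-grub-64", "grub2-64"}


def classify_guests(bootloaders: Dict[str, str]) -> Dict[str, List[str]]:
    """Sort guests into 32-bit, 64-bit, and unclassified buckets."""
    result: Dict[str, List[str]] = {"32": [], "64": [], "unclassified": []}
    for guest, loader in sorted(bootloaders.items()):
        if loader in BOOTLOADERS_32:
            result["32"].append(guest)
        elif loader in BOOTLOADERS_64:
            result["64"].append(guest)
        else:
            # Never assume 64-bit by default: surface anything unrecognized.
            result["unclassified"].append(guest)
    return result
```

A non-empty "unclassified" bucket is the signal that the query itself needs attention before any guest is treated as safe.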
We sent downtime notices two days after first learning of this vulnerability. This time, since we were doing so many moves, some notices were only sent 3 days in advance. Everyone had the option to reschedule to a different time. The whole server reboots all finished within the originally scheduled time frame but a few of the moves ran over.
We also brought up our customer cluster running an internal ganeti fork and most of the guests moved were imported to this system. We are still finding bugs and there are a couple of unfinished items, but in general it seems to be working.
We also started defaulting Linux VMs to HVM mode, which is typically faster for 64 bit operating systems. HVM guests are segregated from PV guests so that PV guests do not have to be rebooted for HVM-only XSAs and vice-versa.
We were reluctant to run HVM mode for a long time due to not wanting to run QEMU in the context of the dom0 (host server). Therefore, we are using device model stub domains, which run QEMU inside of a PV guest instead of on the dom0. Most Linux guests use PV-on-HVM drivers and do not use the emulated devices from QEMU after boot. However, NetBSD doesn’t have PV-on-HVM support and the performance was abysmal in HVM mode with device model stubdomains because it was using ATA PIO mode for the virtual hard drive, which is very slow. Because of this we are still defaulting NetBSD guests to PV mode.
Wed, 07 Sep 2016 17:30:00 -0700 - Sarah Newman
UPDATE 2016-09-08 00:31 -0700 PDT: This maintenance is completed.
We will be taking down most of our core services, such as the billing server, at some point between 23:00 and 1:00 Wed Sep 7 for maintenance. User services will continue to run, but you may be unable to access the management console. There will be another post when the maintenance is complete.
Wed, 27 Jul 2016 14:00:00 -0700 - Sarah Newman
Yesterday Xen did public releases of a couple of different Xen Security Advisories. The most critical one for us was CVE-2016-6258/XSA-182, which is titled “x86: Privilege escalation in PV guests”. All systems hosting customer administered systems were patched for this vulnerability at least 24 hours before the embargo lifted.
We sent downtime notices the day after first learning of this vulnerability. Notices were sent at least a week in advance of any downtime, and customers had the option of rescheduling the downtime to be earlier.
Overall these reboots went relatively well, despite our founder Luke being unable to work due to a recent hospital visit. None of the downtimes exceeded our original plan, but in absolute terms we went over by up to about half an hour in a few cases due to starting late.
That being said, there were still a few problems:
- We have one oddball server with two HBA (scsi expansion) adapters rather than one. The server would not boot until the drives were rearranged within the chassis - either different bays or a different drive order fixed it. It did not look like a GRUB problem.
- One server ended up having an extremely slow disk after reboot. I tried to add an external bitmap before temporarily kicking one of the drives and accidentally deadlocked the server by putting the bitmap on top of the RAID device the bitmap was for. The command would have worked with other servers but not the given one due to the root partition being on a different device. I don’t think there’s a prevention strategy here except better adhering to KISS.
- A playbook that shut down services was accidentally run on a server that had already been upgraded, leading to about 10 extra minutes of downtime for around 20 services. It may be possible to prevent this in the future with environment variables, but it depends on people’s individual workflows.
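One way an environment variable guard like the one mentioned in the last item could look is sketched below in Python. The variable name `MAINTENANCE_TARGET` and the function are hypothetical, and this is only one possible approach under the assumption that the operator sets the variable by hand before each maintenance run.

```python
import os
from typing import Mapping


def confirm_target(host: str, env: Mapping[str, str] = os.environ) -> None:
    """Raise unless the operator has explicitly named `host` as the target.

    MAINTENANCE_TARGET is a hypothetical variable name.
    """
    expected = env.get("MAINTENANCE_TARGET")
    if expected != host:
        raise RuntimeError(
            f"refusing to run on {host!r}: MAINTENANCE_TARGET is {expected!r}"
        )
```

Calling this at the start of a destructive step means a stale shell session pointed at last night's server fails loudly instead of shutting services down.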
We also put additional hardware into production and gave a RAM upgrade to around half of the VMs. We don’t have concrete plans for upgrading the rest of the VMs yet but hope to do so within the next few months.
Sat, 02 Jul 2016 11:53:00 -0700 - Sarah Newman
Starting at about 11:40 PDT -0700 and ending at about 11:43 we experienced a DDOS. This was due to lifting the ban on the IP from last night once the customer was about to do a full reinstall. They were running an old copy of wordpress that likely had been compromised.
Fri, 01 Jul 2016 20:05:00 -0700 - Sarah Newman
Starting at about 19:51 PDT -0700 and ending at about 20:03 we experienced a DDOS. It appeared to primarily be spoofed IPs making DNS requests. The targeted IP has been blackholed.
Sat, 11 Jun 2016 15:29:00 -0700 - Sarah Newman
Starting at about 14:45 PDT -0700, ending at 14:50, then starting again at 15:14 and ending at about 15:22 we experienced a DDOS. It was UDP packets targeting both port 53 and port 22. It did not appear to be valid traffic. The targeted IP has been blackholed.
Sat, 07 May 2016 21:02:00 -0700 - Sarah Newman
The pre-generated distributions and pre-downloaded netboot installers were just updated. While the primary reason was to add Ubuntu Xenial, there were also the following changes:
- Ping can be used as a non-root user on CentOS 7, Fedora 22, and Fedora 23. These distributions use capabilities rather than setuid to give the proper permissions to non-root users for ping. GNU tar (the default) does not preserve these extended attributes, so we switched to BSD tar which does.
- ext4 is the default file system.
- grub.cfg, needed for grub2, is present on everything other than CentOS 6.
- Ubuntu precise is no longer available as a pre-generated image, but can be installed still using the netboot installer.
As usual, the source can be found at github.
Sun, 03 Apr 2016 15:20:00 -0700 - Will Crawford
resolver01 (220.127.116.11) is down temporarily. If you’re experiencing large timeouts, please switch the order of resolvers in /etc/resolv.conf. This post will be updated with more information later.
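For anyone scripting the change, here is a minimal Python sketch that swaps the first two `nameserver` lines in the text of a resolv.conf file. The function name is hypothetical, the addresses in the usage example are placeholders from the documentation range rather than our resolvers, and the sketch only transforms text; writing the result back to /etc/resolv.conf is left to the reader.

```python
def swap_first_two_nameservers(resolv_conf: str) -> str:
    """Return resolv.conf text with the first two nameserver lines swapped."""
    lines = resolv_conf.splitlines()
    # Indexes of lines whose first token is the "nameserver" keyword.
    positions = [
        i for i, line in enumerate(lines)
        if line.split()[:1] == ["nameserver"]
    ]
    if len(positions) >= 2:
        first, second = positions[0], positions[1]
        lines[first], lines[second] = lines[second], lines[first]
    return "\n".join(lines) + "\n"


example = "search example.com\nnameserver 192.0.2.1\nnameserver 192.0.2.2\n"
print(swap_first_two_nameservers(example), end="")
```

Files with fewer than two `nameserver` lines are returned unchanged apart from a normalized trailing newline.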
Wed, 16 Mar 2016 22:23:00 -0700 - Sarah Newman
We had an outage from approximately 21:31 PDT -0700 to 22:16 PDT. Currently we have a cross connect from coresite to HE.net in another facility. We are told that coresite had problems with their core switches. We have gathered quotes for alternate bandwidth but have not made a final decision. We will confirm beforehand that whatever cross connect we get for this additional bandwidth will not be dependent on coresite’s network infrastructure.
Wed, 17 Feb 2016 23:10:00 -0800 - Sarah Newman
The distributions and rescue image have been updated to incorporate fixes for CVE-2015-7547. This would have been sooner except that Fedora did not push updates until this morning.
Here are the links to the relevant security information:
Tue, 16 Feb 2016 21:27:00 -0800 - Sarah Newman
UPDATE 2016-02-17 00:06 PST -0800: We are told the maintenance window is closed.
UPDATE 22:04 PST -0800: I think they must have not done the update correctly the first time because there was another ~8 minutes downtime from 21:55 to 22:03. We have not been advised that the maintenance window is closed yet.
We had a network outage of approximately 8 minutes tonight from 21:13 to 21:21 due to an upgrade to the core router software for one of the data centers we have a presence in. There will be a similar outage two days from now.
Fri, 29 Jan 2016 13:00:00 -0800 - Sarah Newman
Here is the report from our upstream on the recent network outages. Please note we received this a couple of days ago but it had a confidentiality clause so we waited to hear back on how much could be shared before posting. We’re also trying to track down some transient packet loss.
January 20, 2016: At approximately 13:20 PST, utility power was interrupted, and the building transferred successfully to the back-up power source. At approximately 16:30 PST, utility power was restored. During restoration, the UPS experienced a fault causing it to shut down. The UPS was manually put into maintenance bypass and service was restored. A service call was placed to the UPS vendor and technicians dispatched. As a diagnostic step, the battery string was replaced.
January 25, 2016: At approximately 12:10 PST we attempted to bring the UPS out of maintenance bypass resulting in the UPS shutting down during the cutover. The unit was placed in maintenance bypass and service was restored shortly thereafter. We engaged the UPS vendor for assistance in troubleshooting.
January 27, 2016: The UPS was inspected, diagnostics run and the main fuse replaced. At approximately 14:30 PST an attempt was made to bring the UPS online. The UPS shut down during the cutover. The unit was placed into maintenance bypass and service was restored.
They say there will be a separate maintenance announcement for replacing the UPS.
Wed, 27 Jan 2016 14:43:00 -0800 - Sarah Newman
We had a network outage of 6 minutes from about 14:33 to 14:39 -0800. The last word from HE.net was that they would get us a report from PGE hopefully by the end of the week. After the outage this Monday, we started collecting quotes for additional transit and hope to come to a decision by the end of this week.
Mon, 25 Jan 2016 12:24:00 -0800 - Sarah Newman
UPDATE 17:38 -0800: HE.net says they are still investigating. The other BGP peers for 55 S. Market have similar uptimes to ours so evidence points to something at 55 S. Market.
We had a network outage of 6 minutes from about 12:12 to 12:18 PST -0800. The symptoms were quite similar to before. I will post more details as they become available.
Wed, 20 Jan 2016 17:26:00 -0800 - Luke Crawford
Apparently there are power issues in the HE.net portion of 55 S. Market.
The coresite notice looked something like this:
Our utility power provider has updated our estimated utility restoration time to 1715 PST. CoreSite Engineers remain onsite and will continue to monitor all systems.
Thu, 14 Jan 2016 13:56:00 -0800 - Sarah Newman
Update 15:56 -0800 PST: Our upstream says the router maintenance is complete.
Our network connection is occasionally dropping. I will post updates as they become available. From our upstream:
We are working on replacing one of our core routers in San Jose which failed early this morning. Engineers and replacement equipment are on site. You may continue to experience brief connection interruptions until work is complete. We do not at this time have an ETA for completion of the repair.
Tue, 12 Jan 2016 00:58:00 -0800 - Sarah Newman
UPDATE 02:15: It looks like the power supply is bad. The drives were moved to a comparable chassis and all the instances are back online.
We do not have any error messages. Looking into it.
Mon, 04 Jan 2016 15:30:00 -0800 - Will Crawford
https://billing.prgmr.com will go down briefly tonight around 9pm for maintenance.
Thu, 17 Dec 2015 10:07:00 -0800 - Sarah Newman
As you may be aware already, Xen recently did public releases of several Xen Security Advisories. As a public service provider, we believe it is a requirement to be fully patched for any critical vulnerabilities by the time of public release. Specifically, we will do upgrades for information leaks and for privilege escalation vulnerabilities, but not DOS vulnerabilities, since currently we would be disrupting service to apply them.
This particular round of reboots was primarily due to XSA-155. XSA-155 can lead to privilege escalation, which matches one of the advisories we will perform upgrades for.
We barely met our goal to be fully patched by the time of public release - the last round of VMs came up about 5 minutes before the embargo deadline. We aren’t sure if this is terrible or perfect planning since we gave as much notice as we could before doing the reboots.
Here is a too detailed account of what happened:
Friday December 4th: The first notice for XSA 155 was received this morning. I reviewed our current set of servers and came up with an upgrade schedule. The schedule was structured with the following goals:
- Customers would receive at least one week’s notice that they would be rebooted.
- There would be one day with few reboots ahead of the bulk of the work, so that we could work out updates to our ansible playbooks and procedure with only one server down at a time, and this would be on a day outside business hours for most people (i.e. Friday night).
- These test upgrades would address at least one of each operating system we ran at the time, specifically CentOS 5 and CentOS 6.
- The upgrades we believed could be the most problematic were to be scheduled for the weekend.
- The majority of reboots would occur on the weekend for the majority of our customers - most of our customers are in the US. So the majority of the updates happened on Saturday and Sunday. These started at 20:00 PST and were originally scheduled to end at 2:00.
- Any upgrades that had to occur during the week would be at night for the majority of our customers - this ended up originally being 21:00 and 23:00.
Of course this changed, as I will get into later.
The above schedule was completed the same day. All initial notification emails were sent by 20:00 PST -0800, starting with reboots on December 11th.
We found out later some people did not see this email. We don’t know if the email simply got missed or if it went to spam. Please white list prgmr.com and add any extra emails you want to get notified as technical contacts; see our wiki page on managing contacts.
Also on Friday, an email was sent to CentOS requesting any updates to packages so we could start from the latest.
Saturday 5th and Sunday 6th: spent trying to get the next version of the distributions finished (was not able to in time)
Monday 7th: Sanity day
Tuesday 8th: We were initially warned that there were circumstances under which the patches wouldn’t be effective, specifically if a compiler had gcc bug 58145. Since there was no CVE for the bug and no good way to know which versions were affected, I went back to the original test case for that bug and ran it for the CentOS 6 and CentOS 5 compilers. I am not terribly familiar with x86 assembly so it took me a while to go through each line and understand what was going on, but I confirmed that the compiler bug was not present for CentOS 6 and CentOS 5.
Wednesday 9th: Reviewed what XSAs were applied to CentOS 6 to try to confirm there would not be any regressions. I worked backwards and mostly trusted the changelog. There was one XSA not applied (142) but it does not impact us since it is for file backed storage. I sent an email to the CentOS-virt sig informing the group. For the XSA review, I went back as far as XSA 60 but at some point they did a complete import from vanilla xen and I decided that was probably far enough.
I applied the patches to the latest Xen4CentOS package and rebuilt. It was first loaded on an HVM dom0, during which we found that oxenstored did not work properly with xl console so that change got added to our ansible roles. It next got loaded on our test box.
Thursday 10th: I reviewed the XSA history for CentOS 5 and discovered that Red Hat had not patched some CVEs that we performed reboots for back in February and March of this year - most specifically CVE-2015-2151. We had regressions on two machines that were updated to the latest CentOS 5 due to this. I also looked at backporting the patches for XSA-155 and found that the code differed so much that I did not have any confidence in my ability to correctly patch the code.
At this point I didn’t see any option but to upgrade to CentOS 6 from CentOS 5 (we’ve already been running CentOS 6 in production for a year) which was a significant change in strategy. However, I had done some work on this back in December 2014 and it had mostly been working, so it was not a stretch to change plans this late.
Doing the CentOS 5 to CentOS 6 change required booting into rescue mode. This was practical for all of our machines except for the two in Sacramento that were already scheduled to be decommissioned. We made the decision to take a server up there already with CentOS 6 installed (at least that was the plan) and then move the drives from the old machines to the new ones. The first of the ones in Sacramento had initially been scheduled for Friday night but we weren’t going to be ready, so I rescheduled from Friday to Saturday night. Since Luke was going to have to slog up to Sacramento etc. and I was probably going to need to help with the migration, I rescheduled two more updates from Saturday night to Tuesday and Wednesday night at 21:00 -0800 PST respectively. These were all sent with at least 24 hours notice. We started memory testing on the machine slated for Sacramento.
Friday 11th: Because there was not enough time to test but Friday is probably the best time for downtime, I didn’t want to reschedule to a different night. So I sent an email to the CentOS 5 customers for that night letting them know I didn’t know exactly when the maintenance was going to start or end. This happened about 7.5 hours before the window was scheduled to start.
I worked through changes to our playbook for updating CentOS 6 with our test machine. Since there had been other reboots for XSAs with CentOS 6 quite recently I did not do a full run. I ran through some of the procedure for the CentOS 5 to CentOS 6 upgrade with an HVM dom0 but was not able to complete this before the scheduled time for the CentOS 6 upgrade. The CentOS 6 upgrade went well. After this I booted a decommissioned CentOS 5 host and went through the full procedure for the upgrade. The upgrade for the production CentOS 5 server started about 3 hours late (~ 1:00 -0800 PST Saturday) and ended about 2 hours late (~ 2:00).
Saturday 12th: I got most of the way through installing CentOS 6 on the new Sacramento server, but was not able to finish by the time Luke needed to catch his train.
Saturday night did not go as well. Here are some of the problems encountered:
- Difficulty breaking into bios to do netboot due to lack of function keys on chromebook
- Changes in names of network devices between CentOS 5 and CentOS 6. On subsequent nights this was avoided because we added a template for /etc/udev/rules.d/70-persistent-net.rules to the ansible roles.
- chroot script did not unmount things in the correct order for all hosts and it was overly aggressive about cleaning up after itself, so the work for at least a couple of servers was lost and had to be redone. The script was fixed that night.
- The procedure was not as well documented as it should have been and Friday night wasn’t used as a training run, so people missed some steps and misunderstood some steps.
- Some people were running a more modern version of ansible that implemented the yum module differently, such that a repository was not getting installed before use.
- On one server, the serial console text turned black during the boot process so we initially thought a server wasn’t booting when it was.
- On one server, there was a domain that should have been canceled but wasn’t such that the last domain to boot couldn’t because there wasn’t enough memory, and we didn’t notice until someone complained. We had had checks for things like this before and they got added back in for subsequent nights.
- Our ansible playbooks had some hacks for older servers that were originally conditional on CentOS 5 but shouldn’t have been any more and we didn’t discover this until customer reports the next day of not being able to access the management console. It was fixed as soon as we found out about it.
- A number of VMs failed to boot for problems we did not have time to look into, but we sent emails about it. More on this later.
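The memory problem in the list above (a domain that should have been canceled leaving too little room for the last domain to boot) is the kind of thing a pre-boot check can catch. The following is a minimal sketch of such a check, not our actual one: the function name, the domain names and the memory figures are hypothetical, and it ignores any per-domain overhead Xen itself needs.

```python
from typing import Iterable, List, Tuple


def domains_that_will_not_fit(
    available_mib: int, domains: Iterable[Tuple[str, int]]
) -> List[str]:
    """Return the names of domains that would not start for lack of memory.

    `domains` is an iterable of (name, memory_mib) pairs in boot order.
    `available_mib` is the memory left for guests after dom0 (an input the
    caller has to supply; this sketch does not query Xen).
    """
    remaining = available_mib
    will_not_fit = []
    for name, memory_mib in domains:
        if memory_mib <= remaining:
            # This domain fits, so reserve its memory.
            remaining -= memory_mib
        else:
            # Not enough memory left by the time this domain boots.
            will_not_fit.append(name)
    return will_not_fit


if __name__ == "__main__":
    # Hypothetical host: 4096 MiB for guests, one stale domain still configured.
    boot_order = [("alice", 2048), ("stale-canceled", 1024), ("bob", 2048)]
    print(domains_that_will_not_fit(4096, boot_order))
```

Running a check like this before starting domains, and refusing to report the host as "up" while the returned list is non-empty, would flag the problem before a customer does.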
On average this ran an hour over the scheduled maintenance window. For the Sacramento servers, we had to finish installing CentOS 6 and also had to write/test the playbooks to migrate the customer data. The actual downtime for the Sacramento customers was probably less than half an hour but it occurred about an hour and a half after the scheduled maintenance window end.
Sunday 13th: We got paged and spent some time on network issues due to another HE.net customer being DOS’ed. This is significant only so far as it interfered with other work. There were new problems due to:
- ssh taking a while to die, such that we had to start manually killing it.
- A bug in an ansible playbook had been fixed a while ago, but the buggy changes it originally made weren’t, such that some xen configuration files were invalid.
- Different bios versions having different navigational methods.
- Unusual raid configurations due to us adding on disks later combined with drive letter changes meant assembling the raid by hand was difficult, and --assemble --scan didn’t always quite work.
- For some servers we encountered kernel messages related to mptsas: swiotlb full under heavy disk load. One server was down for an additional two and a half hours beyond what was originally scheduled while we tried to make it go away. Research eventually suggested this is a performance problem only, at which point we decided to let it be. One host server that was upgraded early is occasionally throwing this error, but we picked a larger value for subsequent hosts and for these we aren’t seeing problems under normal operation. We intend to investigate this in further detail later, as if and when we start using SSDs it will probably become a limiting factor.
- Save/restore of memory (equivalent to hibernation) by xendomains wasn’t being disabled as it should have been and for one server that had to be rebooted to increase the value of swiotlb, it started saving the VM memories which ended up filling the disk. I tried to make it so that the machines that were saved would be resumed again after the reboot, but the change we originally intended (to disable save/restore) was in place so those domains weren’t restored. This meant those 5 or 6 VMs were uncleanly shut down. Disabling of save/restore was fixed in our ansible playbooks for subsequent runs.
- One server that had been switched between chassis didn’t have the right mac address in its DHCP entry. This is one thing that hasn’t been templated.
- One server had its ip6tables rules written out while it only had an ipv4 address, so ipv6 wasn’t working for it after reboot though it worked for the VMs. We removed the ipv6 AAAA records for it instead of trying to fix the ip6tables rules.
All of the servers, except the one that ran 2.5 hours over, ran one hour over the originally scheduled window.
Monday 14th: We got more concrete reports about VMs failing to boot under Xen 4. The worst case was the person who was running Fedora 17 as there is no upgrade path. We ended up giving them a clean install of CentOS 7 and attached their old block device as an extra disk.
We had previously moved at least a couple of hundred VMs from Xen 3 to Xen 4 so this was a surprise to us. Unfortunately we didn’t see a way to stay on Xen 3 and fix the XSAs. Possibly we should have left one box as CentOS 5 with the understanding it was going to be vulnerable, but we didn’t have time to set up a new server as we didn’t want to use an existing server with other customer data on it.
We wanted to delay the schedule to give people more time to upgrade the kernels, but this would have made it even more difficult to be patched by the time the embargo was lifted.
We wrote wiki pages on accessing your disk from the rescue image and guest kernel changes for xen 4 and sent an email to the remaining people who were going to be upgraded onto Xen 4. It was very frustrating because people would write us asking if their kernel versions would crash and we couldn’t really answer because we didn’t (and don’t) know what the underlying problem was, plus there are too many kernel versions out there. This is the downside to running your own kernel - we can’t test every version. In the future we will consider an option for us to manage people’s kernels instead of using pvgrub(2).
The upgrades for Monday night all completed within their window.
Problems this night:
- One server had its upgrades completed within the maintenance window but I fat fingered my password without noticing and the emails about the server going up weren’t sent.
- One server had the module line for initramfs missing in grub.conf (a known problem with centos) and since we haven’t gotten to adding a playbook entry to fix it, we’ve been hand fixing it when it happens. But there are differences in the partition layouts and the wrong version was being used such that the initramfs was not being found, which delayed the upgrade completing.
- One server had a root partition much smaller than normal and there wasn’t enough room to install a new kernel onto it, so space had to be cleared out on it.
- One server, after being rebooted into the CentOS 6 rescue, was throwing warnings about some physical volumes for LVM not being found. One partition that had originally been in the raid was also not marked as being a raid component after reboot. From the CentOS 5 root file system the physical volumes were identified correctly. Since continuing with CentOS 6 seemed like a risky idea, we rebooted into the original CentOS 5 and used our move playbook to move VMs individually from this server to another server that had enough room. I considered other methods but the move playbook is well tested and I knew it would work, even if it was slower than it could have been by doing a more direct data transfer. People on this server had up to an extra 8 hours of downtime, though it was in the middle of the night for most of the customers and we tried to prioritize moves based on time zone. We have given them a month’s credit already.
- Due to this one server being extremely problematic and requiring attention from multiple people, another server was rescheduled from Tuesday to Wednesday night with less than 24 hours notice.
Two of the servers were completed on time, one was an hour late, and as mentioned one was up to 8 hours late.
- When a drive was replaced on a CentOS 5 host during the maintenance window but prior to it being upgraded, there was a null pointer bug in the kernel and it spontaneously rebooted, so all those people were shut down uncleanly. We hope the same problem does not exist in CentOS 6.
- The pxe boot menu was generally working with a serial console number of ‘0’ even if the linux serial console number was something other than 0. On one host this was not the case and it took a number of reboots before isolating and fixing the problem, since none of the other hosts (even ones hypothetically of the same configuration) had the same issue.
- Someone forgot to start the VMs after completing the upgrade so there was about an additional hour and a half of unnecessary downtime for most people on that host server. We have “canaries” that should be in our monitoring but nobody has gotten to it. It should be easy to generate the configuration files for them.
- One host was going OK during the upgrade but spontaneously shut down during the rescue image portion of the upgrade. I went and looked at the chassis and it had a red blinking light for health. I dug a little but it seemed best to move the drives to a new chassis. After ironing out a few issues with netboot it went smoothly, but it ran over the maintenance window by about 1.5 hours due to the hardware failure.
- The final upgrade went smoothly - it took about an hour and 15 minutes - and completed about 5 minutes before the embargo lifted. This was the one that was rescheduled from Tuesday night. I made the window 4 hours instead of 2 because I wanted to do it as soon as the previously scheduled upgrades had completed.
The public facing server for our own infrastructure was upgraded in about half an hour.
We intend to give a proportional credit (i.e. 1/744 of a month’s payment for 1 hour of downtime) for all of the maintenance outside of the originally scheduled window, with the exception of the server where we already gave a month’s credit. This should be completed within the next week.
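To make the arithmetic concrete, here is a minimal sketch of the proportional credit. The 744 comes from 31 days times 24 hours; the function name, rounding to the nearest cent, and the example prices are assumptions for illustration, not a description of our billing system.

```python
from decimal import Decimal, ROUND_HALF_UP

# 31 days * 24 hours: the denominator used for the proportional credit.
HOURS_PER_BILLING_MONTH = 744


def downtime_credit(monthly_payment: Decimal, hours_over_window: Decimal) -> Decimal:
    """Credit for downtime outside the scheduled window.

    Rounding to the nearest cent is an assumption made for this sketch.
    """
    if hours_over_window < 0:
        raise ValueError("hours_over_window must be non-negative")
    raw = monthly_payment * hours_over_window / HOURS_PER_BILLING_MONTH
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    # Hypothetical plan prices and overrun times.
    print(downtime_credit(Decimal("74.40"), Decimal("1")))
    print(downtime_credit(Decimal("20.00"), Decimal("1.5")))
```

`Decimal` is used rather than floats so that money amounts round predictably.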
We need to investigate the mptsas/swiotlb kernel issue more.
We need to automate changing people’s public keys. This has been a TODO for a long time but usually support is light enough we are able to respond in a reasonable amount of time. But during this, it was very urgent for some people and we were extremely short staffed so sometimes it did not happen in time and people suffered. But there was a hard deadline and so we had to prioritize the needs of the majority of our customers. I highly recommend people check out GPG smart cards or possibly yubikeys so that their public key is not tied to a particular computer.
We should add the already allocated VM canaries to our monitoring.
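As a rough idea of how generating the canary configuration could look, the sketch below renders one nagios host definition per canary. The canary names, the addresses (taken from the documentation range) and the `generic-host` template name are all assumptions; a real definition would inherit or set whatever other directives our nagios setup requires.

```python
from typing import Iterable, Tuple

# Standard nagios object-definition layout; the directives shown are the
# bare minimum and rely on the inherited template for everything else.
HOST_TEMPLATE = """define host {{
    use        {template}
    host_name  {name}
    address    {address}
}}
"""


def render_canary_hosts(
    canaries: Iterable[Tuple[str, str]], template: str = "generic-host"
) -> str:
    """Render host definitions for (name, address) pairs, sorted by name."""
    return "\n".join(
        HOST_TEMPLATE.format(template=template, name=name, address=address)
        for name, address in sorted(canaries)
    )


if __name__ == "__main__":
    # Hypothetical canaries, one per host server.
    print(render_canary_hosts([
        ("canary-b", "192.0.2.11"),
        ("canary-a", "192.0.2.10"),
    ]))
```

Sorting the output keeps the generated file stable between runs, so a diff only shows canaries that were actually added or removed.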
We intend to write a follow up post discussing the different strategies for decreasing downtime. It won’t be something we can implement in the short term for all our hosts. However, now that all of our servers are running the same operating system, and one that is more modern, this should help us a lot with maintenance and long term goals.
Sun, 13 Dec 2015 13:11:00 -0800 - Will Crawford
UPDATE 16:05: it appears that our network outage was due to a large DDOS on another customer of our upstream. They say that the DDOS is now mitigated.
We’re currently experiencing some issues with our network and we’re working to resolve them.
Sat, 12 Dec 2015 18:45:00 -0800 - Sarah Newman
Update 19:18 : wiki.prgmr.com is back up.
The wiki will be down for up to an hour. I’ll update this entry when it’s back online.
Tue, 24 Nov 2015 15:20:00 -0800 - Sarah Newman
If you are running jenkins, please update it to the latest version and check your logs (including your mail log) for any suspicious activity. A zero-day in jenkins was patched on November 11. If you believe you have been compromised and want help reinstalling, please contact support.
Thu, 29 Oct 2015 10:30:00 -0700 - Sarah Newman
Update 16:01 PDT: In case it wasn’t clear, we are on the pre-disclosure list and all of the reboots to apply patches to public facing machines happened before public disclosure.
Here is a rundown of our vulnerability and response for the xen security advisories released today:
- xsa-145 - Arm only, not affected
- xsa-146 - Arm only, not affected
- xsa-147 - Arm only, not affected
- xsa-148 - This was patched in affected public facing systems. This is high impact - “malicious PV guest administrators can escalate privilege so as to control the whole system.” While the minimum vulnerable version is specified as xen 3.4, I still reviewed the commit specified as being the source of the vulnerability in the attached patches, the patches supplied with the original XSA (which do not apply to Xen 3.4), and additional patches later sent out by someone against 3.4 to verify that the remainder of our systems were not affected.
- xsa-149 - Not vulnerable because we use xl.
- xsa-150 - All publicly facing systems with HVM guests have been patched, though the only such systems are test systems wholly used by prgmr.com.
- xsa-151 - Patched in the affected public facing systems. This is a denial of service attack that we might have caught due to a job that runs each night to find, shutdown, and notify us of rebooting guests.
- xsa-152 - Patched in the affected subset of the systems also patched for xsa-148 and xsa-151, not patched in the remainder of the systems. Since the result is a denial of service attack and not a privilege escalation, we will address this as needed, as patching it would have led to loss of service as well. Systems that exploit this vulnerability will be apparent from the log messages.
- xsa-153 - Not vulnerable because we do not use HVM guests, and if we did we would still not be vulnerable - we would not use memory populate-on-demand as we do not oversubscribe ram.
Mon, 26 Oct 2015 11:18:00 -0700 - Sarah Newman
The following services will be taken down for maintenance Tuesday the 27th at 22:00 PDT -0700:
- Our ticketing system
I’m allowing 3 hours for the maintenance to complete.
Sat, 24 Oct 2015 10:58:00 -0700 - Sarah Newman
Update 16:34 -0700: The reason why this happened is that someone, when setting up the nagios monitoring, used default values from some example. It turns out we were not paging until we got to 60% packet loss. Warn at 2% loss and page at 6% loss is way more reasonable. Luke changed the nagios configuration to use these new, lower thresholds.
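The difference between the old and new thresholds is easy to see in a small sketch. This is not the nagios check itself, only the decision logic with the numbers from this update; the function names are made up, and treating a loss exactly at a threshold as crossing it is an assumption.

```python
WARN_LOSS_PERCENT = 2.0  # warn at 2% packet loss
PAGE_LOSS_PERCENT = 6.0  # page at 6% packet loss (the old example value was 60%)


def packet_loss_percent(sent: int, received: int) -> float:
    """Percentage of probe packets that were lost."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    if not 0 <= received <= sent:
        raise ValueError("received must be between 0 and sent")
    return 100.0 * (sent - received) / sent


def classify_loss(loss_percent: float) -> str:
    """Map a loss percentage to 'ok', 'warn' or 'page'."""
    if loss_percent >= PAGE_LOSS_PERCENT:
        return "page"
    if loss_percent >= WARN_LOSS_PERCENT:
        return "warn"
    return "ok"


if __name__ == "__main__":
    # 50% loss is a serious outage, yet it would not have paged at 60%.
    print(classify_loss(packet_loss_percent(100, 50)))
```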
Update 13:25 -0700: ipv6 was disrupted at the time we switched ports. ipv6 connectivity should now be restored.
Update: The immediate problem is fixed. Luke changed the physical ports on both sides of the connection and it appears that there’s a problem with the original port in use on lefanu. While there might be a hardware problem, that’s not 100% clear. We haven’t decided what to do about it long term. We’ll put some effort today into figuring out how to tweak our monitoring such that we get paged for a problem like this.
- Lefanu is experiencing high inbound packet loss. We are investigating potential physical issues as this machine is identical to another which should have an identical software and hardware configuration. Affected customers will receive a credit for the downtime.
The larger problem is why our monitoring tools did not alert us; we will be looking into how to add or adjust the thresholds.
Sun, 18 Oct 2015 21:50:00 -0700 - Sarah Newman
UPDATE 2015-10-18 22:15 -0700 PDT: All affected instances should be back up.
We will begin shutting down domains in about 10 minutes.
Tue, 13 Oct 2015 00:00:00 -0700 - Sarah Newman
Our provider will be replacing the reboot (power) switch for cattle and girdle and will need to unplug them and move them into the new reboot switch. They will be calling us when they’re ready to make the switch so hopefully the actual downtime will be limited to about 15 minutes.
Sat, 10 Oct 2015 12:26:00 -0700 - Sarah Newman
UPDATE 2015-10-10 12:49 -0700 PDT: Power has been restored.
Our data center in Santa Clara has informed us that they are currently using backup power as of 11:45 -0700 PDT. We have not experienced any problems so far, but if things suddenly go down this is why. We will post again if there are any updates.
Tue, 06 Oct 2015 13:00:42 -0700 - Sarah Newman
UPDATE 19:54 PDT -0700 (srn): We were down completely for about 20 minutes because I thought the IP was not black holed due to the bgp daemon not being notified of the configuration change. I reloaded the configuration change but this broke the routing table. Eventually I restarted the router and it was back to normal.
Our upstream black holed the offending IP so we should be OK for now. I am really sorry about this. This is the first time we have experienced a DDOS with the current router. I will review our configuration with Luke and with our upstream and try to figure out what I was doing wrong.
UPDATE 19:27 PDT -0700: The blackhole rule did not work properly. I (srn) have contacted our upstream provider for assistance and am continuing to work on it.
UPDATE 18:48 PDT -0700: it has been temporarily resolved; it would have been faster except I (srn) had trouble looking up how to take care of it. We will be following up with the targeted customer.
We are currently experiencing an incoming SSDP attack. The owners are currently working to resolve the situation. We will post updates as we have them.
If you are having issues, or need help with anything please contact email@example.com