Archive of posts with tag 'operations'
-
Maintenance for Billing
Mon, 04 Jan 2016 15:30:00 -0800 - Will Crawford
https://billing.prgmr.com will go down briefly tonight around 9pm for maintenance.
-
Recent software upgrade status and report
Thu, 17 Dec 2015 10:07:00 -0800 - Sarah Newman
As you may already be aware, the Xen project recently published several Xen Security Advisories. As a public service provider, we believe it is a requirement to be fully patched for any critical vulnerabilities by the time of public release. Specifically, we will do upgrades for information leaks and for privilege escalation vulnerabilities, but not for DOS vulnerabilities, since at present we would have to disrupt service to apply those patches.
This particular round of reboots was primarily due to XSA-155. XSA-155 can lead to privilege escalation, which is one of the classes of vulnerability we will perform upgrades for.
We barely met our goal to be fully patched by the time of public release - the last round of VMs came up about 5 minutes before the embargo deadline. We aren’t sure if this is terrible or perfect planning since we gave as much notice as we could before doing the reboots.
Here is a too detailed account of what happened:
Friday December 4th: The first notice for XSA 155 was received this morning. I reviewed our current set of servers and came up with an upgrade schedule. The schedule was structured with the following goals:
- Customers would receive at least one week’s notice that they would be rebooted.
- There would be one day with only a few reboots in advance of the bulk of the work, so that we could work out updates to our ansible playbooks and procedure while only one server was down at a time, and this would be on a day outside business hours for most people (i.e. Friday night).
- These test upgrades would address at least one of each operating system we ran at the time, specifically CentOS 5 and CentOS 6.
- The upgrades we believed could be the most problematic were to be scheduled for the weekend.
- The majority of reboots would occur on the weekend for the majority of our customers - most of our customers are in the US. So the majority of the updates happened on Saturday and Sunday. These started at 20:00 PST and were originally scheduled to end at 2:00.
- Any upgrades that had to occur during the week would be at night for the majority of our customers - this ended up originally being 21:00 and 23:00.
Of course this changed, as I will get into later.
The above schedule was completed the same day. All initial notification emails were sent by 20:00 PST -0800, with reboots starting on December 11th.
We found out later some people did not see this email. We don’t know if the email simply got missed or if it went to spam. Please whitelist prgmr.com and add any extra email addresses you want notified as technical contacts; see our wiki page on managing contacts.
Also on Friday, an email was sent to CentOS requesting any updates to packages so we could start from the latest.
Saturday 5th and Sunday 6th: spent trying to get the next version of the distributions finished (was not able to in time)
Monday 7th: Sanity day
Tuesday 8th: We were initially warned that there were circumstances under which the patches wouldn’t be effective, specifically if a compiler had gcc bug 58145. Since there was no CVE for the bug and no good way to know which versions were affected, I went back to the original test case for that bug and ran it for the CentOS 6 and CentOS 5 compilers. I am not terribly familiar with x86 assembly so it took me a while to go through each line and understand what was going on, but I confirmed that the compiler bug was not present for CentOS 6 and CentOS 5.
Wednesday 9th: Reviewed what XSAs were applied to CentOS 6 to try to confirm there would not be any regressions. I worked backwards and mostly trusted the changelog. There was one XSA not applied (142) but it does not impact us since it is for file backed storage. I sent an email to the CentOS-virt sig informing the group. For the XSA review, I went back as far as XSA 60 but at some point they did a complete import from vanilla xen and I decided that was probably far enough.
I applied the patches to the latest Xen4CentOS package and rebuilt. It was first loaded on an HVM dom0, during which we found that oxenstored did not work properly with xl console so that change got added to our ansible roles. It next got loaded on our test box.
Thursday 10th: I reviewed the XSA history for CentOS 5 and discovered that Redhat had not patched some CVEs that we performed reboots for back in February and March of this year - most specifically CVE-2015-2151. We had regressions on two machines that were updated to the latest CentOS 5 due to this. I also looked at backporting the patches for XSA155 and found that the code differed so much that I did not have any confidence in my ability to correctly patch the code.
At this point I didn’t see any option but to upgrade to CentOS 6 from CentOS 5 (we’ve already been running CentOS 6 in production for a year) which was a significant change in strategy. However, I had done some work on this back in December 2014 and it had mostly been working, so it was not a stretch to change plans this late.
Doing the CentOS 5 to CentOS 6 change required booting into rescue mode. This was practical for all of our machines except for the two in Sacramento that were already scheduled to be decommissioned. We made the decision to take a server up there already with CentOS 6 installed (at least that was the plan) and then move the drives from the old machines to the new ones. The first of the ones in Sacramento had initially been scheduled for Friday night but we weren’t going to be ready, so I rescheduled from Friday to Saturday night. Since Luke was going to have to slog up to Sacramento etc. and I was probably going to need to help with the migration, I rescheduled two more updates from Saturday night to Tuesday and Wednesday night at 21:00 -0800 PST respectively. These notices were all sent with at least 24 hours notice. We started memory testing on the machine slated for Sacramento.
Friday 11th: Because there was not enough time to test but Friday is probably the best time for downtime, I didn’t want to reschedule to a different night. So I sent an email to the CentOS 5 customers for that night letting them know I didn’t know exactly when the maintenance was going to start or end. This happened about 7.5 hours before the window was scheduled to start.
I worked through changes to our playbook for updating CentOS 6 with our test machine. Since there had been other reboots for XSAs with CentOS 6 quite recently I did not do a full run. I ran through some of the procedure for the CentOS 5 to CentOS 6 upgrade with an HVM dom0 but was not able to complete this before the scheduled time for the CentOS 6 upgrade. The CentOS 6 upgrade went well. After this I booted a decommissioned CentOS 5 host and went through the full procedure for the upgrade. The upgrade for the production CentOS 5 server started about 3 hours late (~ 1:00 -0800 PST Saturday) and ended about 2 hours late (~ 2:00).
Saturday 12th: I got most of the way through installing CentOS 6 on the new Sacramento server, but was not able to finish by the time Luke needed to catch his train.
Saturday night did not go as well. Here are some of the problems encountered:
- Difficulty breaking into bios to do netboot due to lack of function keys on chromebook
- Changes in names of network devices between CentOS 5 and CentOS 6. On subsequent nights this was avoided because we added a template for /etc/udev/rules.d/70-persistent-net.rules to the ansible roles.
- chroot script did not unmount things in the correct order for all hosts and it was overly aggressive about cleaning up after itself, so the work for at least a couple of servers was lost and had to be redone. The script was fixed that night.
- The procedure was not as well documented as it should have been and Friday night wasn’t used as a training run, so people missed some steps and misunderstood some steps.
- Some people were running a more modern version of ansible that implemented the yum module differently, such that a repository was not getting installed before use.
- On one server, the serial console text turned black during the boot process so we initially thought a server wasn’t booting when it was.
- On one server, there was a domain that should have been canceled but wasn’t such that the last domain to boot couldn’t because there wasn’t enough memory, and we didn’t notice until someone complained. We had had checks for things like this before and they got added back in for subsequent nights.
- Our ansible playbooks had some hacks for older servers that were originally conditional on CentOS 5 but shouldn’t have been any more and we didn’t discover this until customer reports the next day of not being able to access the management console. It was fixed as soon as we found out about it.
- A number of VMs failed to boot for problems we did not have time to look into, but we sent emails about it. More on this later.
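The memory problem in the list above (a domain that should have been canceled leaving too little memory for the last domain to boot) is the kind of thing a pre-boot check can catch. The following is only a minimal sketch of such a check, not the check we actually run: the function name, the inputs, and the idea of reserving a fixed amount for dom0 are all assumptions for illustration, and it ignores any per-domain overhead in Xen.

```python
def domains_that_fit(host_memory_mb, dom0_memory_mb, domains):
    """Split domains into those that fit in host memory and those that do not.

    domains is a list of (name, memory_mb) tuples in boot order.
    All values here are hypothetical inputs, not read from Xen.
    """
    available = host_memory_mb - dom0_memory_mb
    fit, not_fit = [], []
    for name, memory_mb in domains:
        if memory_mb <= available:
            # This domain can boot; reserve its memory.
            available -= memory_mb
            fit.append(name)
        else:
            # Not enough memory left by the time we reach this domain.
            not_fit.append(name)
    return fit, not_fit


if __name__ == "__main__":
    boot_order = [("a", 1024), ("b", 1024), ("stale", 1024), ("c", 1024)]
    fit, not_fit = domains_that_fit(4096, 1024, boot_order)
    print("will boot:", fit)
    print("will not boot:", not_fit)
```

Running something like this against the list of configured domains before the reboot would flag the last domain, rather than leaving it to a customer report.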
On average this ran an hour over the scheduled maintenance window. For the Sacramento servers, we had to finish installing CentOS 6 and also had to write/test the playbooks to migrate the customer data. The actual downtime for the Sacramento customers was probably less than half an hour but it occurred about an hour and a half after the scheduled maintenance window end.
Sunday 13th: We got paged and spent some time on network issues due to another HE.net customer being DOS’ed. This is significant only so far as it interfered with other work. There were new problems due to:
- ssh taking a while to die, such that we had to start manually killing it.
- A bug in an ansible playbook had been fixed a while ago, but the buggy changes it originally made weren’t, such that some xen configuration files were invalid.
- Different bios versions having different navigational methods.
- Unusual raid configurations due to us adding on disks later combined with drive letter changes meant assembling the raid by hand was difficult, and --assemble --scan didn’t always quite work.
- For some servers we encountered kernel messages related to mptsas: swiotlb full under heavy disk load. One server was down for an additional two and a half hours beyond what was originally scheduled while we tried to make it go away. Research eventually suggested this is a performance problem only, at which point we decided to let it be. One host server that was upgraded early is occasionally throwing this error, but we picked a larger value for subsequent hosts and for these we aren’t seeing problems under normal operation. We intend to investigate this in further detail later, as if and when we start using SSDs it will probably become a limiting factor.
- Save/restore of memory (equivalent to hibernation) by xendomains wasn’t being disabled as it should have been and for one server that had to be rebooted to increase the value of swiotlb, it started saving the VM memories which ended up filling the disk. I tried to make it so that the machines that were saved would be resumed again after the reboot, but the change we originally intended (to disable save/restore) was in place so those domains weren’t restored. This meant those 5 or 6 VMs were uncleanly shut down. Disabling of save/restore was fixed in our ansible playbooks for subsequent runs.
- One server that had been switched between chassis didn’t have the right mac address in its DHCP entry. This is one thing that hasn’t been templated.
- One server had its ip6tables rules written out while it only had an ipv4 address, so ipv6 wasn’t working for it after reboot though it worked for the VMs. We removed the ipv6 AAAA records for it instead of trying to fix the ip6tables rules.
All of the servers, except the one that ran 2.5 hours over, ran one hour over the originally scheduled window.
Monday 14th: We got more concrete reports about VMs failing to boot under Xen 4. The worst case was the person who was running Fedora 17 as there is no upgrade path. We ended up giving them a clean install of CentOS 7 and attached their old block device as an extra disk.
We had previously moved at least a couple of hundred VMs from Xen 3 to Xen 4 so this was a surprise to us. Unfortunately we didn’t see a way to stay on Xen 3 and fix the XSAs. Possibly we should have left one box as CentOS 5 with the understanding it was going to be vulnerable, but we didn’t have time to set up a new server as we didn’t want to use an existing server with other customer data on it.
We wanted to delay the schedule to give people more time to upgrade the kernels, but this would have made it even more difficult to be patched by the time the embargo was lifted.
We wrote wiki pages on accessing your disk from the rescue image and guest kernel changes for xen 4 and sent an email to the remaining people who were going to be upgraded onto Xen 4. It was very frustrating because people would write us asking if their kernel versions would crash and we couldn’t really answer because we didn’t (and don’t) know what the underlying problem was, plus there are too many kernel versions out there. This is the downside to running your own kernel - we can’t test every version. In the future we will consider an option for us to manage people’s kernels instead of using pvgrub(2).
The upgrades for Monday night all completed within their window.
Tuesday 15th:
Problems this night:
- One server had its upgrades completed within the maintenance window but I fat fingered my password without noticing and the emails about the server going up weren’t sent.
- One server had the module line for initramfs missing in grub.conf (a known problem with CentOS) and since we haven’t gotten to adding a playbook entry to fix it, we’ve been hand fixing it when it happens. But there are differences in the partition layouts and the wrong version was being used such that the initramfs was not being found, which delayed the upgrade completing.
- One server had a root partition much smaller than normal and there wasn’t enough room to install a new kernel onto it, so space had to be cleared out on it.
- One server, after being rebooted into the CentOS 6 rescue, was throwing warnings about some physical volumes for LVM not being found. One partition that had originally been in the raid was also not marked as being a raid component after reboot. From the CentOS 5 root file system the physical volumes were identified correctly. Since continuing with CentOS 6 seemed like a risky idea, we rebooted into the original CentOS 5 and used our move playbook to move VMs individually from this server to another server that had enough room. I considered other methods but the move playbook is well tested and I knew it would work, even if it was slower than it could have been by doing a more direct data transfer. People on this server had up to an extra 8 hours of downtime, though it was in the middle of the night for most of the customers and we tried to prioritize moves based on time zone. We have given them a month’s credit already.
- Due to this one server being extremely problematic and requiring attention from multiple people, another server was rescheduled from Tuesday to Wednesday night with less than 24 hours notice.
Two of the servers were completed on time, one was an hour late, and as mentioned one was up to 8 hours late.
Wednesday 16th:
Problems:
- When a drive was replaced on a CentOS 5 host during the maintenance window but prior to it being upgraded, there was a null pointer bug in the kernel and it spontaneously rebooted, so all those people were shut down uncleanly. We hope the same problem does not exist in CentOS 6.
- The pxe boot menu was generally working with a serial console number of ‘0’ even if the linux serial console number was something other than 0. On one host this was not the case and it took a number of reboots before isolating and fixing the problem, since none of the other hosts (even ones hypothetically of the same configuration) had the same issue.
- Someone forgot to start the VMs after completing the upgrade so there was about an additional hour and a half of unnecessary downtime for most people on that host server. We have “canaries” that should be in our monitoring but nobody has gotten to it. It should be easy to generate the configuration files for them.
- One host was going OK during the upgrade but spontaneously shut down during the rescue image portion of the upgrade. I went and looked at the chassis and it had a red blinking light for health. I dug a little but it seemed best to move the drives to a new chassis. After ironing out a few issues with netboot it went smoothly, but it ran over the maintenance window by about 1.5 hours due to the hardware failure.
- The final upgrade went smoothly - it took about an hour and 15 minutes - and completed about 5 minutes before the embargo lifted. This was the one that was rescheduled from Tuesday night. I made the window 4 hours instead of 2 because I wanted to do it as soon as the previously scheduled upgrades had completed.
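On the canaries mentioned in the list above: generating the monitoring configuration could look roughly like the sketch below. This is an illustration only. It assumes the monitoring is nagios and emits standard nagios host object definitions; the canary names, the addresses (from the 192.0.2.0/24 documentation range) and the "generic-host" template name are all hypothetical, not taken from our actual setup.

```python
def render_canary_host(host_name, address, template="generic-host"):
    """Render one nagios host definition for a canary VM.

    The template name is an assumption; use whatever host template
    the local nagios configuration already defines.
    """
    lines = [
        "define host {",
        f"    use        {template}",
        f"    host_name  {host_name}",
        f"    address    {address}",
        "}",
    ]
    return "\n".join(lines) + "\n"


def render_canary_config(canaries):
    """Render definitions for a dict of {host_name: address}, sorted by name."""
    return "\n".join(
        render_canary_host(name, address)
        for name, address in sorted(canaries.items())
    )


if __name__ == "__main__":
    # Hypothetical canaries, one per host server.
    print(render_canary_config({
        "canary-b": "192.0.2.11",
        "canary-a": "192.0.2.10",
    }))
```

With one canary per host server, a canary that stays down after a maintenance window would show up in monitoring instead of waiting for someone to notice the VMs were never started.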
Thursday 17th:
The public facing server for our own infrastructure was upgraded in about half an hour.
CentOS accepted some of our changes for the XSAs: xsa155-qemu44-qdisk-double-access.patch kernel.spec
Action items:
We intend to give a proportional credit (i.e. 1/744 of a month’s payment for 1 hour of downtime) for all of the maintenance outside of the originally scheduled window, with the exception of the server where we already gave a month’s credit. This should be completed within the next week.
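To make the arithmetic concrete: 744 is the number of hours in a 31 day month (31 * 24), so each hour of downtime is worth 1/744 of the monthly payment. The snippet below is a small sketch of that calculation, not our billing code. Rounding to the nearest cent and capping the credit at one month's payment are assumptions added for the example.

```python
from decimal import Decimal, ROUND_HALF_UP

# 31 days * 24 hours, the divisor quoted above.
HOURS_PER_BILLING_MONTH = 744


def downtime_credit(monthly_payment, hours_down):
    """Return the proportional credit for hours_down hours of downtime."""
    payment = Decimal(str(monthly_payment))
    hours = Decimal(str(hours_down))
    credit = payment * hours / HOURS_PER_BILLING_MONTH
    # Assumption: never credit more than one month's payment.
    credit = min(credit, payment)
    # Assumption: round to the nearest cent.
    return credit.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    # A hypothetical 20.00/month plan with 1.5 hours outside the window.
    print(downtime_credit("20.00", 1.5))
```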
We need to investigate the mptsas/swiotlb kernel issue more.
We need to automate changing people’s public keys. This has been a TODO for a long time but usually support is light enough we are able to respond in a reasonable amount of time. But during this, it was very urgent for some people and we were extremely short staffed so sometimes it did not happen in time and people suffered. But there was a hard deadline and so we had to prioritize the needs of the majority of our customers. I highly recommend people check out GPG smart cards or possibly yubikeys so that their public key is not tied to a particular computer.
We should add the already allocated VM canaries to our monitoring.
We intend to write a follow up post discussing the different strategies for decreasing downtime. It won’t be something we can implement in the short term for all our hosts. However, now that all of our servers are running the same operating system, and one that is more modern, this should help us a lot with maintenance and long term goals.
-
Network problems
Sun, 13 Dec 2015 13:11:00 -0800 - Will Crawford
UPDATE 16:05: it appears that our network outage was due to a large DDOS on another customer of our upstream. They say that the DDOS is now mitigated.
We’re currently experiencing some issues with our network and we’re working to resolve them.
-
Downtime for wiki.prgmr.com
Sat, 12 Dec 2015 18:45:00 -0800 - Sarah Newman
Update 19:18 : wiki.prgmr.com is back up.
The wiki will be down for up to an hour. I’ll update this entry when it’s back online.
-
PSA: Please update jenkins
Tue, 24 Nov 2015 15:20:00 -0800 - Sarah Newman
If you are running jenkins, please update it to the latest version and check your logs (including your mail log) for any suspicious activity. A zero-day in jenkins was patched on November 11. If you believe you have been compromised and want help reinstalling, please contact support.
-
Recent Xen Security Advisories
Thu, 29 Oct 2015 10:30:00 -0700 - Sarah Newman
Update 16:01 PDT: In case it wasn’t clear, we are on the pre-disclosure list and all of the reboots to apply patches to public facing machines happened before public disclosure.
Here is a rundown of our vulnerability and response for the xen security advisories released today:
- xsa-145 - Arm only, not affected
- xsa-146 - Arm only, not affected
- xsa-147 - Arm only, not affected
- xsa-148 - This was patched in affected public facing systems. This is high impact - “malicious PV guest administrators can escalate privilege so as to control the whole system.” While the minimum vulnerable version is specified as xen 3.4, I still reviewed the commit specified as being the source of the vulnerability in the attached patches, the patches supplied with the original XSA (which do not apply to Xen 3.4), and additional patches later sent out by someone against 3.4 to verify that the remainder of our systems were not affected.
- xsa-149 - Not vulnerable because we use xl.
- xsa-150 - All publicly facing systems with HVM guests have been patched, though the only such systems are test systems wholly used by prgmr.com.
- xsa-151 - Patched in the affected public facing systems. This is a denial of service attack that we might have caught due to a job that runs each night to find, shut down, and notify us of rebooting guests.
- xsa-152 - Patched in the affected subset of the systems also patched for xsa-148 and xsa-151, not patched in the remainder of the systems. Since the result is a denial of service attack and not a privilege escalation, we will address this as needed, as patching it would have led to loss of service as well. Systems that exploit this vulnerability will be apparent from the log messages.
- xsa-153 - Not vulnerable because we do not use HVM guests, and if we did we would still not be vulnerable - we would not use memory populate-on-demand as we do not oversubscribe ram.
-
Maintenance for prgmr.com services Tues. 27th 22:00 -0700
Mon, 26 Oct 2015 11:18:00 -0700 - Sarah Newman
The following services will be taken down for maintenance Tuesday the 27th at 22:00 PDT -0700:
- billing.prgmr.com
- prgmr.com
- mirror.prgmr.com
- Our ticketing system
I’m allowing 3 hours for the maintenance to complete.
-
High inbound packet loss for lefanu
Sat, 24 Oct 2015 10:58:00 -0700 - Sarah Newman
Update 16:34 -0700: The reason why this happened is that someone, when setting up the nagios monitoring, used default values from some example. It turns out we were not paging until we got to 60% packet loss. Warn at 2% loss and page at 6% loss is way more reasonable. Luke set the nagios thresholds to page when the packet loss exceeds the new, lower thresholds.
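As an illustration of the new thresholds, the decision amounts to the small function below. This is a sketch of the logic only, not our nagios configuration; the function name is made up, and treating a loss exactly equal to a threshold as having reached it is an assumption.

```python
WARN_LOSS_PERCENT = 2.0  # warn at 2% loss
PAGE_LOSS_PERCENT = 6.0  # page at 6% loss


def classify_packet_loss(sent, received):
    """Return "ok", "warn" or "page" for one round of probe packets."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    loss_percent = 100.0 * (sent - received) / sent
    if loss_percent >= PAGE_LOSS_PERCENT:
        return "page"
    if loss_percent >= WARN_LOSS_PERCENT:
        return "warn"
    return "ok"


if __name__ == "__main__":
    # With the old example defaults, 50% loss would not have paged anyone.
    print(classify_packet_loss(100, 50))
```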
Update 13:25 -0700: ipv6 was disrupted at the time we switched ports. ipv6 connectivity should now be restored.
Update: The immediate problem is fixed. Luke changed the physical ports on both sides of the connection and it appears that there’s a problem with the original port in use on lefanu. While there might be a hardware problem, that’s not 100% clear. We haven’t decided what to do about it long term. We’ll put some effort today into figuring out how to tweak our monitoring such that we get paged for a problem like this.
Lefanu is experiencing high inbound packet loss. We are investigating potential physical issues, as there is another machine which should have an identical software and hardware configuration. Affected customers will receive a credit for the downtime.
The larger problem is why our monitoring tools did not alert us; we will be looking into how to add or adjust the thresholds.
-
Downtime on cattle/girdle proceeding as scheduled
Sun, 18 Oct 2015 21:50:00 -0700 - Sarah Newman
UPDATE 2015-10-18 22:15 -0700 PDT: All affected instances should be back up.
We will begin shutting down domains in about 10 minutes.
-
Downtime on cattle/girdle on Sunday, October 18 between 22:00-23:00 -0700 PDT
Tue, 13 Oct 2015 00:00:00 -0700 - Sarah Newman
Our provider will be replacing the reboot (power) switch for cattle and girdle and will need to unplug them and move them into the new reboot switch. They will be calling us when they’re ready to make the switch so hopefully the actual downtime will be limited to about 15 minutes.
-
Santa Clara Data Center on Backup Power
Sat, 10 Oct 2015 12:26:00 -0700 - Sarah Newman
UPDATE 2015-10-10 12:49 -0700 PDT: Power has been restored.
Our data center in Santa Clara has informed us that they are currently using backup power as of 11:45 -0700 PDT. We have not experienced any problems so far, but if things suddenly go down this is why. We will post again if there are any updates.
-
Incoming DDOS Attack
Tue, 06 Oct 2015 13:00:42 -0700 - Sarah Newman
UPDATE 19:54 PDT -0700 (srn): We were down completely for about 20 minutes because I thought the IP was not black holed due to the bgp daemon not being notified of the configuration change. I reloaded the configuration change but this broke the routing table. Eventually I restarted the router and it was back to normal.
Our upstream black holed the offending IP so we should be OK for now. I am really sorry about this. This is the first time we have experienced a DDOS with the current router. I will review our configuration with Luke and with our upstream and try to figure out what I was doing wrong.
UPDATE 19:27 PDT -0700: The blackhole rule did not work properly. I (srn) have contacted our upstream provider for assistance and am continuing to work on it.
UPDATE 18:48 PDT -0700: it has been temporarily resolved; it would have been faster except I (srn) had trouble looking up how to take care of it. We will be following up with the targeted customer.
We are currently experiencing an incoming SSDP attack. The owners are currently working to resolve the situation. We will post updates as we have them.
If you are having issues, or need help with anything please contact support@prgmr.com
