srn: February 2015 Archives

Downtime on nagios and wiki tomorrow

Nagios and wiki will be down sometime within the window of Sat Feb 28 16:00-19:00 PST -0800.
We are making infrastructure changes over the next few days and are contacting customers directly regarding such changes.

If you update your contact information, please also inform support so that you do not miss any major news.

Update on fisher

One move is in progress; two of the original 50+ remain.

The motivation for bringing the system down was that, during a planned move, the md5sums of the source and destination data did not match. After verifying the instance was down and rerunning md5sum twice locally, the checksums still did not match across any of those four runs.

The total amount of data md5sum'ed was around 200GB, and a single bit error anywhere in it would have changed the checksum. This does not indicate mass corruption, but I wanted to minimize the risk that bad data would be marked clean, which is why the power got pulled.

After bringing up the drives in a different chassis, md5sum was run twice; both runs matched each other and also matched one of the md5sums collected on fisher, which gave us sufficient confidence that the data was OK to move.
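
For anyone curious what this kind of check looks like, here is a minimal sketch of comparing a source and a destination copy by checksum; the paths and chunk size are hypothetical examples, not our actual tooling:

# Sketch: hash a source volume and its destination copy in chunks and
# compare the digests. Paths below are hypothetical examples.
import hashlib

def md5_of(path, chunk=8 * 1024 * 1024):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk), b''):
            h.update(block)
    return h.hexdigest()

src = md5_of('/dev/vg0/guest-disk')          # hypothetical source volume
dst = md5_of('/mnt/newhost/guest-disk.img')  # hypothetical destination copy
print('match' if src == dst else 'MISMATCH: %s vs %s' % (src, dst))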

We are performing a memory test on the old fisher box and will follow up with other tests to try to understand what happened and to see whether it can be detected more easily in the future.

seashell is down

UPDATE 21:06 PST: All customers should be back up. Total downtime was approximately 1 hour 20 minutes.
---
UPDATE 20:56 PST -0800: There are no error messages in syslog indicating the reason for the reboot. The drives have been moved to a different chassis with a very similar processor. We will be bringing customers back up shortly.
---
Luke is on his way to the data center to investigate. 18 customers are affected.

pvgrub updated

This should take care of the following error:

xc: error: panic: xc_dom_bzimageloader.c:699: xc_dom_probe_bzimage_kernel unable to XZ decompress
kernel: Invalid kernel
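
The error means the loader could not XZ-decompress the kernel payload inside the bzImage. If you want to see which compression your kernel image uses, something along these lines works; the magic-byte list and default path are my own assumptions, in the spirit of the kernel's extract-vmlinux script:

# Sketch: scan a kernel image for known compression magic bytes to see how
# its payload is compressed. The default path is only an example.
import sys

MAGICS = {
    b'\x1f\x8b\x08': 'gzip',
    b'\xfd7zXZ\x00': 'xz',
    b'BZh':          'bzip2',
    b'\x02!L\x18':   'lz4',
}

path = sys.argv[1] if len(sys.argv) > 1 else '/boot/vmlinuz'
data = open(path, 'rb').read()
found = [name for magic, name in MAGICS.items() if magic in data]
print('%s: %s' % (path, ', '.join(found) or 'no known compression magic found'))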

Please email support if you run into any issues.

discussion of fisher downtime

Last night (3am our time) fisher rebooted itself, and after that reboot it was not running xen. I then rebooted fisher into xen because it seemed like the right thing to do at the time.

I was not paged for the original reboot because xen rebooted it automatically and the amount of time the machine was unavailable was less than the alert cutoff. We will be monitoring a canary on fisher in addition to fisher itself, so if this happens again I should get paged.
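
As an illustration of the canary idea, a Nagios-style check along these lines goes critical whenever the machine it runs on has only been up for a short while, so a quick unattended reboot still generates a page; the 30-minute threshold is an example value, not our actual configuration:

#!/usr/bin/env python
# Sketch: page if the local uptime is suspiciously short, i.e. the box
# rebooted recently even if it came back before a host-down alert fired.
# The threshold is an example value.
import sys

THRESHOLD = 30 * 60  # seconds

with open('/proc/uptime') as f:
    uptime = float(f.read().split()[0])

if uptime < THRESHOLD:
    print('CRITICAL: rebooted %.0f seconds ago' % uptime)
    sys.exit(2)  # Nagios CRITICAL
print('OK: up %.0f seconds' % uptime)
sys.exit(0)      # Nagios OK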

There were these errors in the log last night:

Feb  1 00:43:30 fisher kernel: xend[32715] general protection rip:2af0a12bda2e rsp:7fffa1954b90 error:7ffc
...
[Sun Feb  1 00:43:30 2015] Bad pte = 8100000040800165, process = ???, vm_flags = 100073, vaddr = 3eb7531000

Call Trace:
 [<ffffffff8020cf43>] vm_normal_page+0xe9/0xfd
 [<ffffffff80207860>] unmap_vmas+0x699/0xc68
 [<ffffffff8023b7dc>] exit_mmap+0x8f/0x10b
 [<ffffffff8023de91>] mmput+0x30/0x82
 [<ffffffff8021611d>] do_exit+0x2e5/0x91f
 [<ffffffff8024b11d>] cpuset_exit+0x0/0x88
 [<ffffffff8022bfcd>] get_signal_to_deliver+0x477/0x4aa
[Sun Feb  1 00:43:30 2015] [<ffffffff8025d368>] do_notify_resume+0x9c/0x7ba
 [<ffffffff80297fb3>] force_sig_info+0xae/0xb9
 [<ffffffff80268081>] do_page_fault+0x12c5/0x131b
 [<ffffffff8026fa63>] monotonic_clock+0x35/0x7b
 [<ffffffff80262df5>] thread_return+0x6c/0x113
 [<ffffffff8026076a>] retint_signal+0x5d/0xb7

swap_free: Bad swap offset entry b284022000000
swap_free: Bad swap offset entry 2a18a02a000000
...
Call Trace:
[Sun Feb  1 01:25:17 2015] <IRQ>  [<ffffffff802299ae>] __kfree_skb+0x9/0x1a
 [<ffffffff8885e36f>] :netbk:net_rx_action+0x7f1/0x8e2
 [<ffffffff88228286>] :e1000e:e1000e_poll+0x2ce/0x38b
 [<ffffffff8029299e>] tasklet_action+0x97/0x13b
 [<ffffffff80212f3a>] __do_softirq+0x8d/0x13b
 [<ffffffff80260da4>] call_softirq+0x1c/0x278
 [<ffffffff8026eb41>] do_softirq+0x31/0x90
 [<ffffffff802608d6>] do_hypervisor_callback+0x1e/0x2c
 <EOI>  [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
[Sun Feb  1 01:25:17 2015] [<ffffffff80270026>] raw_safe_halt+0x87/0xab
 [<ffffffff8026d52b>] xen_idle+0x38/0x4a
 [<ffffffff8024b23c>] cpu_idle+0x97/0xba
 [<ffffffff80764b11>] start_kernel+0x21f/0x224
 [<ffffffff807641e5>] _sinittext+0x1e5/0x1eb


Code: 8b 43 08 85 c0 7f 26 48 c7 c1 17 2e 4a 80 ba 99 00 00 00 48
RIP  [<ffffffff80425173>] skb_release_head_state+0x12/0xf8
 RSP <ffffffff807a1e20>
[Sun Feb  1 01:25:17 2015] <0>Kernel panic - not syncing: Fatal exception
 (XEN) Domain 0 crashed: rebooting machine in 5 seconds.
-----
 
Doing further research, it seemed more and more likely that there could be a hardware error. I had not checked the BIOS error log at 3am, and dmidecode was not able to show the error log. The kernel we are running does not have an EDAC module that works with this chipset, so EDAC could not be used to monitor for errors. I did not feel comfortable leaving the server running without checking the logs, so we rebooted to look at the BIOS log; no error was shown there.
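
For what it's worth, on hardware where the kernel does have a working EDAC driver, the corrected/uncorrected error counters can be polled from sysfs; here is a minimal sketch of that kind of monitoring (these are the standard EDAC sysfs paths, but again this was not available on fisher's kernel):

# Sketch: read EDAC memory-controller error counters from sysfs. These files
# only exist when an EDAC driver for the chipset is loaded, which was not the
# case on fisher.
import glob, os

for mc in sorted(glob.glob('/sys/devices/system/edac/mc/mc*')):
    counts = {}
    for name in ('ce_count', 'ue_count'):  # corrected / uncorrected errors
        path = os.path.join(mc, name)
        if os.path.exists(path):
            with open(path) as f:
                counts[name] = int(f.read().strip())
    print(os.path.basename(mc), counts)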

Normally we would just move the hard drives to a different chassis if there was any question about the hardware. However, our new standard server has fewer drive bays than fisher has disks, so this was not possible.

There was a chassis with enough drive bays free; however, it had just had a new RAID controller put in, which had not been configured and which we are not yet familiar with, so I didn't want to muck with it. There was also a live server with enough free drive bays, but it was not located on the same network, so it could only have been used to read data off fisher's drives, not to host fisher's guests.

The shortest path to getting everyone back up seemed to be to run memtest and to concurrently provision a new server in case it turned out to be necessary. An hour of memtest did not show any issues. I considered leaving the server down and moving everyone one by one; however, that would have resulted in up to about 7 hours of downtime for the last guest. I could not justify that amount of downtime based on the available evidence, so fisher has been brought back up.

We will still be moving people off of fisher but will try to do so in a fashion that minimizes overall/unexpected downtime. We will contact you before any move.

more downtime on fisher

After investigating, there is a high probability (though not 100% certainty) that there is a memory problem on fisher. While the probability of any one customer experiencing an error is extremely low, I am bringing down all the customers in order to prevent the possibility. We are in the process of identifying alternative space and will send an update when we have done so.

About this Archive

This page is an archive of recent entries written by srn in February 2015.
