discussion of fisher downtime

Last night (3am our time) fisher rebooted itself, and after that reboot it was no longer running xen. I then rebooted fisher into xen because it seemed like the right thing to do at the time.

I was not paged for the original reboot because xen rebooted the machine automatically and the amount of time it was unavailable was less than the alert cutoff. We will be monitoring a canary guest on fisher in addition to fisher itself, so if this happens again I should get paged.
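For what it's worth, here is a rough sketch in Python of the kind of canary check we have in mind; the hostname, the timing numbers, and the page() body are placeholders, not our actual setup. The idea is to ping a guest running on fisher every few seconds and page once it has been unreachable past a short cutoff, even if the host comes back on its own:

#!/usr/bin/env python3
# Canary check: page if a guest on fisher stays down past a short cutoff.
# CANARY and page() are placeholders for the real guest and pager.
import subprocess
import time

CANARY = "canary.example.com"   # a guest domain running on fisher
INTERVAL = 15                   # seconds between pings
CUTOFF = 60                     # page after this many seconds down

def is_up(host):
    # One ICMP echo with a 2-second timeout (Linux iputils ping).
    return subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ).returncode == 0

def page(msg):
    print("PAGE:", msg)  # placeholder: wire this up to the real pager

down_since = None
paged = False
while True:
    if is_up(CANARY):
        down_since, paged = None, False
    else:
        down_since = down_since or time.time()
        if not paged and time.time() - down_since >= CUTOFF:
            page("%s unreachable for %ds -- check fisher" % (CANARY, CUTOFF))
            paged = True
    time.sleep(INTERVAL)

Watching a guest rather than just the host also catches the failure we actually had: fisher itself came back quickly, but it came back without xen, so the guests stayed down.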

There were these errors in the log last night:

Feb  1 00:43:30 fisher kernel: xend[32715] general protection rip:2af0a12bda2e rsp:7fffa1954b90 error:7ffc
...
[Sun Feb  1 00:43:30 2015] Bad pte = 8100000040800165, process = ???, vm_flags = 100073, vaddr = 3eb7531000

Call Trace:
 [<ffffffff8020cf43>] vm_normal_page+0xe9/0xfd
 [<ffffffff80207860>] unmap_vmas+0x699/0xc68
 [<ffffffff8023b7dc>] exit_mmap+0x8f/0x10b
 [<ffffffff8023de91>] mmput+0x30/0x82
 [<ffffffff8021611d>] do_exit+0x2e5/0x91f
 [<ffffffff8024b11d>] cpuset_exit+0x0/0x88
 [<ffffffff8022bfcd>] get_signal_to_deliver+0x477/0x4aa
[Sun Feb  1 00:43:30 2015] [<ffffffff8025d368>] do_notify_resume+0x9c/0x7ba
 [<ffffffff80297fb3>] force_sig_info+0xae/0xb9
 [<ffffffff80268081>] do_page_fault+0x12c5/0x131b
 [<ffffffff8026fa63>] monotonic_clock+0x35/0x7b
 [<ffffffff80262df5>] thread_return+0x6c/0x113
 [<ffffffff8026076a>] retint_signal+0x5d/0xb7

swap_free: Bad swap offset entry b284022000000
swap_free: Bad swap offset entry 2a18a02a000000
...
Call Trace:
[Sun Feb  1 01:25:17 2015] <IRQ>  [<ffffffff802299ae>] __kfree_skb+0x9/0x1a
 [<ffffffff8885e36f>] :netbk:net_rx_action+0x7f1/0x8e2
 [<ffffffff88228286>] :e1000e:e1000e_poll+0x2ce/0x38b
 [<ffffffff8029299e>] tasklet_action+0x97/0x13b
 [<ffffffff80212f3a>] __do_softirq+0x8d/0x13b
 [<ffffffff80260da4>] call_softirq+0x1c/0x278
 [<ffffffff8026eb41>] do_softirq+0x31/0x90
 [<ffffffff802608d6>] do_hypervisor_callback+0x1e/0x2c
 <EOI>  [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
[Sun Feb  1 01:25:17 2015] [<ffffffff80270026>] raw_safe_halt+0x87/0xab
 [<ffffffff8026d52b>] xen_idle+0x38/0x4a
 [<ffffffff8024b23c>] cpu_idle+0x97/0xba
 [<ffffffff80764b11>] start_kernel+0x21f/0x224
 [<ffffffff807641e5>] _sinittext+0x1e5/0x1eb


Code: 8b 43 08 85 c0 7f 26 48 c7 c1 17 2e 4a 80 ba 99 00 00 00 48
RIP  [<ffffffff80425173>] skb_release_head_state+0x12/0xf8
 RSP <ffffffff807a1e20>
[Sun Feb  1 01:25:17 2015] <0>Kernel panic - not syncing: Fatal exception
 (XEN) Domain 0 crashed: rebooting machine in 5 seconds.
-----
 
Further research made a hardware error seem more and more likely. I had not checked the BIOS error log at 3am, and dmidecode was not able to show it. The kernel we are running does not have an EDAC module that works with this chipset, so EDAC could not be used to monitor for memory errors. I did not feel comfortable leaving the server running without checking the logs, so we rebooted to look at the BIOS log; no error was shown there.
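On a kernel where a matching EDAC driver does load, the memory-error counters show up under /sys/devices/system/edac/ and are easy to watch. For reference, here is a minimal sketch in Python of what that monitoring could look like; the sysfs paths are the standard EDAC interface, and the alert hook is a placeholder:

#!/usr/bin/env python3
# Poll EDAC memory-error counters in sysfs. Requires a kernel whose EDAC
# driver supports the chipset, which fisher's current kernel lacks.
import glob
import os
import time

EDAC_GLOB = "/sys/devices/system/edac/mc/mc*"

def read_count(path):
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (IOError, OSError, ValueError):
        return 0

def snapshot():
    counts = {}
    for mc in glob.glob(EDAC_GLOB):
        counts[os.path.basename(mc)] = (
            read_count(os.path.join(mc, "ce_count")),  # correctable errors
            read_count(os.path.join(mc, "ue_count")),  # uncorrectable errors
        )
    return counts

previous = snapshot()
while True:
    time.sleep(60)
    current = snapshot()
    for mc, (ce, ue) in current.items():
        old_ce, old_ue = previous.get(mc, (0, 0))
        if ce > old_ce or ue > old_ue:
            # Placeholder: hook the real pager in here.
            print("%s: ce=%d (+%d) ue=%d (+%d)"
                  % (mc, ce, ce - old_ce, ue, ue - old_ue))
    previous = current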

Normally we would just move the hard drives to a different chassis if there were any question about the hardware. However, our new standard server has fewer drive bays than fisher has disks, so this was not possible.

There was one chassis with enough drive bays free. However, we had just put a new RAID controller in it, which had not yet been configured and which we are not familiar with, so I didn't want to muck with it. There was also a live server with enough free drive bays, but it was not on the same network, so it could only have been used to read data off fisher's drives, not as a host for fisher's guests.

The shortest path to getting everyone back up seemed to be to run memtest while concurrently provisioning a new server in case one was needed. An hour of memtest did not show any issues. I considered leaving the server down and moving everyone one by one, but that would have resulted in up to about 7 hours of downtime for the last guest. I could not justify that much downtime based on the available evidence, so fisher has been brought back up.
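The 7-hour figure is just serial-migration arithmetic: moving guests one at a time, the last guest in the queue is down while every guest ahead of it is copied. A toy version of the estimate, with an illustrative guest count and per-guest time rather than our exact numbers:

# Back-of-envelope for evacuating guests one at a time. Both numbers
# below are assumptions for illustration, not fisher's actual figures.
guests = 10
minutes_per_guest = 42   # copy + reconfigure + boot, per guest

print("last guest down ~%.1f hours" % (guests * minutes_per_guest / 60.0))  # ~7.0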

We will still be moving people off of fisher, but we will try to do so in a fashion that minimizes overall and unexpected downtime. We will contact you before any move.
