outage: March 2010 Archives

knife reboot

Last night knife stopped responding. This morning we rebooted it, along with the domUs running on it; total downtime was 12 hours. On the serial console we couldn't log in to the dom0, and the dom0 kernel didn't respond to a break signal for the magic SysRq. The hypervisor did respond to its escape code (ctrl-a three times), so we were able to reboot the system from the hypervisor.

We don't know yet what caused the problem, but everyone on knife will get a free month. We are also planning to improve our monitoring with Nagios, and maybe add a pager system so that we get notified more effectively.
See http://book.xen.prgmr.com/mediawiki/index.php/Peth0:_too_many_iterations_%286%29_in_nv_nic_irq_rx. for details.
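
For anyone curious what a Nagios-style check looks like, here is a minimal sketch in Python. It is not our actual monitoring configuration. The function names, the thresholds, and the choice of a plain TCP connect test are all assumptions made for illustration. The only fixed part is the standard plugin convention of returning 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN. A dom0 that hangs the way knife did may or may not still accept TCP connections, so a real setup would likely need more than this.

```python
import socket
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(host, port, warn_secs=1.0, timeout=5.0):
    """Try one TCP connection and return (exit_code, status_line).

    warn_secs and timeout are illustrative defaults, not tuned values.
    """
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts and
        # DNS failures) if the connection cannot be established.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
    except OSError as exc:
        return CRITICAL, f"CRITICAL - {host}:{port} unreachable ({exc})"
    if elapsed > warn_secs:
        return WARNING, f"WARNING - {host}:{port} answered in {elapsed:.2f}s"
    return OK, f"OK - {host}:{port} answered in {elapsed:.2f}s"


def main(argv):
    """Hypothetical command-line wrapper: argv is [prog, host, port]."""
    if len(argv) != 3:
        print(f"UNKNOWN - usage: {argv[0]} HOST PORT")
        return UNKNOWN
    try:
        port = int(argv[2])
    except ValueError:
        print(f"UNKNOWN - port must be an integer, got {argv[2]!r}")
        return UNKNOWN
    code, message = check_tcp(argv[1], port)
    print(message)
    return code
```

To run it as a plugin, the file would end with `import sys` and `sys.exit(main(sys.argv))`, and Nagios would call it with a host and a port (say, the dom0's ssh port). Nagios reads the exit code to decide whether to send a notification, which is where a pager would hook in.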

About this Archive

This page is an archive of entries in the outage category from March 2010.

outage: December 2009 is the previous archive.

outage: May 2010 is the next archive.

Find recent content on the main index or look in the archives to find all content.