knife reboot

Last night knife stopped responding, so this morning we rebooted it along with the domUs running on it. Total downtime was 12 hours. On the serial console, we couldn't log in to the dom0, and the dom0 kernel didn't respond to a break signal for the magic SysRq. The hypervisor did respond to its escape code (Ctrl-A three times), so we were able to reboot the system from the hypervisor.

We don't know yet what caused the problem, but everyone on knife will get a free month. We are also planning to improve our monitoring with Nagios, and maybe add a pager system so that we are notified more effectively.


If it helps, the server was very sluggish before it went down.
