January 2012 Archives

mares crashed

This morning mares crashed with this message on the console:
(XEN) MCE: The hardware reports a non fatal, correctable incident occured on CPU 0.
(XEN) Bank 4: dc0c4000fe080813[c008000401000000] at        363fe9000
so it does seem to have been mostly fatal after all. I was able to switch console input to the hypervisor from the dom0 kernel and gave the command to reboot, but it didn't seem to do anything, so after a few minutes I reset the power port. It's booting up again now. We haven't put any new customers on mares for a long time, and we are hoping to get rid of it soon! -Nick

rebuilding disk on horn

We are rebuilding a disk on horn. You should see no downtime, but performance will be degraded for the next day or so.

stone reboot again

stone.prgmr.com had another hard drive fail, and the drive wasn't dropped from the raid; the system just hung instead. We rebooted it and will be replacing the failed drive later today or on Saturday.
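For reference, replacing a failed member of a Linux software raid usually looks roughly like the following. This is a generic sketch, not the exact commands used on stone; /dev/md0 and /dev/sdb1 are placeholder names.

```shell
# Hypothetical sketch: swapping out a failed member of an md array.
# /dev/md0 and /dev/sdb1 are placeholders, not stone's real devices.
mdadm --manage /dev/md0 --fail /dev/sdb1     # mark the dying member as failed
mdadm --manage /dev/md0 --remove /dev/sdb1   # drop it from the array
# ...power down (or hot-swap), install the replacement drive, and recreate
#    its partition table to match the surviving drive...
mdadm --manage /dev/md0 --add /dev/sdb1      # add the new member
cat /proc/mdstat                             # watch the rebuild progress
```

Note that in stone's case the kernel hung instead of failing the drive out of the array, which is why the manual `--fail` step never got a chance to matter.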

DDoS attack

Starting at about 8pm Pacific time, a VPS on taney was the target of a DDoS attack. Our upstream provider, EGI Hosting, blocked it, and we have notified the customer.

branch rebooted.

It was the old forcedeth issue, but I screwed up the recovery procedure[1] and had to reboot uncleanly. Users are coming back online now, and we should be good.

update: everyone is back.


another bad drive in hydra

The drive we used to replace the bad drive in hydra the other day is itself unambiguously bad; I'm getting SMART errors all over the place. According to SMART it tested okay before we put it in, but it hadn't undergone an 'extended' SMART test lately, which means I hadn't run it through burn-in.

If I remember right, this drive came out of a decommissioned dedicated server - a server that was working fine, so I had no reason to believe the drive was bad. But I really do need to test everything I put into the spare pool, and testing means being thorough: I like several runs of badblocks -w, /then/ an 'extended' SMART test. When in doubt, throw it out; if badblocks /or/ SMART reports an error, the drive gets warrantied. (The warranty policy on these 'enterprise' drives means you can get drives you have doubts about replaced for nothing more than the cost of shipping.)
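Scripted, the burn-in described above might look something like this. A sketch only, not our actual burn-in script; /dev/sdX is a placeholder, and badblocks -w destroys all data on the drive.

```shell
#!/bin/sh
# Rough burn-in sketch: several destructive badblocks passes, then an
# extended SMART self-test.  DRIVE is a placeholder -- this WIPES the disk.
DRIVE=/dev/sdX
for pass in 1 2 3; do
    badblocks -w -s "$DRIVE" || { echo "badblocks failed: warranty it"; exit 1; }
done
smartctl -t long "$DRIVE"      # kick off the extended self-test
# smartctl -a "$DRIVE" reports how long the test needs; check back after that.
smartctl -l selftest "$DRIVE"  # any failure logged here -> warranty the drive
```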

I believe the lesson here is that we need better spare pool hygiene and discipline.  Whenever a disk comes out of production, it needs to be tested before going to the spare pool.  

Unresolved question: do new drives, and drives refurbished by the factory (what we get when we return a drive under warranty), need to be burnt in before going to the spare pool? Obviously we burn them in when putting them into fresh servers, so perhaps yes? But I don't remember the last time a new drive failed burn-in (though quite often they fail within the first month of production... so maybe I just need to make my burn-in longer).

I don't expect downtime. I have reduced the minimum raid rebuild speed, which I theorise may solve our "servers crash during raid rebuild" problems. I'll write more about that later.
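On Linux the knob in question is the md driver's minimum rebuild speed, exposed under /proc/sys. The specific value below is an assumption for illustration, not necessarily what we set:

```shell
# Show the current minimum rebuild speed (KB/s per device; kernel default 1000)
cat /proc/sys/dev/raid/speed_limit_min
# Lower it so the rebuild yields to normal guest I/O (200 is an example value)
echo 200 > /proc/sys/dev/raid/speed_limit_min
```

The trade-off is a slower rebuild in exchange for the array spending less time starving the domUs of I/O.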

Update: The raid rebuild seems to have gotten stuck, with a very high load average and the cpu occupied by software interrupts, so we are going to reboot hydra into single user mode. -Nick
It was a little more trouble than it should have been; I guess grub wasn't installed on the new drive, so I had to pxeboot into rescue mode to install grub and then boot into single user mode. The raid is now rebuilding properly, though! One thing that confused me about the CentOS 6 rescue mode: its "mount" and "umount" executables live under /mnt/runtime, so if I mount the server's root filesystem on /mnt the rescue environment gets very confused. Once I figured this out, I made a /mnt2 and mounted it there.
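The rescue-mode dance went roughly like this. The device names are placeholders, not hydra's actual layout:

```shell
# From the CentOS 6 rescue environment: mount the real root somewhere other
# than /mnt, since the rescue image keeps its own tools under /mnt/runtime.
mkdir /mnt2
mount /dev/md0 /mnt2                 # /dev/md0 stands in for the root array
mount --bind /dev /mnt2/dev          # grub-install needs the device nodes
chroot /mnt2 grub-install /dev/sda   # reinstall grub on the new drive's disk
```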
Update: hydra should be all back now! Email us at support@prgmr.com if you have trouble. Thanks!

hydra.prgmr.com is down

Sometime last night hydra stopped responding; it had also had a disk error on Friday. I have rebooted it into single user mode and the raid is rebuilding, which might be a consequence of the earlier disk error. When the rebuild finishes it may be able to boot properly again, or we may replace some of its hardware.
Update: We have replaced a hard drive in hydra, and rebuilt the drives, and the guests are starting up again now.

hydra is back.