luke: December 2011 Archives

stone down hard

I/O errors on both drives. I'll be heading in to deal with it personally. One drive failed out of the RAID last night, and the other is giving I/O errors (and hanging the box) this morning. I hope it's a backplane issue. Otherwise, I guess it's data recovery time.
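
One way to tell bad disks from a bad backplane is to ask the drives themselves; a quick sketch, assuming smartmontools is on the box (sda stands in for whichever drive is throwing errors):

sh-3.2# smartctl -H /dev/sda        # the drive's overall health self-assessment
sh-3.2# smartctl -l error /dev/sda  # the ATA error log the drive keeps itself

If the drives report healthy but the kernel still logs I/O errors, suspicion shifts to the backplane, cabling, or controller.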

Edit: weird. On the upside, it looks like the data is still there. I'm bringing in spare hardware in case it's a backplane problem or similar bullshit.

Update: We put the drives in the new system and it's starting up now. We replaced the drive that failed first with a new drive, just in case it turns out that it was actually a problem with the hard drive. Stone should be good now.
We also found that we don't know how to pass break (for magic SysRq) through our terminal system, so we need to fix that.
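
For reference, even once a break gets through, SysRq has to be enabled on the kernel side; a rough sketch of checking that end, assuming CONFIG_MAGIC_SYSRQ is compiled in:

sh-3.2# echo 1 > /proc/sys/kernel/sysrq   # enable all SysRq functions
sh-3.2# echo t > /proc/sysrq-trigger      # same effect as break-t: dump task states

Actually sending the break is specific to the terminal system; in GNU screen attached to a serial port, for instance, it's C-a b.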

Anyhow, it isn't coming up after a remote reboot; Nick is heading to the co-lo now (this one is near his house) to deal with the problem in person.

Also, we need to adjust the monitoring system; it didn't register the host as down until we rebooted it, as the box was still responding to ping and giving an SSH banner.
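
Ping and a banner only prove the network stack is alive; the check needs to complete something end to end. A minimal sketch of a deeper probe (host name and timeouts are made-up examples, and timeout(1) needs a reasonably recent coreutils):

sh-3.2# timeout 30 ssh -o BatchMode=yes -o ConnectTimeout=5 stone true && echo up || echo DOWN

The outer timeout matters: a hung box will happily accept the TCP connection and print its banner, then never finish the login.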

Update: It is now rebuilding the RAID mirror in single-user mode and should be done in about an hour; then I will boot it up multiuser again. Unfortunately, the logs don't show any errors before it crashed (and that was at 2:32 last night, so it's been down for a while, and the monitoring fix should help with that). We also don't have console logging here like we do at SVTIX and Market Post Tower, so we don't really know why the filesystem stopped responding. -Nick
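
Where there's no console logger, one stopgap is the kernel's netconsole module, which forwards printk output over UDP to another machine; a sketch with placeholder addresses (source port@ip/interface, then destination port@ip/mac), with the caveat that it only captures what the kernel actually prints, so a silent hang still tells us nothing:

sh-3.2# modprobe netconsole netconsole=6665@10.0.0.1/eth0,6666@10.0.0.2/00:11:22:33:44:55
sh-3.2# # and on the receiving box, something like:
sh-3.2# nc -u -l 6666
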
Update: Stone has booted up in multiuser mode now and the guests are starting. Email support@prgmr.com if you have problems after this. Thanks! -Nick

which is to say, we swapped a hot-swap drive, and it hung during the RAID rebuild.    I suspect that we need to limit the raid rebuild speed, but I don't have hard evidence.   We will research this.
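
If rebuild I/O does turn out to be the culprit, md's resync rate is tunable on the fly; a sketch (10000 is an arbitrary example; units are KiB/sec per device):

sh-3.2# cat /proc/sys/dev/raid/speed_limit_max           # current ceiling, 200000 by default
sh-3.2# echo 10000 > /proc/sys/dev/raid/speed_limit_max  # throttle the rebuild

There is a matching speed_limit_min, the rate md tries to sustain even when the array is busy with normal I/O.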

The annoying thing is that it hangs rather than printing something useful to the console.   

Anyhow, we took the box down and are rebuilding the raid in single user mode.   


sh-3.2# cat /proc/mdstat 
Personalities : [raid1] [raid10] 
md1 : active raid10 sda2[0] sdb2[1] sdc2[2] sdd2[4]
      1048578048 blocks 256K chunks 2 near-copies [4/3] [UUU_]
      [=>...................]  recovery =  6.9% (36467328/524289024) finish=71.9min speed=112928K/sec
      
md0 : active raid1 sdd1[3] sdc1[2] sdb1[1] sda1[0]
      10482304 blocks [4/4] [UUUU]
      
unused devices: <none>
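
For scripting the bring-it-back-up step, mdadm can block until the resync finishes rather than us polling mdstat; a sketch (untested on this box):

sh-3.2# mdadm --wait /dev/md1   # returns once recovery/resync on md1 goes idle
sh-3.2# telinit 3               # then switch back to multiuser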


So figure we'll be back in around an hour and a half.


Edit at 20:40 PST: rebooted; guests are coming back up. I expect no further problems (at least until the next disk fails; hopefully we will have figured out a solution by then).
