srn: July 2014 Archives

problems with battenberg

UPDATE 11:45 PST (-7): it looks like two disks fell out of the RAID for as-yet undetermined reasons, though the disks are good. Everyone should be coming back up now.
We have as-yet unknown problems with battenberg; remote access is not working, so Luke is heading to the data center right now.

Edit from luke, 11:58 PST:

It's back. It had failed two disks out of the raid10 and froze, as it should have. Upon reboot, it brought one of the disks back and is currently rebuilding, as it should. All guests should be back now.

There are three troubling things about this outage. 

1. How long have the disks been bad? I didn't get an email. (We're still looking for the emails from this morning, and so far I have only found one; there should have been four, even if only one disk, rather than two, had failed this morning.)

2. Are the disks actually bad? SMART reports no reallocated sectors, and usually when a disk fails, SMART shows at least one reallocated sector.

3. The logging console wasn't hooked up and logging. This was an oversight on my part, and a big one. If I had had the logging console hooked up, I would be able to tell you with a reasonable degree of confidence what actually happened. The sad thing is that I had the server configured to use the SoL (serial-over-LAN) console, which is the hard part, but I hadn't configured it in conserver, which does the logging. I will spend some time today making sure I have console logging everywhere.
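
For anyone who wants to repeat the check from point 2 on their own disks, here is a rough sketch. It is not the script used on battenberg. It assumes the SMART data comes from `smartctl -A` (smartmontools) and that the attribute table has the usual ten columns with the raw value last. The sample text and its numbers are made up for illustration, and the names `SAMPLE`, `WATCHED`, `sector_counters` and `suspicious` are hypothetical. It also looks at the pending and offline-uncorrectable counters, since a disk can report zero reallocated sectors and still have sectors waiting to be reallocated.

```python
import re

# Made-up sample in the layout printed by `smartctl -A`.
# These numbers are illustrative only; they are not from battenberg.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   078   078   000    Old_age   Always       -       19542
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

# Attributes that count damaged or unreadable sectors (an assumed watch list).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def sector_counters(text):
    """Return {attribute_name: raw_value} for the watched attributes."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            # The raw value may carry trailing annotations; keep the leading integer.
            match = re.match(r"\d+", fields[9])
            if match:
                counters[fields[1]] = int(match.group())
    return counters


def suspicious(counters):
    """Return the sorted names of watched attributes with a non-zero raw value."""
    return sorted(name for name, raw in counters.items() if raw > 0)


print(suspicious(sector_counters(SAMPLE)))
```

In real use the text would come from running `smartctl -A` against each member disk of the array; an empty list only means these three counters are zero, not that the disk is healthy.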

About this Archive

This page is an archive of recent entries written by srn in July 2014.
