luke: November 2012 Archives

It looks like the data for 75% of the domains on boutros is gone. srn is still working on it, and seems to think it's worth her time; meanwhile, I'm going to set everyone up with new domains. We'll get you the data later, but at least for now you will have something.

The remaining 25% have some minor corruption but seem mostly okay, to the extent that we have poked at them.

Of course, you can ask for a refund;  this server mostly houses people who
signed up in the last month or so.  Considering the unacceptable level of
service I have given you, I think it's pretty reasonable for you to demand
a refund and leave.  Email support@prgmr.com and we'll give you a full
refund.

If you are willing to stick around and give me another chance, we will double your RAM and quadruple your disk (you get to keep the upgrade for as long as you continue paying for your current plan). We will also give you a 3-month credit. (We're doing the credits by hand; it's pretty haphazard. If you don't get yours within the next few days, complain to support@prgmr.com.)

I'll post another blog entry as I know more.   If you want to look at the
post-mortem notes srn has made, they are linked below, but it's still
raw stuff.  We should have more definitive 'what happened' answers in
the next few days.

http://wiki.prgmr.com/mediawiki/index.php/20121124boutros_post-mortem


I chose... poorly

Comments (9)
(Well, it's also possible, maybe even likely, that the parts are fine and I somehow screwed up the backplane when I removed it to check it out.)

Anyhow, the story on boutros is that I had a bad drive, I went in to replace it, and 3 other drives failed immediately after I plugged in the new one. I'm force-rebuilding the RAID now; I don't anticipate significant data loss.
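If you're curious what the force-rebuild actually involves, it's roughly the following; this is a sketch only, and the device names are placeholders rather than boutros's real member list:

#!/usr/bin/env python
# Sketch of force-assembling a Linux md RAID10 whose members were kicked out
# by a transient failure (backplane/controller) rather than actual dead disks.
# Device names are placeholders, not boutros's real layout.
import subprocess

MD = "/dev/md0"
MEMBERS = ["/dev/sda2", "/dev/sdb2", "/dev/sdc2", "/dev/sdd2",
           "/dev/sde2", "/dev/sdf2", "/dev/sdg2", "/dev/sdh2"]

# Stop whatever partial assembly happened at boot.
subprocess.call(["mdadm", "--stop", MD])

# --force tells mdadm to accept members with slightly stale event counts,
# which is what you want when the drives themselves are probably fine.
subprocess.check_call(["mdadm", "--assemble", "--force", MD] + MEMBERS)

# Then watch the resync progress.
print(open("/proc/mdstat").read())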

Uh, yeah. So I won't ever use these backplanes again (this is the first one in production... well, I have one in a dedicated server, but that's less important. There are only two drives there, and mirrors are more resistant to this sort of thing than RAID10 setups.)
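To put a very rough number on 'more resistant' (back-of-the-envelope only, nothing measured on our hardware): if a flaky backplane drops several drives at once, a RAID10 dies as soon as both halves of any one mirror pair are gone, while a two-drive mirror only has the one pair to worry about.

# Back-of-the-envelope: chance an 8-disk RAID10 (4 mirror pairs) survives if
# a flaky backplane drops k random member disks at once.  The array dies as
# soon as both halves of any one pair are gone.
from itertools import combinations

def raid10_survival(pairs, k):
    disks = range(2 * pairs)
    outcomes = list(combinations(disks, k))
    alive = sum(1 for failed in outcomes
                if not any(2 * i in failed and 2 * i + 1 in failed
                           for i in range(pairs)))
    return alive / float(len(outcomes))

for k in (1, 2, 3):
    print("8-disk RAID10, %d disks dropped: survives %.0f%% of the time"
          % (k, 100 * raid10_survival(4, k)))
# A 2-disk mirror always survives one dropped drive, and it only exposes two
# drives to the backplane in the first place.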

For now, boutros is in its old chassis (and still using that backplane). I may swap it into a new chassis, or I may not; right now the priority is getting the RAID rebuilt.

Update: 03:29 local time. The rebuild finished, then it says:

raid10: Disk failure on sde2, disabling device.

upon reboot, it again only sees 5/8 devices.
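(The 5/8 count is just the kernel's view of the array; something like this shows it, with the array name and expected member count assumed for illustration:)

# Check how many member devices the kernel actually sees in an md array.
# Array name and expected member count are assumptions for illustration.
import re, subprocess

EXPECTED = 8
detail = subprocess.check_output(["mdadm", "--detail", "/dev/md0"])
m = re.search(r"Active Devices\s*:\s*(\d+)", detail.decode("utf-8", "replace"))
print("md0 sees %s of %d members" % (m.group(1) if m else "?", EXPECTED))
# /proc/mdstat shows the same thing more tersely, e.g. "[8/5] [UU_U_U_U]".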

I'm wiped out. At this point I'm more likely to cause data loss than anything else. Nick has left, so this will wait until I've slept; then I'll head in and deal with it.

In the morning, my plan is to remove the backplane and connect the disks directly (with a bunch of spares, too) and see if that helps; I think it will.


Update: 09:31 local time.

I'm up. I'm going to go remove that backplane now.

Update: 13:20.

The disks are all in a brand-new motherboard/chassis; I'm bringing it all back to 250 Stockton and racking it up now.

So yeah, uh, hamper's network hung with the above error. The kernel was terrifyingly out of date, so I upgraded to the latest CentOS 5.8 kernel-xen and rebooted; well, then we had

ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen

...  


and the thing hung for a few minutes before it decided to kick the bad disk out of the RAID. It did, eventually, and now guests are coming back up. The box should be fine now (of course, I still need to replace the bad drive).


Edit: the thing froze up again while I ran some more SMART tests on the bad drive before failing it out of md0. Sorry. Now it's failed out of md0, so it should be good. I'll go replace the drive later.
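For reference, that dance (SMART-check the suspect drive, then fail its member out of md0 so the array stops touching it) looks roughly like this; the device names are placeholders, not the actual drive:

# Roughly the sequence above: SMART-check the suspect drive, then mark its
# md member faulty and remove it, since the box kept hanging while the bad
# drive was still in the array.  /dev/sdX is a placeholder.
import subprocess

BAD_DISK = "/dev/sdX"     # whole disk, for smartctl
BAD_MEMBER = "/dev/sdX2"  # the partition that is a member of md0

subprocess.call(["smartctl", "-H", BAD_DISK])           # overall health
subprocess.call(["smartctl", "-t", "short", BAD_DISK])  # kick off a short self-test

subprocess.check_call(["mdadm", "/dev/md0", "--fail", BAD_MEMBER])
subprocess.check_call(["mdadm", "/dev/md0", "--remove", BAD_MEMBER])

# Once the replacement drive is in and partitioned to match:
#   mdadm /dev/md0 --add /dev/sdX2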

that was a little frightening

Comments (0)
Billing is back up (along with signups); report any anomalies.

some lessons need to be learned twice

Comments (2)
I used to zip-tie all the power cords to the servers, so that you can't pull a power cord out by accident. I started doing that because I had accidentally unplugged a server.

Well, several years later, I don't put zip ties on my power cords anymore. So, tonight? I slid birds/stables in... and disconnected Gladwynn.


I need to have a new power cord retention policy.  Ouch. 

About this Archive

This page is an archive of recent entries written by luke in November 2012.

luke: October 2012 is the previous archive.

luke: December 2012 is the next archive.

Find recent content on the main index or look in the archives to find all content.