I chose... poorly

(Well, it's also possible, maybe even likely, that the parts are fine and I somehow screwed up the backplane when I removed it to check it out.)

Anyhow, the story on boutros is that I had a bad drive, I went in to replace it, and three other drives failed immediately after I plugged the new one in. I'm force-rebuilding the raid now; I don't anticipate significant data loss.
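
For the curious, "force-rebuilding" means telling mdadm to put the array back together even though the event counters on the kicked disks no longer match the rest. Roughly something like the following sketch; the device names are this box's, but the exact invocation may have differed:

mdadm --stop /dev/md1
# --force accepts members whose event counters are slightly stale;
# --run starts the array even if it comes up degraded
mdadm --assemble --force --run /dev/md1 /dev/sd[a-h]2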

Uh, yeah. So I won't ever use these backplanes again. (This is the first one in production... well, I have one in use as a dedicated server, but that's less important. Only two drives there, and mirrors are more resistant to this sort of thing than raid10 setups.)

For now, boutros is in its old chassis (and still using that backplane). I may swap it into a new chassis, or I may not. The priority is getting the RAID rebuilt.

Update: 03:29 local time: the rebuild finished, then it says:

raid10: Disk failure on sde2, disabling device.

Upon reboot, it again only sees 5/8 devices.
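
The way to see why md refuses a member is to compare the superblocks on each disk; a disk whose event counter lags the rest got kicked at some point. A sketch, assuming stock mdadm:

# dump the event counter and state from each member's superblock
for d in /dev/sd[a-h]2; do
  echo "== $d"; mdadm --examine "$d" | egrep 'Events|State'
done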

I'm wiped out. At this point I'm more likely to cause data loss than anything else. Nick has left, so this will wait until I've slept; then I'll head in and deal with it.

In the morning, my plan is to remove the backplane and connect the disks directly (with a bunch of spares, too) and see if that helps; I think it will.


Update: 09:31 local time:

I'm up. I'm going to go remove that backplane now.

Update: 13:20 local time:

The disks are all in a brand-new motherboard/chassis; I'm bringing it all back to 250 Stockton and racking it up now.

9 Comments

What time zone are you in? Our guess is the morning of the 23rd for a fix?

Luke is in the Pacific time zone.

Gah. Okay, so I think we have the problem about licked. (I need a new new-model server now; I completely cannibalized that one.) But it was a bad idea for me to try to reuse the original hardware anyhow. You are on much better hardware now.

I would say we've got another hour or two.

Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid1 sdf1[7] sdg1[6] sdh1[5] sde1[4] sdc1[3] sdd1[2] sda1[0] sdb1[1]
10482304 blocks [8/8] [UUUUUUUU]

md1 : active raid10 sdg2[8](S) sdd2[9](S) sda2[10] sdb2[1] sde2[6] sdf2[5] sdc2[3] sdh2[2]
1911605248 blocks 256K chunks 2 near-copies [8/5] [_UUU_UU_]
[>....................] recovery = 3.3% (15826880/477901312) finish=66.4min speed=115925K/sec

unused devices: <none>

So yeah, it looks like it's adding drives to the raid one at a time. It started over a bit ago and added another.
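
To decode the mdstat above: [8/5] [_UUU_UU_] means five of the eight members are up, and the (S) suffix marks disks sitting as spares waiting to be rebuilt onto; md rebuilds them into the empty slots one at a time. The finish estimate is just remaining blocks over speed: (477901312 - 15826880) KB at ~115925 KB/sec is about 66 minutes. Getting a kicked disk back in is something like this sketch (sdd2 is just an example from the output above):

# --re-add works if the disk's superblock still matches the array;
# otherwise --add puts it back in as a fresh spare to rebuild onto
mdadm /dev/md1 --re-add /dev/sdd2
mdadm /dev/md1 --add /dev/sdd2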


sh-3.2# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid1 sdf1[7] sdg1[6] sdh1[5] sde1[4] sdc1[3] sdd1[2] sda1[0] sdb1[1]
10482304 blocks [8/8] [UUUUUUUU]

md1 : active raid10 sdg2[8] sdd2[9] sda2[0] sdb2[1] sde2[6] sdf2[5] sdc2[3] sdh2[2]
1911605248 blocks 256K chunks 2 near-copies [8/6] [UUUU_UU_]
[==>..................] recovery = 14.4% (68866176/477901312) finish=62.6min speed=108861K/sec

unused devices: <none>

Still can't connect to boutros... is it supposed to be up?

No, it's still down. Did you get an email last night? You should have.

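# (Excerpt of the notification script; the open() isn't shown. Presumably
# something like open(SENDMAIL, "|/usr/sbin/sendmail -t"), with the To: and
# From: headers printed before this point; that part is an assumption.)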
print SENDMAIL "Subject: Very serious problems with $username.xen.prgmr.com on boutros.prgmr.com\n";
print SENDMAIL "Content-type: text/plain\n\n";
print SENDMAIL
"


Background:

http://blog.prgmr.com/xenophilia/2012/11/i-chose-poorly.html

https://twitter.com/prgmrcom

Essentially, I used a used chassis with a shitty backplane that looked
like a really good deal, until I went to replace a drive the other night
and three other drives fell out of the raid at the same time. This was
the first (and last) of those chassis I will put into xen production.
(I have another one rented out as a dedicated server, but I'm less
worried about that, as it's only got a mirror, and a 2-disk mirror
deals way better with this sort of thing than an 8-disk raid10.)

So, yeah. Right now? Your disks are in a brand new (and much nicer,
as in, actually brand new) server. (Well, the RAM and some of the disks
are used, but it's ECC RAM, which is easy to test, and the disks all pass
SMART with flying colors. The motherboard, CPU, backplane and chassis are
new, the motherboard and the backplane being the important bits, and it's
all nice and solid. Supermicro; I have a bunch of identical chassis.)

That's what I did this morning.

But as for the data? It is looking grim. I've been working on getting the
software raid to do something reasonable about the fact that it had one failed
disk and then three other disks fail out within a period of 30 seconds.

I have not given up; Chris and Sarah are helping me on it. We will see.

I will communicate again tomorrow.
";
close(SENDMAIL);
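
For what it's worth, the usual last resort when --assemble --force won't take is to recreate the array over the same disks, in the same slot order, with --assume-clean, which rewrites the superblocks without starting a resync. This is dangerous and very much a sketch: the slot order and layout below are read off the [n] numbers and the "256K chunks 2 near-copies" line in the mdstat above, and a wrong order or chunk size silently scrambles the data.

# LAST RESORT. --assume-clean writes new superblocks but starts no resync;
# "missing" holds the two slots that were still being rebuilt. --metadata=0.90
# is assumed, since the mdstat output shows no "super 1.x" tag.
mdadm --create /dev/md1 --assume-clean --metadata=0.90 \
      --level=10 --layout=n2 --chunk=256 --raid-devices=8 \
      /dev/sda2 /dev/sdb2 /dev/sdh2 /dev/sdc2 missing /dev/sdf2 /dev/sde2 missing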

Hi Luke, I am on boutros and did not receive an email. Good luck w/ the data. Ironically, I was just playing with setting up backups during the day on Friday but figured they could wait until next week. That was 100% my responsibility, so I'll think twice about putting off that sort of thing in the future.

I did not get the email either (put my name in that hat). I, however, have done backups and just started with you, so if trying to save my data is the holdup, I don't need it. Just a new Ubuntu image and I'll set it back up.

Thanks
