The replacement drive we put into hydra the other day is itself unambiguously bad; I'm getting SMART errors all over the place. According to SMART, it tested okay before we put it in, but it hadn't undergone an 'extended' SMART test lately, which means I hadn't run it through burn-in.
If I remember right, this drive came out of a decommissioned dedicated server, one that was working fine, so I had no reason to believe the drive was bad. Still, I really need to test everything I put into the spare pool, and testing a drive properly means a full burn-in. (I like several runs of badblocks -w, /then/ an 'extended' SMART test. When in doubt, throw it out, so if badblocks /or/ SMART reports an error, the drive gets warrantied. The warranty policy on these 'enterprise' drives means that you can get them to replace drives you have doubts about for nothing more than the cost of shipping.)
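For reference, here is a rough sketch of that burn-in sequence as a small Python wrapper. It is not the script we actually run. The device path, the default of three badblocks passes, and the function names are all assumptions made up for illustration. badblocks -w is destructive, so the sketch defaults to a dry run that only prints the commands.

```python
import subprocess


def burn_in_commands(device, badblocks_runs=3):
    """Return the burn-in commands, in order, as argument lists.

    The run count of 3 is an assumed default, not a tested recommendation.
    """
    commands = []
    for _ in range(badblocks_runs):
        # -w: destructive read-write test (wipes the drive)
        # -s: show progress, -v: verbose
        commands.append(["badblocks", "-wsv", device])
    # Start the 'extended' SMART self-test only after badblocks is done.
    commands.append(["smartctl", "-t", "long", device])
    return commands


def run_burn_in(device, badblocks_runs=3, dry_run=True):
    """Print the commands (dry run) or execute them one after another."""
    for cmd in burn_in_commands(device, badblocks_runs):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True only catches a command that fails outright; this
            # sketch does not parse the badblocks output for reported errors.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # /dev/sdX is a placeholder, not a real device name.
    run_burn_in("/dev/sdX", dry_run=True)
```

Note that smartctl -t long only starts the self-test and returns right away; the drive runs it in the background, and the result shows up later in the output of smartctl -l selftest.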
I believe the lesson here is that we need better spare pool hygiene and discipline. Whenever a disk comes out of production, it needs to be tested before going to the spare pool.
Unresolved question: do new drives, and drives refurbished by the factory (what we get when we return a drive for warranty), need to be burnt in before going to the spare pool? Obviously, we burn them in when putting them in fresh servers, so perhaps yes? But I don't remember the last time a new drive failed burn-in (though quite often they fail within the first month of production... so maybe I just need to make my burn-in longer.)
I don't expect downtime; I have reduced the minimum raid rebuild speed, which I theorise may solve our "servers crash during raid rebuild" problems. I'll write more about that later.
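In case it helps anyone following along, here is a minimal sketch of lowering the minimum rebuild speed. It assumes Linux md software RAID, where the knob lives at /proc/sys/dev/raid/speed_limit_min and is given in kilobytes per second. The function name and the value of 500 are made up for illustration; this is not necessarily the value I set on hydra.

```python
from pathlib import Path

# Assumed location of the md minimum rebuild speed setting on Linux.
MD_SPEED_LIMIT_MIN = Path("/proc/sys/dev/raid/speed_limit_min")


def set_min_rebuild_speed(kb_per_sec, path=MD_SPEED_LIMIT_MIN):
    """Write a new minimum rebuild speed and return the previous value."""
    path = Path(path)
    previous = int(path.read_text().strip())
    path.write_text(f"{kb_per_sec}\n")
    return previous


if __name__ == "__main__":
    # Needs root. 500 is an illustrative value only.
    old = set_min_rebuild_speed(500)
    print(f"speed_limit_min was {old}, now 500")
```

Returning the previous value makes it easy to put the setting back once the rebuild is finished.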
Update: The raid rebuilding seems to have gotten stuck with a very high load average and the cpu being occupied by software interrupts, so we are going to reboot hydra in single user mode. -Nick
It was a little more trouble than it should have been. I guess grub wasn't installed on the new drive or something, so I had to pxeboot it into rescue mode to install grub, then boot into single user mode. The raid is now rebuilding properly though! One thing that confused me about the CentOS 6 rescue mode was that the "mount" and "umount" executables seem to live under /mnt/runtime, so if I mount the server's root filesystem on /mnt it gets very confused. Once I figured this out, I made a /mnt2 and mounted the root filesystem there instead.
Update: hydra should be all back now! Email us if you have trouble at email@example.com. Thanks!