rutledge RAID rebuild

Last night we replaced a hard drive in rutledge, and the RAID was rebuilding normally until the disk completely froze. I rebooted it and am letting the RAID rebuild in single-user mode now. When it's done, I will update the blog here. Email if you have any questions. Thanks!

update 20111214 8:36AM PST: Luke here. The dang thing rebuilt, then rebuilt again. I'm suspecting a bad drive. SMART on the thing hangs, and it reports drive errors (all of which have to do with SMART). So I don't have real solid evidence that the drive is bad, but no SMART, if you ask me, is enough reason to trash the drive anyhow.

Error 1 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- -- -- -- --
  04 51 01 37 00 00 00  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  b0 d6 01 be 4f c2 00 00      00:00:46.516  SMART WRITE LOG
  b0 d5 20 bf 4f c2 00 00      00:00:46.016  SMART READ LOG
  b0 d6 01 be 4f c2 00 00      00:00:46.006  SMART WRITE LOG
  b0 d5 01 bf 4f c2 00 00      00:00:45.517  SMART READ LOG
  b0 d6 01 be 4f c2 00 00      00:00:45.507  SMART WRITE LOG

edit at 20111214 10:30am PST:
The thing rebuilt successfully and was rebooted about an hour ago. SMART tests now look good (at least the short and conveyance tests; the long test still has 10% to go, and I'll update when that's done).

I'm no longer at all sure it was a disk problem; I've seen errors like this when a RAID was rebuilding too fast. There's a /sys/ entry that lets you limit rebuild speed, and we need to tweak that down next time. It used to be that it limited itself to something reasonable.
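
For anyone curious what tweaking that down might look like, here is a rough sketch in Python. It assumes the array is Linux software RAID (md), where the kernel exposes a per-array cap at /sys/block/<array>/md/sync_speed_max, measured in KiB/s. The array name md0, the 50000 figure, and the function names are made up for illustration; this is not necessarily the exact knob or value we will use on rutledge.

```python
from pathlib import Path


def sync_speed_max_path(array: str, sysfs_root: str = "/sys") -> Path:
    """Return the per-array rebuild speed cap file for an md array."""
    return Path(sysfs_root) / "block" / array / "md" / "sync_speed_max"


def parse_speed(text: str) -> int:
    """Parse a value such as '200000 (system)' or '50000 (local)' into KiB/s."""
    return int(text.split()[0])


def limit_rebuild_speed(array: str, kib_per_sec: int, sysfs_root: str = "/sys") -> int:
    """Cap the rebuild speed of `array` and return the previous cap in KiB/s."""
    if kib_per_sec <= 0:
        raise ValueError("speed cap must be a positive number of KiB/s")
    path = sync_speed_max_path(array, sysfs_root)
    # Read the current cap first so it can be restored or logged later.
    previous = parse_speed(path.read_text())
    # Writing a plain number sets a local (per-array) cap.
    path.write_text(f"{kib_per_sec}\n")
    return previous
```

Run as root, a call like limit_rebuild_speed("md0", 50000) would lower the cap and hand back the old value. The sysfs_root parameter is only there so the function can be tried against a scratch directory instead of a live machine.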

I'll update again when the SMART error clears; for now, the machine is up, and I don't expect any more reboots.

