Outage on White

edit: Note, this was down a long time, like half a day. It didn't set off my pager, as it was still pingable and answering on ssh (though it was not authenticating if your keys weren't cached in memory). This has happened several times now: outages ran long because the machine was still half up, so my pager never went off.
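A check that actually touches the disk would probably have caught this. Here is a minimal Python sketch of the idea, not what my monitoring runs today: it does a small write plus fsync in a child process and gives up after a timeout. The function name disk_write_ok, the probe file name, and the /var/tmp path are all made up for illustration.

```python
import os
import subprocess
import sys
import time

# Child script: write a few bytes, force them to disk, then clean up.
# On a hung disk the fsync (or the open) blocks, and that is the signal we want.
PROBE = (
    "import os, sys\n"
    "fd = os.open(sys.argv[1], os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)\n"
    "os.write(fd, b'ok')\n"
    "os.fsync(fd)\n"
    "os.close(fd)\n"
    "os.unlink(sys.argv[1])\n"
)

def disk_write_ok(directory, timeout=10.0):
    """Return True if a small synced write under `directory` finishes in time."""
    path = os.path.join(directory, ".disk_probe")  # hypothetical probe file name
    child = subprocess.Popen([sys.executable, "-c", PROBE, path])
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        rc = child.poll()
        if rc is not None:
            return rc == 0
        time.sleep(0.1)
    # Best effort only: a process stuck on I/O may not die until the I/O returns,
    # so we deliberately do not wait for it here.
    child.kill()
    return False

if __name__ == "__main__":
    # /var/tmp is a placeholder; point this at a filesystem on the disks you care about.
    healthy = disk_write_ok("/var/tmp")
    print("disk ok" if healthy else "disk probe failed or timed out")
    sys.exit(0 if healthy else 1)
```

The probe runs in a separate process so that the checker itself never blocks on the bad disk; a nonzero exit status is what you would wire to the pager.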

We had a hung disk. I've brought the system back up but am still troubleshooting. I'd say there is a 90% chance that I'll figure it out without requiring another reboot. (One of the disks is suspect, but I'm poking further because there is some conflicting evidence.)

So, evidence that it is sdd:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Fatal or unknown error        80%     26894         -

I have /never/ seen this. But also:

196 Reallocated_Event_Count 0x0032   189   189   000    Old_age   Always       -       11

Bad sectors are bad, so I'd like to shoot the disk. But:

[lsc@white ~]$ sudo -i iostat -x /dev/sdb /dev/sdc /dev/sdd /dev/sde
Linux 2.6.18-371.1.2.el5xen (white.prgmr.com)     10/27/2013

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.76    0.02   14.87   18.60    0.07   58.68

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb              94.92    16.91 171.32  8.24 20463.91   200.08   115.09     3.29   18.32   2.69  48.36
sdc             159.85    17.49 191.02  6.55 21081.82   191.24   107.67     3.71   18.78   2.60  51.37
sdd              86.65    17.50 170.97  6.55 20368.76   191.24   115.82     2.47   13.90   2.08  36.97
sde             170.26    16.91 193.60  8.24 21210.90   200.09   106.08     6.45   31.93   3.46  69.92

If I'm reading that right, sdd is the /fastest/ disk on that server (well, lowest latency, which is what I care about): its await is 13.90 ms, against 18.32 to 31.93 ms for the other three. (sda, in this case, is a spare and ignored.)
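To double-check that reading, here is a rough Python sketch that ranks the four disks by await using the rows pasted above. The helper names parse_iostat and rank_by are made up for illustration, and the column list assumes the layout in the header line of that output.

```python
# Device rows copied from the iostat -x output above.
IOSTAT_ROWS = """\
sdb              94.92    16.91 171.32  8.24 20463.91   200.08   115.09     3.29   18.32   2.69  48.36
sdc             159.85    17.49 191.02  6.55 21081.82   191.24   107.67     3.71   18.78   2.60  51.37
sdd              86.65    17.50 170.97  6.55 20368.76   191.24   115.82     2.47   13.90   2.08  36.97
sde             170.26    16.91 193.60  8.24 21210.90   200.09   106.08     6.45   31.93   3.46  69.92
"""

# Column names as they appear in the "Device:" header line.
COLUMNS = ["rrqm/s", "wrqm/s", "r/s", "w/s", "rsec/s", "wsec/s",
           "avgrq-sz", "avgqu-sz", "await", "svctm", "%util"]

def parse_iostat(text):
    """Map device name -> {column: value} for every well-formed device row."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS) + 1:
            continue  # blank line or some other report section
        try:
            values = [float(v) for v in fields[1:]]
        except ValueError:
            continue  # header line, not a device row
        stats[fields[0]] = dict(zip(COLUMNS, values))
    return stats

def rank_by(stats, column):
    """Device names sorted from lowest to highest value of `column`."""
    return sorted(stats, key=lambda dev: stats[dev][column])

stats = parse_iostat(IOSTAT_ROWS)
for dev in rank_by(stats, "await"):
    row = stats[dev]
    print(dev, "await", row["await"], "svctm", row["svctm"], "util", row["%util"])
```

Sorted that way, sdd comes out first and sde last, so on these numbers the suspect disk is not the slow one.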

Anyhow, it's only a 500GB disk (and thus will be quick to rebuild), so I think I might just fail sdd and rebuild onto the spare.

Also note, I put more RAM in that box while it was down, in preparation for customer upgrades (which I haven't done yet).

