luke: July 2013 Archives

dao had a hung disk; a hard reboot was required.

It's unclear which disk was the problem. I'll look into it after some sleep.


[Mon Jul 29 15:20:55 2013]ata1.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
ata1.00: irq_stat 0x40000008
ata1.00: cmd 60/0f:00:d9:09:39/00:00:33:00:00/40 tag 0 ncq 7680 in
         res 41/40:00:e4:09:39/ff:00:33:00:00/40 Emask 0x409 (media error) <F>
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
Jul 29 07:32:59 dao kernel: ata1.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB)
sda: Write Protect is off
SCSI device sda: drive cache: write back
[Mon Jul 29 15:20:57 2013]ata1.00: exception Emask 0x0 SAct 0x1fe SErr 0x0 action 0x0
ata1.00: irq_stat 0x40000008
ata1.00: cmd 60/0f:40:d9:09:39/00:00:33:00:00/40 tag 8 ncq 7680 in
         res 41/40:00:e4:09:39/ff:00:33:00:00/40 Emask 0x409 (media error) <F>
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB)
sda: Write Protect is off
SCSI device sda: drive cache: write back
ata1.00: exception Emask 0x0 SAct 0xff SErr 0x0 action 0x0
[Mon Jul 29 15:21:00 2013]ata1.00: irq_stat 0x40000008
ata1.00: cmd 60/0f:00:d9:09:39/00:00:33:00:00/40 tag 0 ncq 7680 in
         res 41/40:00:e4:09:39/ff:00:33:00:00/40 Emask 0x409 (media error) <F>
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB)
sda: Write Protect is off
SCSI device sda: drive cache: write back
ata1.00: exception Emask 0x0 SAct 0xff SErr 0x0 action 0x0
ata1.00: irq_stat 0x40000008



so it's sda.
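
A rough sketch of how the same conclusion could be pulled out of a saved log instead of by eyeballing it: tally the UNC (uncorrectable read) errors per ATA port and list the sd* devices the kernel mentions. The names `summarize`, `UNC_RE`, and `DEV_RE` are made up for illustration, and the script does not map a port to a device; that pairing still comes from the lines sitting next to each other in the log.

```python
import re
from collections import Counter

# Matches libata error lines such as "ata1.00: error: { UNC }",
# with or without a "[timestamp]" prefix in front of them.
UNC_RE = re.compile(r"(ata\d+\.\d+): error: \{[^}]*\bUNC\b[^}]*\}")
# Matches the old-style "SCSI device sda: ..." lines.
DEV_RE = re.compile(r"SCSI device (sd[a-z]+):")


def summarize(log_text):
    """Return (UNC error count per ATA port, set of sd* names mentioned)."""
    unc = Counter(UNC_RE.findall(log_text))
    devices = set(DEV_RE.findall(log_text))
    return unc, devices


sample = """\
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB)
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
SCSI device sda: drive cache: write back
"""

unc, devices = summarize(sample)
print(dict(unc), sorted(devices))
```

If only one port shows UNC errors and only one sd* device is being re-probed around them, that is a strong hint, though not proof, of which disk to pull.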

SMART confirms it.

 

it hung again.  killing sda.

# 1  Short offline       Completed: read failure       10%     13900         859376100
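
The LBA in that self-test line can be cross-checked against the kernel errors. A minimal sketch, assuming the `res` field follows the usual libata layout (status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device); the function name `lba_from_taskfile` is hypothetical.

```python
def lba_from_taskfile(tf):
    """Decode the 48-bit LBA from a libata 'cmd' or 'res' taskfile string.

    Assumed layout (my reading of the libata error format):
      status/err:nsect:lbal:lbam:lbah/hob_feat:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
    """
    _status, low, high, _device = tf.split("/")
    _err, _nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    _hob_feat, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )
    # Low three bytes come from the first group, high three from the HOB group.
    return (
        hob_lbah << 40
        | hob_lbam << 32
        | hob_lbal << 24
        | lbah << 16
        | lbam << 8
        | lbal
    )


# The 'res' value from the ata1.00 errors logged on dao.
print(lba_from_taskfile("41/40:00:e4:09:39/ff:00:33:00:00/40"))  # 859376100
```

That is the same sector smartctl reports as LBA_of_first_error, so the kernel and the drive agree on where the bad spot is.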



so yeah, uh,

16:43 <+nb> INFO: task blkback.16.xvda:12340 blocked for more than 120 seconds.
16:44 <+nb> ata3.00: exception Emask 0x0 SAct 0x7 SErr 0x0 action 0x0
16:44 <+nb> ata3.00: irq_stat 0x40000008
16:44 <+nb> ata3.00: cmd 60/e8:08:e9:91:e5/00:00:08:00:00/40 tag 1 ncq 118784 in
16:44 <+nb>          res 41/40:00:04:92:e5/00:00:08:00:00/40 Emask 0x409 (media
            error) <F>
16:44 <+nb> ata3.00: status: { DRDY ERR }
16:44 <+nb> [Mon Jul 22 16:49:59 2013]ata3.00: error: { UNC }
16:44 <+nb> Jul 22 09:01:51 black kernel: ata3.00: exception Emask 0x0 SAct 0x7
            SErr 0x0 action 0x0
16:44 <+nb> SCSI device sdc: 3907029168 512-byte hdwr sectors (2000399 MB)
16:44 <+nb> sdc: Write Protect is off
16:44 <+nb> SCSI device sdc: drive cache: write back


16:51 < prgmrcom> nb
16:51 < prgmrcom> oh no
16:52 < prgmrcom> gonna reboot it
16:53 < prgmrcom> fuuuck.  and I paid for the expensive disks that aren't
                  supposed to do that.  I'm pissed.




But yeah, the upshot is that one of our disks went bad, in a way you'd expect from a disk half as expensive. Not a good morning.



SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     12687         -
# 2  Short offline       Completed without error       00%     12663         -
# 3  Short offline       Completed without error       00%     12641         -
# 4  Conveyance offline  Completed: read failure       10%     12618         188061143
# 5  Short offline       Aborted by host               10%     12617         -
# 6  Short offline       Completed without error       00%     12595         -
# 7  Short offline       Completed: read failure       10%     12570         188061143
# 8  Short offline       Completed without error       00%     12545         -
# 9  Short offline       Completed without error       00%     12523         -
#10  Short offline       Completed without error       00%     12499         -
#11  Extended offline    Completed without error       00%     12487         -
#12  Short offline       Completed without error       00%     12475         -
#13  Short offline       Completed without error       00%     12457         -
#14  Short offline       Completed without error       00%     12357         -
#15  Short offline       Completed without error       00%     12334         -
#16  Extended offline    Completed without error       00%     12320         -
#17  Short offline       Completed without error       00%     12310         -
#18  Short offline       Completed without error       00%     12287         -
#19  Short offline       Completed without error       00%     12264         -
#20  Short offline       Completed without error       00%     12241         -
#21  Short offline       Completed without error       00%     12221         -



So yeah, I thought I remembered SMART errors on sdc (which was the problem in this case). My plan was to leave the drive in until I bought a replacement, which was clearly a mistake. Yanking the drive and heading to central right now.
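
A sketch of the check that would have nagged me sooner: scan the whole self-test log for any failed entry rather than trusting the most recent result. It assumes the column layout of the smartctl self-test output pasted above (fields separated by two or more spaces); `failed_tests` and `ROW_RE` are illustrative names, not part of smartmontools.

```python
import re

# One self-test row: "# N  Description  Status  NN%  hours  LBA-or-dash".
# Description and status may contain single spaces, so split on 2+ spaces.
ROW_RE = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def failed_tests(log_text):
    """Return (num, description, status, lifetime_hours, lba) for failed rows."""
    failures = []
    for line in log_text.splitlines():
        m = ROW_RE.match(line.strip())
        if m and "failure" in m.group(3):
            num, desc, status, _remaining, hours, lba = m.groups()
            failures.append((int(num), desc, status, int(hours), lba))
    return failures


sample = """\
# 1  Short offline       Completed without error       00%     12687         -
# 4  Conveyance offline  Completed: read failure       10%     12618         188061143
# 7  Short offline       Completed: read failure       10%     12570         188061143
#10  Short offline       Completed without error       00%     12499         -
"""

for row in failed_tests(sample):
    print(row)
```

In the log above the short tests after entry 4 passed again, so a check that only looked at the latest result would have reported the drive as fine.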

About this Archive

This page is an archive of recent entries written by luke in July 2013.
