wilson outage

So it looks to me like we've got a bad disk.


sd 0:0:0:0:
sd 0:0:0:0: timing out command, waited 20s
Oct 20 11:40:12 wilson kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88003a83a6c0)
Oct 20 11:40:12 wilson kernel: sd 0:0:0:0:
[Sun Oct 20 19:25:40 2013]Oct 20 11:40:12 wilson kernel:         command: ATA command pass through(16): 85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00
Oct 20 11:40:12 wilson kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88003a83a6c0)

....

Oct 20 18:10:23 wilson kernel: sd 0:0:0:0: timing out command, waited 20s
Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
raid10: sda2: rescheduling sector 2741196983
[Mon Oct 21 03:45:59 2013]raid10: sda2: rescheduling sector 2741196991
raid10: sda2: rescheduling sector 2741196999
raid10: sda2: rescheduling sector 2741197007
raid10: sda2: rescheduling sector 2741197015
raid10: sda2: rescheduling sector 2741197023
raid10: sda2: rescheduling sector 2741197031
raid10: sda2: rescheduling sector 2741197039
sd 0:0:0:0: rejecting I/O to device being removed
sd 0:0:0:0: rejecting I/O to device being removed
sd 0:0:0:0: rejecting I/O to device being removed
[Mon Oct 21 03:45:59 2013]raid1: Disk failure on sda1, disabling device.
        Operation continuing on 7 devices


...




  Vendor: ATA       Model: ST31000340NS      Rev: SN16
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sdk: 1953525168 512-byte hdwr sectors (1000205 MB)
sdk: Write Protect is off
SCSI device sdk: drive cache: write back
SCSI device sdk: 1953525168 512-byte hdwr sectors (1000205 MB)
sdk: Write Protect is off
SCSI device sdk: drive cache: write back
[Mon Oct 21 03:46:11 2013]sd 0:0:5:0: Attached scsi disk sdk
sd 0:0:5:0: Attached scsi generic sg0 type 0
INFO: task md1_raid10:752 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md1_raid10    D ffff88003eb57628     0   752     35           755   713 (L-TLB)
 ffff88003e3f1dc0  0000000000000246  0000000000000000  0000000000000400
 000000000000000a  ffff88003f2eb0c0  ffff88003e2e87e0  0000000000001577
 ffff88003f2eb2a8  0000000000000000
Call Trace:
 [<ffffffff8022f2b1>] __wake_up+0x38/0x4f
[Mon Oct 21 03:48:43 2013] [<ffffffff8817e875>] :raid10:raid10d+0x558/0x9bb
 [<ffffffff8026082b>] error_exit+0x0/0x6e
 [<ffffffff8029ed86>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8028abb2>] default_wake_function+0x0/0xe
 [<ffffffff8029ed86>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8026365a>] schedule_timeout+0x1e/0xad
 [<ffffffff80264929>] _spin_lock_irqsave+0x9/0x14
 [<ffffffff8029ed86>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80415359>] md_thread+0xf8/0x10e
 [<ffffffff8029ef9e>] autoremove_wake_function+0x0/0x2e
[Mon Oct 21 03:48:43 2013] [<ffffffff80415261>] md_thread+0x0/0x10e



...


After the reboot, I see:

md: bind<sdg1>
md: bind<sdh1>
md: running: <sdh1><sdg1><sdf1><sde1><sdd1><sdc1><sdb1><sda1>
md: kicking non-fresh sda1 from array!
md: unbind<sda1>
md: export_rdev(sda1)


Which means that array is coming up without sda1.

Uh, but md1 still has sda in it. Gah. I will have to fail sda out of md1 by hand. Man...
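
For reference, failing a member out of an array by hand is normally two mdadm steps: mark it faulty, then remove it. Here is a rough sketch that only builds and prints the commands rather than running them. The member name /dev/sda2 is an assumption based on the raid10 lines in the log above, and fail_and_remove_cmds is a made-up helper name; check /proc/mdstat for the real member before running anything like this.

```python
import shlex


def fail_and_remove_cmds(array, member):
    """Build (but do not run) the mdadm commands that mark a member
    faulty and then remove it from the given array."""
    return [
        ["mdadm", "--manage", array, "--fail", member],
        ["mdadm", "--manage", array, "--remove", member],
    ]


if __name__ == "__main__":
    # Assumed names: md1 is the raid10 array, sda2 its member on the bad disk.
    for argv in fail_and_remove_cmds("/dev/md1", "/dev/sda2"):
        print(shlex.join(argv))
```

Printing first and pasting the commands in by hand is deliberate: on a degraded array, you want to read the exact device names before anything gets failed.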

The machine is taking forever and a half to fsck /var, which is on md1. I will report back.



....


I rebooted the server and replaced the drive. Users should be up again.


This page contains a single entry by luke published on October 20, 2013 9:40 PM.
