see, this is why I don't assume you are stupid.

| | Comments (1)
so I had a customer the other day complaining that his domain was crashing. Now, his console was spewing oom-killer errors, and it was a 64MiB domain, and the guy was trying to run drupal. (yeah, in 64 MiB ram. laughing would be rude, but I tried to explain that the very idea of drupal in that little ram was silly) So I gave him a little more ram and showed him how to add swap, figuring that would solve his problem.

He came back and complained of another error:

[98570.577133] INFO: task cron:9347 blocked for more than 120 seconds.
[98570.577153] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.

Now, I haven't seen that before...ever. I could have made some jokes about trying to run drupal in 64GiB ram, but I am unnerved when I see new error messages (I've been doing this for half my life. I've had root on more than 60,000 servers. There are not many error messages I have not seen.)

a little searching came up with some possible debian bugs, but none of my other debian users are having trouble. So I nosed around on the Dom0, and found a bunch of disk errors:


ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: (BMDMA stat 0x0)
ata2.00: tag 0 cmd 0xc8 Emask 0x9 stat 0x51 err 0x40 (media error)
ata2: EH complete
raid1: sda2: redirecting sector 133440463 to another mirror
SCSI device sdb: 1953525168 512-byte hdwr sectors (1000205 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
SCSI device sdb: 1953525168 512-byte hdwr sectors (1000205 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: (BMDMA stat 0x0)
ata2.00: tag 0 cmd 0xc8 Emask 0x9 stat 0x51 err 0x40 (media error)
ata2: EH complete
SCSI device sdb: 1953525168 512-byte hdwr sectors (1000205 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
I ran a smart test, and sure enough, /dev/sdb is bad. it failed the 'short' test with a read error. I removed it from the mirror, and will replace it tomorrow. (yes, this means that stables' disk performance will suck this Monday, as the rebuild goes.)

the smartctl output:

SMART Error Log Version: 1
ATA Error Count: 27 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 27 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 b3 38 f4 ec  Error: UNC at LBA = 0x0cf438b3 = 217331891

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 ad 38 f4 ec 0a  45d+15:35:34.008  READ DMA
  c8 00 08 35 50 f5 ec 0a  45d+15:35:34.008  READ DMA
  c8 00 08 ed 4f f5 ec 0a  45d+15:35:34.008  READ DMA
  c8 00 08 bd 4f f5 ec 0a  45d+15:35:34.008  READ DMA
  c8 00 20 cd ec 7c ed 0a  45d+15:35:34.008  READ DMA

Error 26 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 bb 38 f4 ec  Error: UNC at LBA = 0x0cf438bb = 217331899

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 b5 38 f4 ec 0a  45d+15:35:19.338  READ DMA
  ec 00 00 bb 38 f4 a0 0a  45d+15:35:18.368  IDENTIFY DEVICE

Error 25 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 bb 38 f4 ec  Error: UNC at LBA = 0x0cf438bb = 217331899

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 b5 38 f4 ec 0a  45d+15:35:17.418  READ DMA
  ec 00 00 bb 38 f4 a0 0a  45d+15:35:16.448  IDENTIFY DEVICE

Error 24 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 bb 38 f4 ec  Error: UNC at LBA = 0x0cf438bb = 217331899

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 b5 38 f4 ec 0a  45d+15:35:15.488  READ DMA
  ec 00 00 bb 38 f4 a0 0a  45d+15:35:14.648  IDENTIFY DEVICE

Error 23 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 bb 38 f4 ec  Error: UNC at LBA = 0x0cf438bb = 217331899

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 b5 38 f4 ec 0a  45d+15:35:13.648  READ DMA
  ec 00 00 bb 38 f4 a0 0a  45d+15:35:12.608  IDENTIFY DEVICE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       20%      4884         217331891
# 2  Extended offline    Completed without error       00%         7         -
# 3  Short offline       Completed without error       00%         3         -

SMART Selective self-test log data structure revision number 1
(the '# 1 Short offline Completed: read failure ' makes it absolutely clear that the drive is bad; but usually I return for warranty any disk that has more than zero errors in the 'smart error log')

So yea, if I had ignored him, many others would have suffered similar errors, and well, half-bad disks are much worse than fully-bad disks; they can, in fact, lead to data loss. It's a mirror, so as long as I get a new drive out there tomorrow, nobody should notice (other than, as I said, disk performance sucking during the rebuild tomorrow. I have lowered the rebuild speed, so maybe it won't have such a horrific impact this time. )

1 Comments

And that sir, is why I'd rather host with you than many of the other larger operations out there who assume you're stupid, then ignore you completely when issues are reported. The customer isn't *always* wrong. Granted Drupal in 64MiB is an interesting idea. ;)

Long live Prgmr.

Leave a comment