July 2014 Archives

There is scheduled maintenance happening on Sat, 12 July 2014 20:00:00 -0700 on hosts
council.prgmr.com and halter.prgmr.com. Your server will be rebooted, with downtime of no more than three hours.
This is routine maintenance and we do not expect any issues, but as always we suggest
that you back up your data and make sure that your VPS is prepared for a shutdown.
We may need to look at your console to confirm you are back online.  If you do not want us to look inside your console, let us know at support@prgmr.com and we can work out the details with you.

*edit*

They are both back up and the new drives are rebuilding.  If you see any issues with your server, please contact support@prgmr.com.
Data recovery done:


Press Ctrl-C to interrupt
Initial status (read from logfile)
rescued:   981079 MB,  errsize:  19125 MB,  errors:   50456
Current status
rescued:   981079 MB,  errsize:  19125 MB,  current rate:        0 B/s
   ipos:         0 B,   errors:   50456,    average rate:        0 B/s
   opos:         0 B,     time from last successful read:       0 s
Finished

----


The story here is that I bought the assets of tile networks... some time ago.  These assets included two servers full of customers.  Each server had four disks, and each disk was mirrored to the other server via DRBD: sda to sda, sdb to sdb, and so on.
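
For anyone who hasn't seen DRBD, each of those disk pairs gets its own resource definition.  Here's a minimal sketch of what one looks like; every hostname, address, device, and path in it is a made-up example, not the actual tile networks config:

cat > sda.res <<'EOF'
# one resource per mirrored disk pair; sdb, sdc, and sdd would get their own
# (this would live in /etc/drbd.conf or /etc/drbd.d/ depending on DRBD version)
resource mirror-sda {
  protocol C;                    # synchronous replication
  on serverA {
    device    /dev/drbd0;        # the block device the guests/LVM actually sit on
    disk      /dev/sda3;         # the backing partition on this host
    address   192.168.1.10:7788;
    meta-disk internal;
  }
  on serverB {
    device    /dev/drbd0;
    disk      /dev/sda3;
    address   192.168.1.11:7788;
    meta-disk internal;
  }
}
EOF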

Anyhow, these servers sat running for several years.  They never got into the monitoring system; the plan was to move people to a new physical server and a new billing system, and I figured learning KVM would be good for me and my development.  Meanwhile, the old servers chugged away, and the customers got free service.  Then server 1 failed, and I rushed to set up a new server to move people over.  But I didn't email people, which was a huge mistake, and, one could argue, a breach of my responsibility that verges well into the unethical.

Anyhow, there were delays moving people over, and another disk failed in the meantime.  Since the redundant copies were all on the dead server, we were looking at data loss.

I spent some time looking at the LV metadata and was able to re-assemble several guests' worth of data right off; however, several other guests looked like they depended on data on the bad disk.
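
If you haven't had to do this: once you have the disk (or a ddrescue image of it), pulling a guest's volume out is mostly a matter of letting LVM see the PV and activating the volume group.  A rough sketch, where the loop device, VG name, and LV name are all placeholders rather than the real ones:

# assumes the PV sits directly on the disk; if it's inside a partition,
# you'd need kpartx (or a partition offset) first
losetup -f --show sdc.img          # expose the rescued image as a block device
pvscan                             # let LVM notice the PV on the loop device
vgchange -ay guestvg               # activate the volume group
lvs guestvg                        # one LV per guest
dd if=/dev/guestvg/some_guest of=some_guest.img bs=1M conv=noerror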

As is tradition, I first stuck the drive in the freezer[1], then hooked it up to ddrescue and let it run for 24 hours.  That got me 4.7GB of data, and the first 4.7GB on the disk, too.  I kept sticking it in the freezer, taking it out again, and restarting ddrescue (always run ddrescue with the logfile; it restarts where it left off), and I got a good bit more data.  By the end of the second day I had almost 250GB, but then it stopped again.  I found that just power cycling the drive, even without sticking it in the freezer, would give me a bit more data... anywhere from a few hundred megs to a few hundred gigabytes.  I spent some time power cycling the drive by hand, but as the amount of data recovered per cycle decreased, I figured this was going to get really labor intensive.

Now, being me, I have a whole bunch of Servertech remotely controllable PDUs lying about.  I hooked up my USB cradle to one of them; I was too lazy to configure it for SNMP[2] access, so I used serial.

My workflow is as follows:

For ddrescue, I run this in one window:

# keep restarting ddrescue; with the logfile it picks up where it left off
while true; do /sbin/ddrescue /dev/sdd sdc.img sdc.log; sleep 30; done


Then, in another window:


while true; do
    # snapshot the image size, wait two minutes, then snapshot it again
    du sdc.img > sz.old
    sleep 120
    du sdc.img > sz.new

    # if the image hasn't grown, ddrescue is stuck on the dead drive, so
    # power cycle outlet .a2 on the Servertech PDU over the serial console
    if diff sz.old sz.new > /dev/null; then
        echo "disk power off"
        # log in to the PDU (username and password are both 'admn')
        echo 'admn
admn
' > /dev/ttyUSB0
        sleep 2
        echo 'off .a2
' > /dev/ttyUSB0
        sleep 60
        # log back in and turn the outlet on again
        echo 'admn
admn
on .a2
' > /dev/ttyUSB0
        echo "disk power on"
    fi
done
 



Through this bad hack, I now have all but around 70GB of the bad drive recovered.  It remains to be seen whether that last 70GB is the 70GB anybody actually needs, but my guess is that I'll run this tonight and be able to recover my last complaining user tomorrow morning.  I might even recover the users that didn't complain.  (I haven't been billing them, so the users who aren't using it anymore mostly just stopped using it.)


At this point, each power cycle seems to only get me a few hundred megabytes;  there *will* be *some* data loss, but I've got most of it.  


Press Ctrl-C to interrupt
Initial status (read from logfile)
rescued:   930836 MB,  errsize:  69368 MB,  errors:   34852
Current status
rescued:   931150 MB,  errsize:  69054 MB,  current rate:        0 B/s
   ipos:     7362 MB,   errors:   35206,    average rate:    1473 kB/s
   opos:     7362 MB,     time from last successful read:     2.8 m
Splitting failed blocks...
ddrescue: input file disappeared: No such file or directory
ddrescue: Can't open input file: No such file or directory
ddrescue: Can't open input file: No such file or directory


Press Ctrl-C to interrupt
Initial status (read from logfile)
rescued:   931150 MB,  errsize:  69054 MB,  errors:   35206
Current status
rescued:   931265 MB,  errsize:  68939 MB,  current rate:        0 B/s
   ipos:     8195 MB,   errors:   35311,    average rate:     542 kB/s
   opos:     8195 MB,     time from last successful read:     3.4 m
Splitting failed blocks...
ddrescue: input file disappeared: No such file or directory
ddrescue: Can't open input file: No such file or directory
ddrescue: Can't open input file: No such file or directory


Press Ctrl-C to interrupt
Initial status (read from logfile)
rescued:   931265 MB,  errsize:  68939 MB,  errors:   35311
Current status
rescued:   931385 MB,  errsize:  68819 MB,  current rate:        0 B/s
   ipos:     9380 MB,   errors:   35521,    average rate:     564 kB/s
   opos:     9380 MB,     time from last successful read:     3.3 m
Splitting failed blocks...
ddrescue: input file disappeared: No such file or directory
ddrescue: Can't open input file: No such file or directory
ddrescue: Can't open input file: No such file or directory


Press Ctrl-C to interrupt
Initial status (read from logfile)
rescued:   931385 MB,  errsize:  68819 MB,  errors:   35521
Current status
rescued:   931621 MB,  errsize:  68583 MB,  current rate:        0 B/s
   ipos:     9930 MB,   errors:   35642,    average rate:    1114 kB/s
   opos:     9930 MB,     time from last successful read:     3.1 m
Splitting failed blocks...
----

Nothing changes on your end; on my end I need to make sure that I re-label the hardware.  The server that was going to be jellicoe now houses the disks that were battenberg.


The whole thing is up and running and looking good.

[lsc@battenberg auto]$ cat /proc/cpuinfo; cat /proc/mdstat
processor    : 0
vendor_id    : GenuineIntel
cpu family    : 6
model        : 45
model name    :        Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
stepping    : 7
cpu MHz        : 2000.000
cache size    : 20480 KB
physical id    : 0
siblings    : 1
core id        : 0
cpu cores    : 1
fpu        : yes
fpu_exception    : yes
cpuid level    : 10
wp        : yes
flags        : fpu tsc msr pae cx8 apic mtrr cmov pat clflush mmx fxsr sse sse2 ss syscall nx lm constant_tsc pni vmx est ssse3 cx16 sse4_1 sse4_2 popcnt lahf_lm
bogomips    : 5005.93
clflush size    : 64
cache_alignment    : 64
address sizes    : 46 bits physical, 48 bits virtual
power management:

Personalities : [raid1] [raid10]
md1 : active raid10 sdh2[1] sdg2[0] sdf2[7] sde2[6] sdd2[5] sdc2[4] sdb2[3] sda2[2]
      7772118016 blocks 256K chunks 2 near-copies [8/8] [UUUUUUUU]
     
md0 : active raid1 sdh1[6] sdg1[0] sdf1[7] sde1[1] sdd1[2] sdc1[3] sdb1[4] sda1[5]
      10482304 blocks [8/8] [UUUUUUUU]
     
unused devices: <none>

letting battenberg rebuild

Personalities : [raid1] [raid10]
md1 : active raid10 sda2[0] sdh2[7] sdg2[6] sdf2[5] sde2[4] sdd2[3] sdc2[2] sdb2[1]
      7772118016 blocks 256K chunks 2 near-copies [8/8] [UUUUUUUU]
      [===>.................]  resync = 18.9% (1469509312/7772118016) finish=245.7min speed=427408K/sec

Why am I not rebuilding online?  Because fear.  Two disks failed out of the raid at once on the old system.

Edit: 

Yeah, so the rebuild is almost done, and just like last time I rebuilt offline, I think I may have made a mistake.  The thing is, the first time I rebooted battenberg, things came back up, only for two disks to fall out of the raid during the rebuild, causing another reboot.

If the problem were load-related, my reaction would have been rational.  However, I don't believe the problem was load-related, so I probably should have verified that everything was up and running on the new hardware, rebooted, and let the thing rebuild online.
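
For reference, if I had let it rebuild online, the standard md knobs for keeping the resync from eating all the I/O are the speed limits in /proc; a minimal sketch, with example numbers:

cat /proc/mdstat                                   # watch resync progress
# per-device resync floor and ceiling, in KB/s (values here are just examples)
echo 5000   > /proc/sys/dev/raid/speed_limit_min
echo 100000 > /proc/sys/dev/raid/speed_limit_max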


Here is where we stand now:



(Repair filesystem) 132 # cat /proc/mdstat
Personalities : [raid1] [raid10]
md1 : active raid10 sda2[0] sdh2[7] sdg2[6] sdf2[5] sde2[4] sdd2[3] sdc2[2] sdb2[1]
      7772118016 blocks 256K chunks 2 near-copies [8/8] [UUUUUUUU]
      [===================>.]  resync = 98.8% (7683022720/7772118016) finish=5.4min speed=272570K/sec





Yeah, so the disks from battenberg?  They are now in jellicoe.  It's a very similar server, but with a different motherboard/HBA/backplane, so all the components that likely caused the problem are new.
See the errors below.  Either we have a bunch of drives going bad all at once (unlikely) or something is wrong with the SATA backplane or HBA.  I have a spare identical server and will move the drives.


ata6.00: irq_stat 0x00400040, connection status changed
ata6: SError: { PHYRdyChg 10B8B DevExch }
ata6.00: cmd 60/80:00:d9:49:fc/00:00:14:00:00/40 tag 0 ncq 65536 in
         res 40/00:04:d9:49:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata6.00: status: { DRDY }
ata6.00: cmd 60/80:08:59:4a:fc/00:00:14:00:00/40 tag 1 ncq 65536 in
         res 40/00:04:d9:49:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
[Fri Jul  4 20:19:23 2014]ata6.00: status: { DRDY }
ata6.00: cmd 60/80:10:d9:4a:fc/00:00:14:00:00/40 tag 2 ncq 65536 in
         res 40/00:04:d9:49:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata6.00: status: { DRDY }
ata4.00: exception Emask 0x10 SAct 0x1 SErr 0x4090000 action 0xe frozen
ata4.00: irq_stat 0x00400040, connection status changed
ata4: SError: { PHYRdyChg 10B8B DevExch }
ata4.00: cmd 60/80:00:59:4b:fc/00:00:14:00:00/40 tag 0 ncq 65536 in
         res 40/00:04:59:4b:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata4.00: status: { DRDY }
[Fri Jul  4 20:19:23 2014]ata5.00: exception Emask 0x10 SAct 0x6 SErr 0x4090000 action 0xe frozen
ata5.00: irq_stat 0x00400040, connection status changed
ata5: SError: { PHYRdyChg 10B8B DevExch }
ata5.00: cmd 60/80:08:59:4a:fc/00:00:14:00:00/40 tag 1 ncq 65536 in
         res 40/00:14:d9:4a:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
:


[Fri Jul  4 20:19:23 2014]ata3.00: irq_stat 0x0c400040, interface fatal error, connection status changed
ata3: SError: { PHYRdyChg 10B8B LinkSeq DevExch }
ata3.00: cmd 60/80:00:d9:3f:fc/00:00:14:00:00/40 tag 0 ncq 65536 in
         res 40/00:b4:59:4b:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }
ata3.00: cmd 60/80:08:59:40:fc/00:00:14:00:00/40 tag 1 ncq 65536 in
         res 40/00:b4:59:4b:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }
ata3.00: cmd 60/80:10:d9:41:fc/00:00:14:00:00/40 tag 2 ncq 65536 in
         res 40/00:b4:59:4b:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
[Fri Jul  4 20:19:23 2014]ata3.00: status: { DRDY }
ata3.00: cmd 60/80:18:59:42:fc/00:00:14:00:00/40 tag 3 ncq 65536 in
         res 40/00:b4:59:4b:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }
ata3.00: cmd 60/80:20:d9:42:fc/00:00:14:00:00/40 tag 4 ncq 65536 in
         res 40/00:b4:59:4b:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }


         res 40/00:b4:59:4b:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }
ata3.00: cmd 60/80:38:59:44:fc/00:00:14:00:00/40 tag 7 ncq 65536 in
         res 40/00:b4:59:4b:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }
ata3.00: cmd 60/80:40:d9:44:fc/00:00:14:00:00/40 tag 8 ncq 65536 in
         res 40/00:b4:59:4b:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }
ata3.00: cmd 60/80:48:d9:45:fc/00:00:14:00:00/40 tag 9 ncq 65536 in
[Fri Jul  4 20:19:23 2014]         res 40/00:b4:59:4b:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }
ata3.00: cmd 60/80:50:59:46:fc/00:00:14:00:00/40 tag 10 ncq 65536 in
         res 40/00:b4:59:4b:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }
ata3.00: cmd 60/80:58:59:45:fc/00:00:14:00:00/40 tag 11 ncq 65536 in
         res 40/00:b4:59:4b:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }


[Fri Jul  4 20:19:23 2014]ata3.00: status: { DRDY }
ata3.00: cmd 60/80:68:d9:49:fc/00:00:14:00:00/40 tag 13 ncq 65536 in
         res 40/00:b4:59:4b:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }
ata3.00: cmd 60/80:70:d9:46:fc/00:00:14:00:00/40 tag 14 ncq 65536 in
         res 40/00:b4:59:4b:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }
ata3.00: cmd 60/80:78:59:4a:fc/00:00:14:00:00/40 tag 15 ncq 65536 in
         res 40/00:b4:59:4b:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }
[Fri Jul  4 20:19:24 2014]ata3.00: cmd 60/80:80:d9:4a:fc/00:00:14:00:00/40 tag 16 ncq 65536 in
         res 40/00:b4:59:4b:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }
ata3.00: cmd 60/80:88:d9:40:fc/00:00:14:00:00/40 tag 17 ncq 65536 in
         res 40/00:b4:59:4b:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }
ata3.00: cmd 60/80:90:59:41:fc/00:00:14:00:00/40 tag 18 ncq 65536 in
         res 40/00:b4:59:4b:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }
ata3.00: cmd 60/80:98:59:48:fc/00:00:14:00:00/40 tag 19 ncq 65536 in


[Fri Jul  4 20:19:24 2014]         res 40/00:b4:59:4b:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }
ata3.00: cmd 60/80:a0:59:49:fc/00:00:14:00:00/40 tag 20 ncq 65536 in
         res 40/00:b4:59:4b:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }
ata3.00: cmd 60/80:a8:59:47:fc/00:00:14:00:00/40 tag 21 ncq 65536 in
         res 40/00:b4:59:4b:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }
ata3.00: cmd 60/80:b0:59:4b:fc/00:00:14:00:00/40 tag 22 ncq 65536 in
         res 40/00:b4:59:4b:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
[Fri Jul  4 20:19:24 2014]ata3.00: status: { DRDY }
ata3.00: cmd 60/80:c8:59:3f:fc/00:00:14:00:00/40 tag 25 ncq 65536 in
         res 40/00:b4:59:4b:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }
ata3.00: cmd 60/80:d0:d9:48:fc/00:00:14:00:00/40 tag 26 ncq 65536 in
         res 40/00:b4:59:4b:fc/00:00:14:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }
ata7: exception Emask 0x10 SAct 0x0 SErr 0x4090000 action 0xe frozen
ata7: irq_stat 0x00400040, connection status changed
ata7: SError: { PHYRdyChg 10B8B DevExch }
[Fri Jul  4 20:19:24 2014]Jul  4 12:41:51 battenberg kernel: ata6.00: exception Emask 0x10 SAct 0x7 SErr 0x4090000 action 0xe frozen^M

battenberg raid rebuilding

Expect slow I/O for a while.

problems with battenberg

UPDATE 11:45 PST (-7): it looks like two disks fell out of the raid for as-yet-undetermined reasons, though the disks themselves are good.  Everyone should be coming back up now.
---
We have as-yet-unknown problems with battenberg; remote access is not working, so Luke is heading to the data center right now.

Edit from Luke, 11:58 PST:

It's back.  It had failed two disks out of the raid10 and froze, as it should have.  Upon reboot, it brought one of the disks back and is currently rebuilding, as it should.  All guests should be back now.
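
For the record, once the second disk checks out, putting it back by hand looks roughly like this; the device names are examples, not the actual members on battenberg:

mdadm --manage /dev/md1 --re-add /dev/sdc2 || \
    mdadm --manage /dev/md1 --add /dev/sdc2    # fall back to a full add/rebuild
cat /proc/mdstat                               # then watch the recovery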

There are three troubling things about this outage. 

1. How long have the disks been bad?  I didn't get an email (we're still looking for the emails from this morning, and so far I've only found one; there should have been four, even if only one disk, not two, failed this morning).

2. Are the disks actually bad?  SMART shows no reallocated sectors, and usually when disks fail, there's at least a reallocated sector or two in SMART.  (A quick way to check across all the drives is sketched after this list.)


3. The logging console wasn't hooked up and logging.  This was an oversight on my part, and a big one.  If I had had the logging console hooked up, I would be able to tell you with a reasonable degree of confidence what actually happened.  The sad thing is that I had the server configured to use the SoL console, which is the hard part, but I hadn't configured it in conserver, which does the logging.  I will spend some time today making sure I have console logging everywhere.
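
As a footnote to item 2: a quick way to eyeball SMART across every drive in the box is something like the loop below (smartmontools; attribute names vary a little by vendor, so treat it as a rough check):

for d in /dev/sd[a-h]; do
    echo "== $d =="
    smartctl -A "$d" | egrep -i 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'
done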
