Data recovery update on the KVM (formerly tile networks) guest. Alternate title: Frozen Peas revisited.

Data recovery done:


Press Ctrl-C to interrupt
Initial status (read from logfile)
rescued:   981079 MB,  errsize:  19125 MB,  errors:   50456
Current status
rescued:   981079 MB,  errsize:  19125 MB,  current rate:        0 B/s
   ipos:         0 B,   errors:   50456,    average rate:        0 B/s
   opos:         0 B,     time from last successful read:       0 s
Finished
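
As a rough check on what the status block above means, here is a minimal sketch that turns the `rescued` and `errsize` figures into a percentage. The function name `pct_rescued` is made up for illustration. It assumes that rescued plus errsize covers the whole disk, which seems to hold here because the two numbers add up to the same 1000204 MB in every status block in this post.

```shell
# pct_rescued RESCUED_MB ERRSIZE_MB
# Hypothetical helper (not part of ddrescue): prints
# rescued / (rescued + errsize) as a percentage with one decimal.
# Assumes rescued + errsize is the full size of the disk.
pct_rescued() {
    awk -v r="$1" -v e="$2" 'BEGIN { printf "%.1f\n", 100 * r / (r + e) }'
}

pct_rescued 981079 19125   # figures from the final status above; prints 98.1
```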

----


The story here is that I bought the assets of tile networks some time ago. These assets included two servers full of customers. Each server had four disks, and each disk was mirrored to the second server via DRBD: sda to sda, sdb to sdb, and so on.

Anyhow, these servers sat running for several years. They never got into the monitoring system; the plan was to move people to a new physical server and a new billing system. Meanwhile, the old servers chugged away, and the customers got free service. Learning KVM would be good for me and my development. Anyhow, server 1 failed, and I rushed to set up a new server to move people over. But I didn't email people, which was a huge mistake and, one could argue, a breach of my responsibility that verges well into the unethical.

Anyhow, there were delays moving people over, and another disk failed in the meantime. Since the redundant copies were all on the dead server, we were looking at data loss.

I spent some time looking at the LV metadata and was able to reassemble several guests' worth of data right off. However, several other guests looked like they depended on the data on the bad disk.

As is tradition, I first stuck the drive in the freezer[1]. I then hooked it up to ddrescue and let it run for 24 hours. 4.7GB of data. And the first 4.7GB on the disk, too. I started sticking it in the freezer, then taking it out again and restarting ddrescue (always run ddrescue with the log; it restarts where it left off). I got a good bit more data. By the end of the second day, I had almost two hundred and fifty GB, but then it stopped again. I found that just power cycling the drive, even without sticking it in the freezer, would give me a bit more data: anywhere from a few hundred megabytes to a few hundred gigabytes. I spent some time power cycling the drive by hand, but as the amount of data recovered per cycle decreased, I figured this was going to get really labor intensive.

Now, being me, I have a whole bunch of Servertech remotely controllable PDUs lying about. I hooked up my USB cradle to one of them; I was too lazy to configure it for SNMP[2] access, so I used serial.

My workflow is as follows:

For ddrescue, I run this in one window:

while true; do /sbin/ddrescue /dev/sdd sdc.img sdc.log ;sleep 30; done


Then, in another window:


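while true; do
    # Sample the image's size on disk twice, two minutes apart.
    du sdc.img > sz.old
    sleep 120
    du sdc.img > sz.new

    # No difference means ddrescue has stalled, so power cycle the drive.
    if diff sz.old sz.new; then
        echo "disk power off"
        # Log in to the PDU over serial, then switch off outlet .a2.
        echo 'admn
admn
' > /dev/ttyUSB0
        sleep 2
        echo 'off .a2
' > /dev/ttyUSB0
        sleep 60
        # Log in again and switch the outlet back on.
        echo 'admn
admn
on .a2
' > /dev/ttyUSB0
        echo "disk power on"
    fi
done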
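
The stall check in the loop above (two du samples compared with diff) could also be pulled out into a pair of small functions, which avoids the sz.old and sz.new scratch files. This is only a sketch; the names `image_blocks` and `is_stalled` are hypothetical, and it compares allocated size in 1K blocks, the same thing du reports.

```shell
# image_blocks FILE
# Hypothetical helper: print the allocated size of FILE in 1K blocks
# (the first field of du -k).
image_blocks() {
    du -k "$1" | cut -f1
}

# is_stalled FILE SECONDS
# Hypothetical helper: return 0 (true) if FILE's allocated size did not
# change over SECONDS seconds, 1 otherwise.
is_stalled() {
    before=$(image_blocks "$1")
    sleep "$2"
    after=$(image_blocks "$1")
    [ "$before" = "$after" ]
}

# Possible use inside the watch loop:
#   if is_stalled sdc.img 120; then ...power cycle the drive...; fi
```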
 



Through this bad hack, I now have all but around 70GB of the bad drive recovered. Now, it remains to be seen whether this 70GB is the correct 70GB, but my guess is that I'll run this tonight, and I'll be able to recover my last complaining user tomorrow morning. I might even recover the users that didn't complain. (I haven't been billing them, so the users who aren't using it anymore mostly just don't use it.)


At this point, each power cycle seems to get me only a few hundred megabytes; there *will* be *some* data loss, but I've got most of it.


Press Ctrl-C to interrupt
Initial status (read from logfile)
rescued:   930836 MB,  errsize:  69368 MB,  errors:   34852
Current status
rescued:   931150 MB,  errsize:  69054 MB,  current rate:        0 B/s
   ipos:     7362 MB,   errors:   35206,    average rate:    1473 kB/s
   opos:     7362 MB,     time from last successful read:     2.8 m
Splitting failed blocks...
ddrescue: input file disappeared: No such file or directory
ddrescue: Can't open input file: No such file or directory
ddrescue: Can't open input file: No such file or directory


Press Ctrl-C to interrupt
Initial status (read from logfile)
rescued:   931150 MB,  errsize:  69054 MB,  errors:   35206
Current status
rescued:   931265 MB,  errsize:  68939 MB,  current rate:        0 B/s
   ipos:     8195 MB,   errors:   35311,    average rate:     542 kB/s
   opos:     8195 MB,     time from last successful read:     3.4 m
Splitting failed blocks...
ddrescue: input file disappeared: No such file or directory
ddrescue: Can't open input file: No such file or directory
ddrescue: Can't open input file: No such file or directory


Press Ctrl-C to interrupt
Initial status (read from logfile)
rescued:   931265 MB,  errsize:  68939 MB,  errors:   35311
Current status
rescued:   931385 MB,  errsize:  68819 MB,  current rate:        0 B/s
   ipos:     9380 MB,   errors:   35521,    average rate:     564 kB/s
   opos:     9380 MB,     time from last successful read:     3.3 m
Splitting failed blocks...
ddrescue: input file disappeared: No such file or directory
ddrescue: Can't open input file: No such file or directory
ddrescue: Can't open input file: No such file or directory


Press Ctrl-C to interrupt
Initial status (read from logfile)
rescued:   931385 MB,  errsize:  68819 MB,  errors:   35521
Current status
rescued:   931621 MB,  errsize:  68583 MB,  current rate:        0 B/s
   ipos:     9930 MB,   errors:   35642,    average rate:    1114 kB/s
   opos:     9930 MB,     time from last successful read:     3.1 m
Splitting failed blocks...
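
The "Can't open input file" lines in the logs above presumably come from the ddrescue loop restarting after its 30 second sleep while the drive is still powered off, so the device node does not exist yet. One way to quiet that down, as a sketch only, is to wait for the device node to reappear before starting ddrescue again. The function name `wait_for_path` and the 300 second timeout are assumptions for illustration, not something from the original workflow.

```shell
# wait_for_path PATH TIMEOUT_SECONDS
# Hypothetical helper: poll once a second until PATH exists.
# Returns 0 as soon as PATH exists, 1 if TIMEOUT_SECONDS pass first.
wait_for_path() {
    waited=0
    until [ -e "$1" ]; do
        [ "$waited" -ge "$2" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Possible use in the ddrescue window (same device and files as above):
#   while true; do
#       wait_for_path /dev/sdd 300 && /sbin/ddrescue /dev/sdd sdc.img sdc.log
#       sleep 30
#   done
```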


About this Entry

This page contains a single entry by luke published on July 7, 2014 3:06 AM.
