holmes post-mortem

We were able to do a full recovery on holmes, partly due to dumb luck. In this post I'm going to cover, in painful detail, the step-by-step recovery procedure I went through.

Disk configuration: 10 disks, raid 1 on partition 1 of all 10 disks, near-mirroring raid 10 on partition 2 of all 10 disks, lvm running on top of the raid 10, guest volumes running on the lvm.  There is no networked mirror or backup of the guest volumes.

As mentioned in a previous post, this kickstart file to install centos ran.  Some ways we could have stopped the disks from being partitioned / formatted automatically would have been to add a pre-installation script to verify whether this was a fresh install or not, to force the disks to be chosen interactively, or, as someone suggested, to add a chained pxe bootloader with a menu to choose between rescue and install.

The install almost finished a format of the raid 1 partition but did not complete it before Luke pulled the power.

My first concern was whether or not we had a copy of the lvm metadata. There were 2 potential sources: the first is the working metadata stored on the physical volumes, the second is a backup kept in /etc/lvm. 

Initially I did not know if the lvm metadata on the raid10 was destroyed or not - it depended  on whether the commands in the kickstart file were run sequentially based on the source file or whether the execution order was controlled by what needed to get done (partitioning should happen before formatting for example.)  

I was waiting for Luke to go with me to the data center, so I decided to review the source for anaconda.  I don't know python right now so it was a bit difficult to follow, but it was clear that the order was not just sequential - later I found a link confirming it.  Still, the lvm volume didn't have anything to do with formatting, so it was possible the data was still there.

Even so I wanted to get a copy of /etc from the old root partition.  We hadn't been backing up /etc/lvm remotely (this will soon be fixed.)  Since all the root partitions should have been the same we yanked all the disks except one, booted into rescue mode, and started the raid1 degraded.  We marked the raid read-only using mdadm --readonly /dev/md0 and I dd'ed off that partition to memory. 
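
The mechanics of that step look roughly like this; this is a sketch from memory, and the device name, mount point, and image path are illustrative:

```
# start the raid1 degraded from the one remaining member
mdadm --assemble --run /dev/md0 /dev/sda1

# forbid writes to the array while we work
mdadm --readonly /dev/md0

# image the partition into RAM so fsck runs against a copy, not the disk
mkdir -p /mnt/scratch
mount -t tmpfs tmpfs /mnt/scratch
dd if=/dev/md0 of=/mnt/scratch/root.img bs=1M
```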

I ran e2fsck -y on the image and it spewed out a whole bunch of fixed errors and eventually had a segmentation fault.  The version of e2fsck included with the rescue image comes from busybox so I figured there was a good chance that the full version of e2fsck would be able to complete.  I used wget to download a full centos64 image, mounted it loopback and chrooted there.  That version of e2fsck completed successfully. 
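
Roughly, with the image URL and paths as stand-ins (I don't remember the exact ones):

```
# fetch a full centos userland image and loop-mount it
wget "$CENTOS_IMAGE_URL"            # stand-in for the actual image URL
mkdir -p /mnt/full
mount -o loop centos64.img /mnt/full

# expose the dd'ed partition image inside the chroot, then run the
# full e2fsck from the mounted image instead of busybox's
mount --bind /mnt/scratch /mnt/full/mnt
chroot /mnt/full /sbin/e2fsck -y /mnt/root.img
```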

The top-level directory names were all gone but otherwise the files were intact. This meant we definitely had a copy of the lvm metadata. I uploaded a tarball of the important files somewhere else.

As with boutros, I wanted to work off of a degraded array with one half of the mirrors and physically leave the other half disconnected until the data recovery was complete (or not).  Unfortunately the md superblocks were overwritten and we didn't have a backup.  We did have a console log from before holmes was moved, which prints out which partitions map to which drive slot, but it's based on drive letter, not serial number, and drive letters change.  We didn't have a copy of that information either.

But the raid slot -> drive letter mapping was mostly sequential, except that /dev/sda was not present.  Since Luke hadn't physically moved the drives, it was pretty likely the raid had been reinitialized with the same drives mirrored as before.

We put all 10 drives back in.  I used mdev to have the /dev/ entries created and then I used mdadm --examine to examine which drives were currently mirrors of each other. 
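
The examination was along these lines (assuming 1.x md metadata, which reports a "Device Role" per member; in a near-2 raid10, consecutive roles - 0 and 1, 2 and 3, and so on - are copies of each other):

```
for p in /dev/sd[a-j]2; do
  echo "== $p =="
  mdadm --examine "$p" | grep -E 'Array UUID|Device Role'
done
```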

Since mdadm metadata is stored at the end of the disk, not the beginning, all the data of each mirror pair should theoretically be identical up until that metadata.  So I used the command "cmp" with no offset to find the first differing byte between each mirror pair - effectively running raid-check by hand.  The smallest difference across all the pairs was about 22GiB into the partition, and there was no way a raid resync would have gotten that far, so I was fairly confident of the mirror pairs.  If the difference had been earlier, there was still the possibility of avoiding data loss depending on how the swapping and resync had happened.

Though I had high confidence, I also ran the command "cmp -l -n289251115663 <dev1> <dev2> | wc" to count the number of bytes that differed across each mirror, excluding the md metadata.  The counts turned out to be pretty small; the largest was 18KiB.  If wc had crashed, I would have guessed the differences were pretty large for the pairs where it crashed.
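
The technique in miniature, with ordinary files standing in for the member partitions and a made-up difference near the end of the data:

```shell
# two "mirror halves" that agree except for the final byte
printf 'AAAABBBBCCCC' > half1
printf 'AAAABBBBCCCX' > half2

# report the first differing byte; cmp exits nonzero when files differ,
# so don't let that abort a script
cmp half1 half2 || true             # reports offset 12, the last byte

# count differing bytes while stopping short of where the md metadata
# would live (here, stop after 11 of the 12 bytes)
cmp -l -n 11 half1 half2 | wc -l    # prints 0: the halves agree up to there
```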

Now that I had determined which drives were pairs, I needed to decide which member of each pair to initially work off of.  To decide this I looked at these attributes in smartctl:

1 Raw_Read_Error_Rate
5 Reallocated_Sector_Ct
7 Seek_Error_Rate
9 Power_On_Hours (to compare to last test)
196 Reallocated_Event_Count
197 Current_Pending_Sector
198 Offline_Uncorrectable
200 Multi_Zone_Error_Rate
Whether errors were logged
Whether the most recent smart self-test had run recently (within 24 hours)
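
Pulling those out is a short loop per disk (smartmontools syntax; the device list is illustrative):

```
for d in /dev/sd[a-j]; do
  echo "== $d =="
  # the attribute table
  smartctl -A "$d" | grep -E 'Raw_Read_Error_Rate|Reallocated_Sector_Ct|Seek_Error_Rate|Power_On_Hours|Reallocated_Event_Count|Current_Pending_Sector|Offline_Uncorrectable|Multi_Zone_Error_Rate'
  # logged errors and the self-test history
  smartctl -l error "$d"
  smartctl -l selftest "$d"
done
```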

Of these, 1 disk was obviously bad and 3 had minor smart errors that we needed to review.  But no two of them were in the same mirror pair, so those were the ones that got excluded.  I chose a disk from the last pair at random.

We pulled half of the disks, started a degraded array of the remaining disks, and set it read-only.  To avoid more problems with busybox I used the lvm tools in the chroot.  pvscan showed the volume group "guests" had 1.3TB free, which meant the working metadata had been overwritten by the install process.  If at this point I hadn't been able to fsck the root partition, I would have run strings on the root partition image and then grepped through that file looking for the lvm metadata - since this was ext3 that probably would have worked (I think it might not have worked for ext4, since ext4 can store small files inline in the inode.)
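
That fallback works because lvm metadata is stored as plain text.  A toy version, with a scratch file standing in for the root-partition image and a fabricated metadata stub (not holmes's real metadata); here grep -a, which treats the binary as text, plays the role of the strings-then-grep pass:

```shell
# build a fake partition image: binary junk surrounding a stub of lvm
# metadata text, the way /etc/lvm/backup/guests would sit inside the
# raw ext3 image
{ head -c 4096 /dev/urandom
  printf '\nguests {\nid = "AAAAAA-stub-uuid"\nseqno = 5\n}\n'
  head -c 4096 /dev/urandom
} > fake-root.img

# search the binary image for the volume group's metadata by name
grep -a -A 3 'guests {' fake-root.img
```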

At this point I backed up the first 4KiB from each partition, which would have been useful if we had to try reordering the array later.
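
A sketch of that backup step, with ordinary files standing in for the /dev/sd[a-j]2 partitions:

```shell
# fake up two "partitions" to demonstrate on
mkdir -p headers
head -c 8192 /dev/urandom > part-a2
head -c 8192 /dev/urandom > part-b2

# save the first 4KiB of each member so the experiment could be
# reverted (or a different drive order tried) later
for part in part-a2 part-b2; do
  dd if="$part" of="headers/$part.4k" bs=4096 count=1 2>/dev/null
done

wc -c headers/*.4k    # each saved header is 4096 bytes
```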

After remounting the raid read-write I copied over the old /etc/lvm into the chroot. We used the following commands to restore the lvm metadata:

head /etc/lvm/backup/guests #verify this is the correct metadata
pvcreate -ff --restorefile /etc/lvm/backup/guests --uuid "BuBSom-hBzG-n8o3-V2X9-9Zhp-Ymze-ltP3Ys" /dev/md1 #uuid comes from pv0 id in guests
vgcfgrestore guests

Now that the metadata was reloaded I went back and set the raid read-only again.  Then we activated the first 3 volumes (home, var, and distros).

home spit back something about specifying the file system type, but we ran e2fsck on /var and /tmp and these gave back something quite reasonable.  This meant that the current first pair of the raid was the same as before.  If e2fsck had spit back complete garbage, then one of the guests might have had data loss due to their partition being partially overwritten by lvm metadata.

I activated the rest of the volumes and then ran fdisk -l on all of them on the theory that this would return garbage for some of the guests if the mirror was assembled out of order.  They all checked out, but if we had gotten garbage back, at this point we could have swapped around the order of the last 4 drives without data loss in order to find something that worked.

Being fairly certain all was in order now, we reinstalled/reconfigured holmes, but only on 5 of the 10 drives.  I realized after doing this that I had forgotten to grab public keys from /home before it was reformatted - this is something we could have recreated, but it would have been a hassle.  So we swapped out the current 5 drives with the other 5 drives in order to restore the lvm data, fsck an image of /home, and grab the keys.

We left the array running with only 5/10 drives when we first brought the guests up, in case there was some catastrophic error (yes, I snooped on consoles at this point, which on principle I typically avoid, but only to verify that at least a login prompt came up).  There were a couple of people having problems, but this appeared to be due to other factors.

Luke is working right now on beefing up our backups, including lsyncd or something else based on inotify so that files are copied off as soon as they are updated, as well as making copies of some of the information about the disks which would have made this process easier.

/Note - we are talking about backups of prgmr.com data, not of customer data.   Please back up your own customer data/ --Luke

We are also discussing different ways in which we might avoid or severely reduce downtime and the possibility of data loss.  This will probably use a combination of local and networked storage, possibly hot spare(s), and possibly snapshots or backups of guests.  But it is not something we're going to implement without a lot of research and testing to make sure that it doesn't cause more problems than it solves.


About this Entry

This page contains a single entry by srn published on May 29, 2013 6:05 PM.
