srn: May 2013 Archives

holmes post-mortem

We were able to do a full recovery on holmes, partly due to dumb luck. In this post I'm going to cover, step by step and in painful detail, the recovery procedure I went through for holmes.

Disk configuration: 10 disks, raid 1 on partition 1 of all 10 disks, raid 10 in the near-mirror layout on partition 2 of all 10 disks, lvm running on top of the raid 10, guest volumes running on the lvm.  There is no networked mirror or backup of the guest volumes.
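
For reference, recreating a layout like that from scratch would look roughly like the following - the device names, sizes, and guest volume name are placeholders, not a record of how holmes was actually built:

mdadm --create /dev/md0 --level=1 --raid-devices=10 /dev/sd[a-j]1    # root mirror
mdadm --create /dev/md1 --level=10 --layout=n2 --raid-devices=10 /dev/sd[a-j]2    # near-2 raid 10
pvcreate /dev/md1
vgcreate guests /dev/md1
lvcreate -L 10G -n exampleguest guests    # one logical volume per guest (name and size are placeholders)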

As mentioned in a previous post, this kickstart file to install centos ran.  Some ways we could have stopped the disks from being partitioned and formatted automatically would have been to add a pre-installation script that verifies whether this is a fresh install, to force the disks to be chosen interactively, or, as someone suggested, to chain a pxe bootloader with a menu to choose between rescue and install.
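
For example, a %pre guard along these lines (a hypothetical sketch, not something we currently run) could refuse to continue if it finds existing raid metadata:

%pre
# Hypothetical guard: if any md superblock already exists, assume this is a
# live machine and stall instead of letting the automated install proceed.
if mdadm --examine --scan | grep -q ^ARRAY; then
    echo "Existing raid metadata found - refusing to auto-install" > /dev/tty1
    while true; do sleep 3600; done
fi
%end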

The install had almost, but not quite, finished formatting the raid 1 partition when Luke pulled the power.

My first concern was whether or not we had a copy of the lvm metadata. There were 2 potential sources: the working metadata stored on the physical volumes, and the backup kept in /etc/lvm.

Initially I did not know whether the lvm metadata on the raid10 had been destroyed - it depended on whether the commands in the kickstart file were run sequentially in source order or whether the execution order was determined by what needed to get done (partitioning should happen before formatting, for example).

I was waiting for Luke to go with me to the data center, so I decided to review the source for anaconda.  I don't know python, so it was a bit difficult to follow, but it was clear that the order was not simply sequential - later I found a link confirming it.  Still, the lvm volume didn't have anything to do with formatting /, so it was possible the data was still there.

Even so I wanted to get a copy of /etc from the old root partition.  We hadn't been backing up /etc/lvm remotely (this will soon be fixed).  Since all the root partitions should have been the same, we yanked all the disks except one, booted into rescue mode, and started the raid1 degraded.  We marked the raid read-only using mdadm --readonly /dev/md0, and I dd'ed that partition off to memory.
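
Roughly, that came down to something like the following; the member partition name is a placeholder, and copying to /dev/shm assumes there is enough RAM to hold the image:

mdadm --assemble --run /dev/md0 /dev/sda1    # start the root mirror degraded, one member only
mdadm --readonly /dev/md0                    # make sure nothing writes to it
dd if=/dev/md0 of=/dev/shm/root.img bs=1M    # copy the partition into memory (tmpfs)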

I ran e2fsck -y on the image and it spewed out a whole bunch of fixed errors before eventually segfaulting.  The version of e2fsck included with the rescue image comes from busybox, so I figured there was a good chance that the full version of e2fsck would be able to complete.  I used wget to download a full centos64 image, mounted it loopback, and chrooted there.  That version of e2fsck completed successfully.
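
The chroot dance looked roughly like this; the image URL and paths are illustrative rather than the exact ones used:

e2fsck -y /dev/shm/root.img                     # busybox e2fsck - the one that segfaulted
wget -O /dev/shm/full.img http://mirror.example.com/centos-rescue.img   # placeholder URL
mkdir -p /mnt/full
mount -o loop /dev/shm/full.img /mnt/full
mount -o bind /dev/shm /mnt/full/mnt            # make the copied root image visible inside the chroot
chroot /mnt/full /sbin/e2fsck -y /mnt/root.img  # the full e2fsck, which completed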

The top-level directory names were all gone but otherwise the files were intact. This meant we definitely had a copy of the lvm metadata. I uploaded a tarball of the important files somewhere else.

As with boutros, I wanted to work off of a degraded array with one half of the mirrors and physically leave the other half disconnected until the data recovery was complete (or not).  Unfortunately the md superblocks were overwritten and we didn't have a backup.  We did have a console log from before holmes was moved, which prints out which partitions map to which drive slot, but it's based on drive letters rather than serial numbers, and drive letters change.  We didn't have a copy of that information either.

But the raid slot -> drive letter mapping was mostly sequential, except that /dev/sda was not present.  Since Luke hadn't physically moved the drives, it was pretty likely the raid had been reinitialized with the same drives mirrored as before.

We put all 10 drives back in.  I used mdev to create the /dev/ entries and then used mdadm --examine to see which drives were currently mirrors of each other.
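
Something like this (member names are placeholders); the array UUID and device role reported by --examine tell you which drives belong together:

mdev -s                           # populate /dev in the busybox rescue environment
for part in /dev/sd[a-j]2; do
    echo "== $part =="
    mdadm --examine "$part"
done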

Since mdadm metadata is stored at the end of the disk, not the beginning, all the data of each mirror pair should theoretically be identical up until that metadata.  So I used "cmp" with no offset to find the first differing byte between each mirror pair - effectively running raid-check by hand.  The smallest difference of all the pairs was about 22GiB into the partition, and there was no way a raid resync would have gotten that far, so I was fairly confident of the mirror pairs.  If the difference had been earlier, there would still have been a possibility of avoiding data loss, depending on how the swapping and resync had happened.

Though I had high confidence, I also ran "cmp -l -n289251115663 <dev1> <dev2> | wc" to count the bytes that differed across each mirror, excluding the md metadata.  The counts turned out to be pretty small; the largest was 18KiB.  If wc had crashed on any pair, I would have guessed the differences for that pair were pretty large.
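
For a hypothetical pair it looked something like this; the -n value is the amount of data before the md superblock at the end of the partition:

cmp /dev/sdb2 /dev/sdc2                            # prints the offset of the first differing byte
cmp -l -n 289251115663 /dev/sdb2 /dev/sdc2 | wc    # one line per differing byte, stopping short of the md metadata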

Now having determined which drives were pairs, I needed to decide which disk of each pair to initially work off of.  To decide this I looked at these attributes in smartctl (there's a query sketch after the list):

1 Raw_Read_Error_Rate
5 Reallocated_Sector_Ct
7 Seek_Error_Rate
9 Power_On_Hours (to compare to last test)
196 Reallocated_Event_Count
197 Current_Pending_Sector
198 Offline_Uncorrectable
200 Multi_Zone_Error_Rate
--
Whether errors were logged
Whether most recent smart test was within 24 hours of the last test
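
The query itself is just a loop over the drives, roughly (drive names are placeholders):

for disk in /dev/sd[a-j]; do
    echo "== $disk =="
    smartctl -A "$disk"           # the attribute table listed above
    smartctl -l error "$disk"     # whether any errors were logged
    smartctl -l selftest "$disk"  # when the most recent self-test ran
done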

Of these, 1 disk was obviously bad and 3 had minor smart errors that we need to review.  None of them were in the same mirror pair, so those were the disks that got excluded.  From the remaining pair I chose a disk at random.

We pulled half of the disks, started a degraded array of the remaining disks, and set it read-only.  To avoid more problems with busybox I used the lvm tools in the chroot.  pvscan showed the volume group "guests" with 1.3TB free, which meant its metadata had been overwritten by the install process.  If at this point I hadn't been able to fsck the root partition, I would have run strings on the root partition image and then grepped through that file looking for the lvm metadata - since this was ext3 that probably would have worked (I think it might not have worked for ext4, since ext4 has a feature to keep small files inside inodes).
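
That fallback, which we fortunately didn't need, would have looked roughly like:

strings -t d /dev/shm/root.img > /dev/shm/root.strings   # keep decimal offsets next to each string
grep -n -A40 'guests {' /dev/shm/root.strings            # the vg metadata is plain text starting with the vg name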

At this point I backed up the first 4KiB from each partition, which would have been useful if we had to try reordering the array later.
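
Roughly (member names are placeholders):

for part in /dev/sd[a-j]2; do
    dd if="$part" of="/dev/shm/$(basename $part).head" bs=4096 count=1
done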

After remounting the raid read-write I copied over the old /etc/lvm into the chroot. We used the following commands to restore the lvm metadata:

head /etc/lvm/backup/guests #verify this is the correct metadata
pvcreate -ff --restorefile /etc/lvm/backup/guests --uuid "BuBSom-hBzG-n8o3-V2X9-9Zhp-Ymze-ltP3Ys" /dev/md1 #uuid comes from pv0 id in guests
vgcfgrestore guests

Now that the metadata was reloaded I went back and set the raid read-only again.  Then we activated the first 3 volumes (home, var, and distros).
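
Roughly, with the md device name as a placeholder:

mdadm --readonly /dev/md1                            # keep the freshly restored metadata from being touched
lvchange -ay guests/home guests/var guests/distros   # activate just the first three volumes
e2fsck -n /dev/guests/var                            # -n: read-only check, change nothing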

home spat back something about specifying the file system type, but we ran e2fsck on /var and /tmp and these gave back something quite reasonable.  This meant the current first pair of the raid was the same as before.  If e2fsck had spit back complete garbage, then one of the guests might have had data loss due to their partition being partially overwritten by lvm metadata.

I activated the rest of the volumes and then ran fdisk -l on all of them, on the theory that this would return garbage for some of the guests if the mirrors were assembled out of order.  They all checked out; if we had gotten garbage back, at this point we could still have swapped around the order of the last 4 drives without data loss in order to find an arrangement that worked.
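
Something along these lines:

vgchange -ay guests              # activate the remaining logical volumes
for lv in /dev/guests/*; do
    echo "== $lv =="
    fdisk -l "$lv"               # a sane partition table suggests the pairing is right
done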

Being fairly certain all was in order now, we reinstalled and reconfigured holmes, but only on 5 of the 10 drives.  I realized after doing this that I had forgotten to grab public keys from /home before it was reformatted - something we could have recreated, but it would have been a hassle.  So we swapped out the current 5 drives with the other 5 drives in order to restore the lvm data, fsck an image of /home, and grab the keys.

We left the array running with only 5 of the 10 drives when we first brought the guests up, in case there was some catastrophic error (yes, I snooped on consoles at this point, which on principle I typically avoid, but only to verify that at least a login prompt came up).  There were a couple of people having problems, but this appeared to be due to other factors.

Luke is working right now on beefing up our backups, including lsyncd or something else based on inotify so that files are copied off as soon as they are updated, as well as making copies of some of the information about the disks that would have made this process easier.
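
To illustrate the inotify idea only - this is a bare inotifywait + rsync loop rather than lsyncd, and the host and paths are placeholders, not our actual setup:

inotifywait -m -r -e close_write,create,delete,move /etc/lvm |
while read path events file; do
    rsync -a /etc/lvm/ backuphost.example.com:/srv/backups/holmes/etc/lvm/
done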

/Note - we are talking about backups of prgmr.com data, not of customer data.   Please back up your own customer data/ --Luke

We are also discussing different ways in which we might avoid or severely reduce downtime and the possibility of data loss.  This will probably use a combination of local and networked storage, possibly hot spare(s), and possibly snapshots or backups of guests.  But it is not something we're going to implement without a lot of research and testing to make sure that it doesn't cause more problems than it solves.

We brought everyone back up after Sunday's woes and then shortly afterward had a crash similar to the one we had in February on holmes's partner taft (see http://blog.prgmr.com/xenophilia/2013/02/taft-rebooted-itself.html).

The domains are coming back now, or should already be up (this downtime was probably on the order of half an hour.)

We followed the advice in this email http://old-list-archives.xenproject.org/archives/html/xen-devel/2007-09/msg00140.html and doubled the size of the SW-IOMMU by adding swiotlb=128 to the dom0 command line.
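
For the curious, the parameter goes on the dom0 kernel ("module") line in grub; this excerpt is illustrative only, with the file names and root= value standing in for whatever holmes actually boots:

title CentOS (2.6.18-348.6.1.el5xen)
        root (hd0,0)
        kernel /xen.gz-2.6.18-348.6.1.el5
        module /vmlinuz-2.6.18-348.6.1.el5xen ro root=/dev/md0 swiotlb=128
        module /initrd-2.6.18-348.6.1.el5xen.img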

Before:
May 28 23:31:04 holmes kernel: Software IO TLB enabled:
May 28 23:31:04 holmes kernel:  Aperture:     64 megabytes

After:
May 29 01:13:27 holmes kernel: Software IO TLB enabled:
May 29 01:13:27 holmes kernel:  Aperture:     128 megabytes

This was the crash:

PCI-DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:04:00.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:04:00.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:04:00.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:04:00.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:04:00.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:04:00.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:04:00.0
[Wed May 29 08:29:09 2013]PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:04:00.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:04:00.0
PCI-DMA: Out of SW-IOMMU space for 16384 bytes at device 0000:04:00.0
Unable to handle kernel paging request at ffff880040000010 RIP:
 [<ffffffff880e6ffd>] :mpt2sas:_scsih_qcmd+0x44a/0x6e9
PGD 141f067 PUD 1621067 PMD 1622067 PTE 0
Oops: 0000 [1] SMP
last sysfs file: /block/md0/md/metadata_version
CPU 0
Modules linked in: ebt_arp ebt_ip ebtable_filter ebtables xt_physdev netloop netbk blktap blkbk ipt_MASQUERADE iptable_nat ip_nat bridge lockd sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i libcxgbi cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi ac parport_pc lp parport i2c_i801 i2c_core sr_mod cdrom joydev sg pcspkr e1000e serial_core tpm_tis tpm tpm_bios serio_raw dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage ahci libata raid10 shpchp mpt2sas scsi_transport_sas sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 464, comm: md0_raid1 Not tainted 2.6.18-348.6.1.el5xen #1
RIP: e030:[<ffffffff880e6ffd>]  [<ffffffff880e6ffd>] :mpt2sas:_scsih_qcmd+0x44a/0x6e9
RSP: e02b:ffffffff807a1d20  EFLAGS: 00010002
RAX: 0000000000000008 RBX: 0000000000000003 RCX: ffffffff880d7057
RDX: 2020202020202045 RSI: 0000000034202020 RDI: ffff88003e71fed8
RBP: ffff880040000000 R08: ffff880006658000 R09: ffff880006658000
R10: 0000001796adc000 R11: 0000000000000000 R12: ffff8800220e6e40
R13: ffff88003e7222f8 R14: ffff88003e9704f8 R15: ffff88003e71fee0
FS:  00002b3ead0801e0(0000) GS:ffffffff80639000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
[Wed May 29 08:29:09 2013]Process md0_raid1 (pid: 464, threadinfo ffff88003e1a0000, task ffff88003ee6c080)
Stack:  ffff88003f721418  ffff88003efe1000  00000bc18085b280  ffff88003ea5e0c0
 ffff88003ea5e080  94000000d5000000  fffffff4fffe3680  0000000300000bc1
 140000000000000f  ffff8800220e6e40
Call Trace:
 <IRQ>  [<ffffffff88084dbb>] :scsi_mod:scsi_dispatch_cmd+0x2ac/0x366
 [<ffffffff8808a506>] :scsi_mod:scsi_request_fn+0x2c7/0x39e
 [<ffffffff8025ebf5>] blk_run_queue+0x41/0x73
 [<ffffffff880893cd>] :scsi_mod:scsi_next_command+0x2d/0x39
 [<ffffffff8808954e>] :scsi_mod:scsi_end_request+0xbf/0xcd
[Wed May 29 08:29:09 2013] [<ffffffff880896da>] :scsi_mod:scsi_io_completion+0x17e/0x339
 [<ffffffff80264929>] _spin_lock_irqsave+0x9/0x14
 [<ffffffff880b67ce>] :sd_mod:sd_rw_intr+0x21e/0x258
 [<ffffffff88089956>] :scsi_mod:scsi_device_unbusy+0x67/0x81
 [<ffffffff80239541>] blk_done_softirq+0x67/0x75
 [<ffffffff80212f3a>] __do_softirq+0x8d/0x13b
 [<ffffffff80260da4>] call_softirq+0x1c/0x278
 [<ffffffff8026eb41>] do_softirq+0x31/0x90
 [<ffffffff802608d6>] do_hypervisor_callback+0x1e/0x2c
 <EOI>  [<ffffffff8034526a>] cfq_latter_request+0x0/0x1e
[<ffffffff8021cb71>] generic_make_request+0xb6/0x228
 [<ffffffff88075bda>] :raid1:flush_pending_writes+0x6d/0x8e
 [<ffffffff88076bf4>] :raid1:raid1d+0x39/0xbc3
 [<ffffffff8026082b>] error_exit+0x0/0x6e
 [<ffffffff8026082b>] error_exit+0x0/0x6e
 [<ffffffff8029edfd>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8029edfd>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8026365a>] schedule_timeout+0x1e/0xad
 [<ffffffff80264929>] _spin_lock_irqsave+0x9/0x14
 [<ffffffff8029edfd>] keventd_create_kthread+0x0/0xc4
[<ffffffff804153c7>] md_thread+0xf8/0x10e
 [<ffffffff8029f015>] autoremove_wake_function+0x0/0x2e
 [<ffffffff804152cf>] md_thread+0x0/0x10e
 [<ffffffff80233eb3>] kthread+0xfe/0x132
 [<ffffffff80260b2c>] child_rip+0xa/0x12
 [<ffffffff8029edfd>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80233db5>] kthread+0x0/0x132
 [<ffffffff80260b22>] child_rip+0x0/0x12
Code: 48 8b 55 10 49 8b 8e c8 03 00 00 75 06 8b 74 24 2c eb 04 8b
RIP  [<ffffffff880e6ffd>] :mpt2sas:_scsih_qcmd+0x44a/0x6e9
 RSP <ffffffff807a1d20>
CR2: ffff880040000010
 <0>Kernel panic - not syncing: Fatal exception
 (XEN) Domain 0 crashed: rebooting machine in 5 seconds.

Update on holmes

The results so far look promising.  If we are interpreting things correctly, there should be no more data loss than on any unclean shutdown.  But in order to avoid the same sort of mistakes that caused the original situation, we are taking extra time to minimize the possibility that something else goes wrong.  We should know one way or another within the next day.

dhcp problems (now fixed)

The following subnets may have had issues acquiring a dhcp lease from approximately 03:30 to 14:30 PST 2013-05-21:

71.19.145
71.19.146
71.19.147
71.19.149
71.19.150
71.19.154
71.19.156
71.19.157

The problem was that we run dhcrelay on those subnets, which forwards dhcp requests to another server, and the relays were using a cached dns lookup that pointed to an old ip address for the dhcp server rather than the current one.
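
The relay invocation is roughly the following (the interface and server address are placeholders); pointing it at the server's current ip address rather than a hostname that can go stale avoids this failure mode:

dhcrelay -i eth1 192.0.2.10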

unplanned downtime in santa clara

UPDATE 2:49 PST: All servers should be back up.

UPDATE: The problem was a dedicated server customer plugged in a computer with a configurable power supply at the wrong setting (120V instead of 240V) and it blew out the pdu.  We picked up another pdu from the office and are installing it right now.
 
The following servers are down, starting at about 01:10 PST:

sword
chariot
branch
seashell
knife
cauldron
chessboard
horn
fuller
coins
council
bowl
waite
whetstone
taney
mantle
lozenges
rutledge
pearl

Reasons are unknown (to me at least) at this time - will post with further updates as I have them.

/etc/resolv.conf

Please replace any existing nameserver entries with the following:

nameserver 71.19.155.120
nameserver 71.19.145.215


In short, for those people who have been renumbered:

username.xen.prgmr.com is becoming username.old.xen.prgmr.com
username.new.xen.prgmr.com is becoming username.xen.prgmr.com

Reverse dns mappings should be preserved.

The old IPs are only around for another day or so, so if you are not pingable on username.new.xen.prgmr.com PLEASE take action. 

We intend to send a follow-up email for those vps's which are still not responding, but you are responsible for making sure your server has the correct network configuration; we do not edit files on your machine.

As a side note, if you do not remember which server your vps is hosted on, current users should be able to log in as <user>@<user>.console.xen.prgmr.com
