May 2013 Archives

holmes post-mortem

We were able to do a full recovery on holmes, partly thanks to dumb luck. In this post I'm going to cover, in painful step-by-step detail, the recovery procedure I went through.

Disk configuration: 10 disks, raid 1 on partition 1 of all 10 disks, near-mirroring raid 10 on partition 2 of all 10 disks, lvm running on top of the raid 10, guest volumes running on the lvm.  There is no networked mirror or backup of the guest volumes.

As mentioned in a previous post, this kickstart file to install centos ran.   Some ways we could have stopped the disks from being partitioned / formatted automatically would have been to add a pre-installation script to verify whether this was a fresh install or not, force the disks to be chosen interactively, or, as someone suggested, add a chained pxe bootloader with a menu to choose between rescue and install.
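One way to do that first option is a %pre guard in the kickstart itself. A sketch only - this is not what our kickstart does, and whether mdadm is available in the %pre environment (and the right way to abort anaconda) would both need testing:

```shell
%pre
# Refuse to proceed if any partition already carries an md superblock --
# a sign this machine was installed before and clearpart is about to
# clobber live data.
for part in /dev/sd[a-z][0-9]*; do
    if mdadm --examine "$part" >/dev/null 2>&1; then
        echo "RAID metadata found on $part; refusing to auto-install." > /dev/console
        # Abort mechanism is hand-wavy: dropping to a shell here stops
        # the automated install until a human looks at the box.
        exec /bin/sh
    fi
done
%end
```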

The install almost finished a format of the raid 1 partition but did not complete it before Luke pulled the power.

My first concern was whether or not we had a copy of the lvm metadata. There were 2 potential sources: the first is the working metadata stored on the physical volumes, the second is a backup kept in /etc/lvm. 

Initially I did not know if the lvm metadata on the raid10 was destroyed or not - it depended on whether the commands in the kickstart file were run sequentially based on the source file or whether the execution order was controlled by what needed to get done (partitioning should happen before formatting, for example).

I was waiting for Luke to go with me to the data center, so I decided to review the source for anaconda.  I don't know python right now so it was a bit difficult to follow, but it was clear that the order was not just sequential - later I found a link confirming it.  Still, the lvm volume didn't have anything to do with formatting, so it was possible the data was still there.

Even so, I wanted to get a copy of /etc from the old root partition.  We hadn't been backing up /etc/lvm remotely (this will soon be fixed).  Since all the root partitions should have been the same, we yanked all the disks except one, booted into rescue mode, and started the raid1 degraded.  We marked the raid read-only using mdadm --readonly /dev/md0, and I dd'ed that partition off to memory.

I ran e2fsck -y on the image and it spewed out a whole bunch of fixed errors, then eventually hit a segmentation fault.  The version of e2fsck included with the rescue image comes from busybox, so I figured there was a good chance the full version of e2fsck would be able to complete.  I used wget to download a full centos64 image, mounted it loopback, and chrooted into it.  That version of e2fsck completed successfully.

The top-level directory names were all gone but otherwise the files were intact. This meant we definitely had a copy of the lvm metadata. I uploaded a tarball of the important files somewhere else.

As with boutros, I wanted to work off of a degraded array with one half of the mirrors, and physically leave the other half disconnected until the data recovery was complete (or not).  Unfortunately the md superblocks were overwritten and we didn't have a backup.  We did have a console log from before holmes was moved, which prints out which partitions map to what drive slot, but it's based on drive letter, not serial number, and drive letters change.  We didn't have a copy of that information either.

But the raid slot -> drive letter mapping was mostly sequential, except /dev/sda was not present.  Since Luke hadn't physically moved the drives, it was pretty likely the raid had been reinitialized with the same drives mirrored as before.

We put all 10 drives back in.  I used mdev to create the /dev/ entries and then used mdadm --examine to determine which drives were currently mirrors of each other.
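The examine step looks roughly like this (a sketch; the exact fields mdadm --examine prints depend on the md metadata version, so treat the grep as approximate):

```shell
# For each raid10 member partition, print the array UUID and this
# device's slot.  With near-2 mirroring, members in adjacent slots
# (0/1, 2/3, ...) are the mirror pairs.
for part in /dev/sd[a-j]2; do
    echo "=== $part ==="
    mdadm --examine "$part" | grep -E 'UUID|this'
done
```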

Since mdadm metadata is stored at the end of the disk, not the beginning, all the data of each mirror pair should theoretically be identical up until that metadata.  So I used cmp with no offset to find the first differing byte between each mirror pair - effectively running a manual raid-check.  The smallest difference across all the pairs was ~22GiB into the partition, and there was no way a raid resync would have gotten that far, so I was fairly confident of the mirror pairs.  If the difference had been earlier, there was still the possibility of avoiding data loss, depending on how the swapping and resync had happened.

Though I had high confidence, I also ran "cmp -l -n289251115663 <dev1> <dev2> | wc" to count the bytes that differed across each mirror, excluding the md metadata. The counts turned out to be pretty small; the largest was 18KiB. If wc had crashed on a pair, I would have taken that as a sign its differences were large.
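The cmp trick is easy to demonstrate on ordinary files; a toy version with two stand-in "mirrors" that are identical except for one byte near the end (playing the role of the md superblock area):

```shell
# Two stand-in "mirrors": identical except one byte near the end.
dd if=/dev/zero of=mirror1 bs=1M count=1 2>/dev/null
cp mirror1 mirror2
printf 'X' | dd of=mirror2 bs=1 seek=1048000 conv=notrunc 2>/dev/null

# First differing byte (1-based): cmp reports byte 1048001 here.
cmp mirror1 mirror2 || true   # exit 1 just means "files differ"

# Differing bytes within the first 1048000 bytes only (the "data" area,
# excluding the tail): none, so the pair is consistent where it matters.
cmp -l -n 1048000 mirror1 mirror2 | wc -l    # -> 0
```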

Now having determined which ones were pairs, I needed to decide which of each of those pairs to initially work off of.  To decide this I looked at these attributes in smartctl:

1 Raw_Read_Error_Rate
5 Reallocated_Sector_Ct
7 Seek_Error_Rate
9 Power_On_Hours (to compare to last test)
196 Reallocated_Event_Count
197 Current_Pending_Sector
198 Offline_Uncorrectable
200 Multi_Zone_Error_Rate
Whether errors were logged
Whether most recent smart test was within 24 hours of the last test
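Scripted up, the triage might look like this (a sketch; smartctl attribute names vary by drive and firmware, so the grep pattern is approximate):

```shell
# Dump the health-relevant SMART attributes plus the error and self-test
# logs for each candidate drive, so the pairs can be compared side by side.
for disk in /dev/sd[a-j]; do
    echo "=== $disk ==="
    smartctl -A "$disk" | grep -E \
      'Raw_Read_Error_Rate|Reallocated_Sector_Ct|Seek_Error_Rate|Power_On_Hours|Reallocated_Event_Count|Current_Pending_Sector|Offline_Uncorrectable|Multi_Zone_Error_Rate'
    smartctl -l error "$disk" | tail -n 3      # were errors logged?
    smartctl -l selftest "$disk" | head -n 8   # how recent is the last test?
done
```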

Of these, 1 disk was obviously bad and 3 had minor smart errors that we need to review. No two of them were in the same mirror pair, so those were the disks that got excluded from their pairs.  For the remaining pair I chose a disk at random.

We pulled half of the disks, started a degraded array with the remaining disks, and set it read-only.  To avoid more problems with busybox I used the lvm tools in the chroot.  pvscan showed the volume group "guests" had 1.3TB free, which meant it had been overwritten by the install process.   If at this point I hadn't been able to fsck the root partition, I would have run strings on the root partition image and then grepped through that file looking for the lvm metadata - since this was ext3, that probably would have worked (I think it might not have worked for ext4, since ext4 has a feature to keep small files in inodes).

At this point I backed up the first 4KiB from each partition, which would have been useful if we had to try reordering the array later.
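That backup is one dd per partition; a sketch, demonstrated here on a scratch file rather than the real /dev/sd[a-j]2 members:

```shell
# Save the first 4KiB of a partition, named after the device, so the
# headers can be compared (or restored) if the array order is in doubt.
backup_header() {
    dev=$1
    dd if="$dev" of="header-$(basename "$dev").bin" bs=4096 count=1 2>/dev/null
}

# On holmes this would loop over the real members; demonstrated on a
# scratch file standing in for a partition.
dd if=/dev/zero of=fakepart bs=8192 count=2 2>/dev/null
backup_header fakepart
wc -c header-fakepart.bin    # -> 4096 header-fakepart.bin
```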

After remounting the raid read-write I copied over the old /etc/lvm into the chroot. We used the following commands to restore the lvm metadata:

head /etc/lvm/backup/guests #verify this is the correct metadata
pvcreate -ff --restorefile /etc/lvm/backup/guests --uuid "BuBSom-hBzG-n8o3-V2X9-9Zhp-Ymze-ltP3Ys" /dev/md1 #uuid comes from pv0 id in guests
vgcfgrestore guests

Now that the metadata was reloaded, I went back and set the raid read-only again.  Then we activated the first 3 volumes (home, var, and distros).

home spit back something about specifying the file system type, but we ran e2fsck on /var and /tmp and these gave back something quite reasonable.  This meant the current first pair of the raid was the same as before.  If e2fsck had spit back complete garbage, then one of the guests might have had data loss due to their partition being partially overwritten by lvm metadata.

I activated the rest of the volumes and then ran fdisk -l on all of them on the theory that this would return garbage for some of the guests if the mirror was assembled out of order.  They all checked out, but if we had gotten garbage back, at this point we could have swapped around the order of the last 4 drives without data loss in order to find something that worked.

Being fairly certain all was in order, we reinstalled/reconfigured holmes, but only on 5 of the 10 drives.  After doing this I realized I had forgotten to grab public keys from /home before it was reformatted - something we could have recreated, but it would have been a hassle.  So we swapped out the current 5 drives for the other 5 in order to restore the lvm data, fsck an image of /home, and grab the keys.

We left the array running with only 5/10 drives when we first brought the guests up, in case there was some catastrophic error (yes, I snooped on consoles at this point, which on principle I typically avoid, but only to verify that at least a login prompt came up).  There were a couple of people having problems, but this appeared to be due to other factors.

Luke is working right now on beefing up our backups, including lsyncd or something else based on inotify so that files are copied off as soon as they are updated, as well as making copies of some of the information about disks which would have made this process easier.

/Note - we are talking about backups of our data, not of customer data.   Please back up your own customer data/ --Luke

We are also discussing different ways we might avoid, or severely reduce, downtime and the possibility of data loss.  This will probably use a combination of local and networked storage, possibly hot spare(s), and possibly snapshots or backups of guests.  But it is not something we're going to implement without a lot of research and testing to make sure that it doesn't cause more problems than it solves.
We brought everyone back up after Sunday's woes and then shortly had a crash similar to one we had in February on its partner taft (see 

The domains are coming back now, or should already be up (this downtime was probably on the order of half an hour.)

We followed the advice of this email and doubled the size of the SW-IOMMU by adding swiotlb=128 to the dom0 command line.
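For reference, the change amounts to adding swiotlb=128 to the dom0 kernel (module) line in grub. A sketch of what that looks like in a stock CentOS 5 Xen menu.lst - the root device and console arguments here are placeholders, not necessarily holmes's actual ones:

```shell
# /boot/grub/menu.lst (CentOS 5 + Xen) -- the dom0 kernel is the
# "module /vmlinuz-..." line.  Before:
#
#   kernel /xen.gz-2.6.18-348.6.1.el5
#   module /vmlinuz-2.6.18-348.6.1.el5xen ro root=/dev/md0 console=ttyS0,38400
#
# After (the only change; doubles the 64MB default aperture to 128MB):
#
#   kernel /xen.gz-2.6.18-348.6.1.el5
#   module /vmlinuz-2.6.18-348.6.1.el5xen ro root=/dev/md0 console=ttyS0,38400 swiotlb=128
```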

May 28 23:31:04 holmes kernel: Software IO TLB enabled:
May 28 23:31:04 holmes kernel:  Aperture:     64 megabytes

May 29 01:13:27 holmes kernel: Software IO TLB enabled:
May 29 01:13:27 holmes kernel:  Aperture:     128 megabytes

This was the crash:

PCI-DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:04:00.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:04:00.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:04:00.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:04:00.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:04:00.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:04:00.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:04:00.0
[Wed May 29 08:29:09 2013]PCI-DMA: Out of SW-IOMMU space for 65536 bytes at d
evice 0000:04:00.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:04:00.0
PCI-DMA: Out of SW-IOMMU space for 16384 bytes at device 0000:04:00.0
Unable to handle kernel paging request at ffff880040000010 RIP:
 [<ffffffff880e6ffd>] :mpt2sas:_scsih_qcmd+0x44a/0x6e9
PGD 141f067 PUD 1621067 PMD 1622067 PTE 0
Oops: 0000 [1] SMP
last sysfs file: /block/md0/md/metadata_version
Modules linked in: ebt_arp ebt_ip ebtable_filter ebtables xt_physdev netloop
netbk blktap blkbk ipt_MASQUERADE iptable_nat ip_nat bridge lockd sunrpc cpuf
req_ondemand acpi_cpufreq freq_table mperf ip_conntrack_netbios_ns ipt_REJECT
 xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpu
dp ip6table_filter ip6_tables x_tables be2iscsi ib_iser rdma_cm ib_cm iw_cm i
b_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api u
io cxgb3i libcxgbi cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 s
csi_transport_iscsi dm_multipath scsi_dh video backlight sbs power_meter hwmo
n i2c_ec dell_wmi wmi button battery asus_acpi ac parport_pc lp parport i2c_i
801 i2c_core sr_mod cdrom joydev sg pcspkr e1000e serial_core tpm_tis tpm tpm
_bios serio_raw dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot
dm_zero dm_mirror dm_log dm_mod usb_storage ahci libata raid10 shpchp mpt2sas
 scsi_transport_sas sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 464, comm: md0_raid1 Not tainted 2.6.18-348.6.
1.el5xen #1
RIP: e030:[<ffffffff880e6ffd>]  [<ffffffff880e6ffd>] :mpt2sas:_scsih_qcmd+0x44a/0x6e9
RIP: e030:[<ffffffff880e6ffd>]  [<ffffffff880e6ffd>] :mpt2sas:_scsih_qcmd+0x44a/0x6e9
RSP: e02b:ffffffff807a1d20  EFLAGS: 00010002
RAX: 0000000000000008 RBX: 0000000000000003 RCX: ffffffff880d7057
RDX: 2020202020202045 RSI: 0000000034202020 RDI: ffff88003e71fed8
RBP: ffff880040000000 R08: ffff880006658000 R09: ffff880006658000
R10: 0000001796adc000 R11: 0000000000000000 R12: ffff8800220e6e40
R13: ffff88003e7222f8 R14: ffff88003e9704f8 R15: ffff88003e71fee0
FS:  00002b3ead0801e0(0000) GS:ffffffff80639000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
[Wed May 29 08:29:09 2013]Process md0_raid1 (pid: 464, threadinfo ffff88003e1a0000, task ffff88003ee6c080)
Stack:  ffff88003f721418  ffff88003efe1000  00000bc18085b280  ffff88003ea5e0c0
 ffff88003ea5e080  94000000d5000000  fffffff4fffe3680  0000000300000bc1
 140000000000000f  ffff8800220e6e40
Call Trace:
 <IRQ>  [<ffffffff88084dbb>] :scsi_mod:scsi_dispatch_cmd+0x2ac/0x366
 [<ffffffff8808a506>] :scsi_mod:scsi_request_fn+0x2c7/0x39e
 [<ffffffff8025ebf5>] blk_run_queue+0x41/0x73
 [<ffffffff880893cd>] :scsi_mod:scsi_next_command+0x2d/0x39
 [<ffffffff8808954e>] :scsi_mod:scsi_end_request+0xbf/0xcd
[Wed May 29 08:29:09 2013] [<ffffffff880896da>] :scsi_mod:scsi_io_completion+0x17e/0x339
 [<ffffffff80264929>] _spin_lock_irqsave+0x9/0x14
 [<ffffffff880b67ce>] :sd_mod:sd_rw_intr+0x21e/0x258
 [<ffffffff88089956>] :scsi_mod:scsi_device_unbusy+0x67/0x81
 [<ffffffff80239541>] blk_done_softirq+0x67/0x75
 [<ffffffff80212f3a>] __do_softirq+0x8d/0x13b
 [<ffffffff80260da4>] call_softirq+0x1c/0x278
 [<ffffffff8026eb41>] do_softirq+0x31/0x90
 [<ffffffff802608d6>] do_hypervisor_callback+0x1e/0x2c
 <EOI>  [<ffffffff8034526a>] cfq_latter_request+0x0/0x1e
[<ffffffff8021cb71>] generic_make_request+0xb6/0x228
 [<ffffffff88075bda>] :raid1:flush_pending_writes+0x6d/0x8e
 [<ffffffff88076bf4>] :raid1:raid1d+0x39/0xbc3
 [<ffffffff8026082b>] error_exit+0x0/0x6e
 [<ffffffff8026082b>] error_exit+0x0/0x6e
 [<ffffffff8029edfd>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8029edfd>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8026365a>] schedule_timeout+0x1e/0xad
 [<ffffffff80264929>] _spin_lock_irqsave+0x9/0x14
 [<ffffffff8029edfd>] keventd_create_kthread+0x0/0xc4
[<ffffffff804153c7>] md_thread+0xf8/0x10e
 [<ffffffff8029f015>] autoremove_wake_function+0x0/0x2e
 [<ffffffff804152cf>] md_thread+0x0/0x10e
 [<ffffffff80233eb3>] kthread+0xfe/0x132
 [<ffffffff80260b2c>] child_rip+0xa/0x12
 [<ffffffff8029edfd>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80233db5>] kthread+0x0/0x132
 [<ffffffff80260b22>] child_rip+0x0/0x12
Code: 48 8b 55 10 49 8b 8e c8 03 00 00 75 06 8b 74 24 2c eb 04 8b
RIP  [<ffffffff880e6ffd>] :mpt2sas:_scsih_qcmd+0x44a/0x6e9
 RSP <ffffffff807a1d20>
CR2: ffff880040000010
 <0>Kernel panic - not syncing: Fatal exception
 (XEN) Domain 0 crashed: rebooting machine in 5 seconds.

Packet loss issues now fixed

The packet loss some routes experienced between 4:30pm and 8:10pm PST is now fixed.

03:14 <@prgmrcom> OK, so what happened was that I'm getting rid of my shit in 
                  c07 and c08 in room 1435 at 55 s. market.   
03:15 <@prgmrcom> I got another cross connect from egi to yd33 (in room 1435... 
                  gah, c07 and c08 were in 1460, not 1435) 
03:15 <@prgmrcom> on the same network.
03:15 <@prgmrcom> and I told them to make it 100Mbps.
03:15 < Bugged> getting 1% or less loss now
03:15 < refreshingapathy> let me guess, they dropped a 0
03:15 <@prgmrcom> so they set it to 100mbps.   and apparently the switch in c08 
                  was configured to 1000/full or something.    
03:16 < refreshingapathy> ahhh.
03:16 <@prgmrcom> I switched to the other cross connect that didn't use that 
03:16 <@prgmrcom> didn't use that switch.
03:16 <@prgmrcom> so now I just need to get the garbage out of c08.  

Update on holmes

The results so far look promising.  If we are interpreting things correctly there should be no more data loss than on any unclean shutdown. But in order to avoid the same sort of mistakes that caused the original situation, we are taking extra time in order to minimize the possibility that something else goes wrong.  We should know one way or another within the next day.
srn is working on the actual data recovery right now. 

First?  regardless of the results here, while holmes is down, I'm going to double the ram in the thing (and double ram for all users on holmes)   - holmes users will keep that ram doubling regardless of the results of our data recovery efforts.    (note, when allocations are increased for all users, your allocations won't go down, but your increase will be based on what you are paying for, not based on what you have now... e.g. if I double your ram now, and then I give all users 4x ram, you will get 4x what you are paying for, or double what you will have after I upgrade holmes.) 

Next, I'm going to explain a bit about how our pxe setup is setup, and how we will change this.

I mean, at root here?  I fucked up.  I should have checked the pxe before booting, and failing that, I should have been faster about yanking the power when it came up with 'welcome to Centos' instead of 'centos rescue' in the top left corner of the serial console.   I would have been less likely to make this mistake if I was less burnt out.  (I have pxe-booted servers into rescue mode hundreds, perhaps thousands of times with this system, and this is the first time I've made this mistake.)    So yeah.    It's /very/ good that  this was the last of the xen servers to move.  I still have dedicated servers to move, but I can do those one at a time, and usually I can have my co-lo customers come in and help me move their stuff for them. 

First?  the current setup:   Holmes has a mac address of 00:25:90:2c:a3:a0.  So, our dhcp server has an entry like so:

host {
        hardware ethernet 00:25:90:2c:a3:a0;
        option host-name "";
        filename "pxelinux.0";
}

our tftp server has the file pxelinux.0, which then downloads via TFTP the file


which has the following contents:

SERIAL 0 38400
default centos
label centos
kernel vmlinuz
append initrd=initrd.img serial console=ttyS0,38400n8 ks= ksdevice=00:25:90:2c:a3:a0
#append initrd=initrd.img serial console=ttyS0,38400n8
#append initrd=initrd.img serial console=ttyS0,38400n8 ksdevice=00:25:90:2c:a3:a0 rescue

(That password hash is temporary, and at no time is root login permitted over ssh anyhow, so I'm okay with this being public.   Really, the big weak (security) point in this system is if I somehow screw up the firewall rules that prevent customers from running their own DHCP servers...  but I'd have to screw up that rule /and/ the malicious customer would have to be running the dhcp server as I configured the server to pxeboot.  in normal operation, the servers boot off disk, not network.   But it's still a weak point.   I ought to PXE off of an internal, trusted-host-only network.)

Anyhow, by policy, after we install the server, we comment out the append line with ks= (that's the install line) and uncomment the

append initrd=initrd.img serial console=ttyS0,38400n8 rescue ksdevice=00:25:90:2c:a3:a0


I'm going through with grep right now and verifying that this has been done everywhere.  (It hasn't.  I'm fixing by hand.) 

For now, I've added 

grep ks /srv/tftp/pxelinux.cfg/* |grep -v "#append" |grep -v rescue

to my crontab, so I'll be emailed if there is a dangerous pxelinux.cfg, but that's a pretty clumsy way of going about it.

Next?  I am going to change my standard .ks files.   First, I'm going to have Nick remove all the .ks files he has on his webserver (without a .ks file, nothing is going to happen automatically), and I'm going to hide all the .ks files that I control, too.

Going forward?  I'm going to automate the management of the kickstart files, but that's a project for another day.

I screwed up.

Holmes was having problems after the move;  it wouldn't boot.   I pxe-booted holmes (as is standard practice when a system won't boot) without first verifying that I had removed the .ks file and appended 'rescue' to the kernel in the pxelinux.cfg file.   (policy is to do that as soon as you install a new system, before it goes production, but it's not automated, so it doesn't always happen.)

clearpart was run, and a new / partition was laid down and formatted (but no customer data is on the / partition, save for your public keys).

Clearpart means that the metadata for the LVM partitions was removed, but I stopped the install before new data was written to the LVM, so it's possible we will be able to recover the data.   

Either way, I'm clearly not in any shape to handle root. I hope the problem is insufficient sleep.  I will sleep and report back in the morning.

Note, as of now?  all xen hosts are either in yd33 at 55 s. market or in coresite santa clara 2972 stender.    I still have dedicated servers to do, but that won't impact the VPS customers.
so yeah, srn found a bug;    the NIC offloading stuff has never worked properly for virtual guests... but with the latest RHEL/CentOS kernel it's gone from "you drop a few packets every now and then" to "takes down your guest entirely if you send just one packet"

So yeah, uh,  we'll change the starting image to add

ethtool -K eth0 tso off gso off 

to /etc/rc.local.   Please do the same on your guest.

details from srn:

4 separate domu's have been seeing an instance of this bug - probably more will do so as they upgrade:

This behavior on the dom0 side (disconnecting when it sees a packet that is too large) was introduced in 2.6.18-348.4.1.el5.  It is not present in 2.6.18-348.3.1.el5.  It is still present in 2.6.18-348.6.1.el5 (latest).

40 of our servers have 2.6.18-348.4.1.el5.

There is a bug fix:

But I don't know what the status of that is WRT centos.  I guess this redhat bug is related:

But without a redhat account we can't look.

domu's can work around this (apparently with some performance impact) by running

ethtool -K eth0 tso off gso off

Considering we have 40 servers running 4.1 and only 4 people have been affected, is the best thing to do just to send a list out to announce / the blog and throw swatch on the console logs?

I may poke at the centos virt mailing list and ask if they know if there's a timeline for applying the patch to netback I linked to above.

dhcp problems (now fixed)

The following subnets may have had issues acquiring a dhcp lease from approximately 03:30 to 14:30 PST 2013-05-21:


The problem: we run dhcrelay on those subnets, which forwards dhcp requests to another server, and it was using a cached dns lookup that pointed to an old ip address for the dhcp server rather than the current one.

NOTE:  please read the report written by srn.  It's better:

So, a co-lo customer brings in a desktop so cheap that it has a manual 120-240
switch.  My co-lo, of course, is 208.

Everything made in the last 10 years auto-switches 100-240v. But not this one.

Anyhow, he plugged it in, and this destroyed my PDU.  We went to the
office and grabbed a spare, mounted it, and you should be back up (the
power cord situation in rack 05-11 is now much, much worse, due to plugging
it back in during a panic.)

This is my fault, of course;  I thought this person was pretty competent,
and I had a little room left in 05-11 and nowhere else, so I put him in with
the production stuff.   Clearly, this was a mistake.

Interestingly, this was on a sub-pdu (plugged in via a c19->c20; my main pdu has several c19 outlets as well as a bunch of c13) - and the sub-pdu still works just fine; it was the main PDU that completely fried.   I need to learn more about electricity.

anyhow, you should be back up now.  I'm sorry.

unplanned downtime in santa clara

UPDATE 2:49 PST: All servers should be back up.

UPDATE: The problem was a dedicated server customer plugged in a computer with a configurable power supply at the wrong setting (120V instead of 240V) and it blew out the pdu.  We picked up another pdu from the office and are installing it right now.
Following servers are down, starting at about 01:10 PST:


Reasons are unknown (to me at least) at this time - will post with further updates as I have them.


Please replace any existing nameserver entries with the following:


In short, for those people who have been renumbered is becoming is becoming

reverse dns mappings should be preserved.

The old IPs are only around for another day or so, so if you are not pingable on PLEASE take action. 

We intend to send a follow-up email for those vps's which are still not responding, but you are responsible for making sure your server has the correct network configuration; we do not edit files on your machine.

As a side note, if you do not remember which server your vps is hosted on, current users should be able to log in as <user>@<user>


Just to let you know, support is getting a little backed up with all the move going on and the IP address renumbering.  Please be patient.  We are attempting to get back to normal as soon as possible.

If you need something urgently, please go to #prgmr on Freenode IRC and ask for prgmrcom (Luke), srn_prgmr (Sarah), danihan (Dan), lifftchi (Chris), nb (Nick), or _will_ (Will) and give us your ticket number, and we will attempt to take care of it as soon as we can.  Otherwise, we are attempting to get through the ticket queue as quickly as possible.

Additional unplanned downtime

so yeah, we have a handful of dell C6100s in the bottom of each of the racks.   These are me working on a deal where I help unixsurplus lease some of their stuff.  Anyhow, this evening around 8:00pm my time, Miles was swapping out one of the units.  And here is why it's a bad idea to put these things on the bottom of the rack;   the 4 blades all come out the back... and the l6-30 is right there in the way.   Well, the upshot is that power got disconnected for the whole rack.  

This outage was exacerbated by me just doing the minimum to get servers up last night.   Some of the servers, for instance, defaulted to non-xen linux in the grub config (we do this often when troubleshooting the serial console; we manually select the xen kernel, then fix menu.lst after the box is up.  Well, I didn't fix menu.lst when I was configuring these this morning.)

So yeah, what are we going to do to fix this problem?   Well, first, I think the importance of being careful around power connectors has been impressed upon Miles.  But more importantly, we are now moving all the c6100 units up to mid-rack height.
Wow.  That was a lot more than an hour of downtime.   I'm sorry.   Several
things went wrong.  You know all those plans I was talking about with
regards to having everything set and ready to go?   Well, first?
the console server wasn't all the way setup.    Next?  turns out, I had 5
pre-made opengear dongles, not 10 like I needed;  I of course have plenty
of unconfigured db9->rj45 adaptors about, but the instructions I had were
by color (which is stupid, as there aren't standard colors for the
rj45->db9 dongles, and my new ones are different)  and it took me a while
to figure out how to make a proper opengear rj45->db9 dongle.

Then, the rest?  mostly just being stupid and tired.   Like we swapped the
network ports (which I had pre-configured) then stood around like idiots
wondering why the thing boots so slowly.

Anyhow, I've got a 5 year lease on my four 5kw racks here in coresite
santa clara, so for you?  the ordeal is over.  Yes, at some point I'll
want to move you on to better hardware, but that usually comes with better
performance and usually more disk/ram and stuff, and often it can be done
at the user's convenience.  (that's what I told the users who moved, anyhow.)

Anyhow, I think I'm not moving anything tonight.  I will need to clean up the
co-lo, but most of my mistakes here were of the tired and stupid variety.  

I'm... kind of an idiot.   See, I'm up writing this instead of sleeping, when I'm clearly stupid at this point.   But yeah.  Thing is?  the new network consists of a force10 10G switch (4x fxp, 24x cx4) for all four racks, plus one woven trx100 (4-port 10g cx4, 48-port 1g rj45) switch per rack.  I'm using lacp to aggregate the 10g links into one 40G link (it actually works pretty well: because you all have different mac addresses, the traffic is well-balanced.   A common problem with layer-two link aggregation is that it usually uses the source and dest mac addresses to figure out which link to send a packet down, so it's easy to get unbalanced traffic... but you all have different mac addresses, so it spreads traffic as well as it would on a much larger network.)

Anyhow, uh, yeah.   lemme check my notes.

okay, on the woven?  to put a port on a vlan, you do something like this:

interface 0/3
vlan pvid 30

or so.    Then, on the aggregate, you do something like:

on the woven:
interface 3/1
vlan acceptframe vlanonly
vlan participation exclude 1
vlan participation include 30,33
vlan tagging 30,33
lacp collector max-delay 0

See, I thought that the 'vlan tagging' thing was for access-mode links - to tag incoming packets that don't already have a vlan header with a particular vlan - so I removed (!) it.    From a working (!) vlan.  So clearly, it was a stupid, stupid, undercaffeinated mistake.

I figured the problem out shortly after making myself a cup of coffee.  
uh, yeah, set your IPs statically for now.  We're getting reports of DHCP being broken in some places but not others.  It doesn't make sense to me right now.  I'm going to set up a dhcrelay server on each subnet at each location, and that should cover it.

An error in judgement

so yeah, I emailed the users of 15 servers, told them they'd have an hour of downtime "Saturday Evening"

I figured I'd do 5 servers at a go.   small number.   

So yeah, I came up with some spare rails, screwed 'em in, set up the PDU, etc... did all the usual things you do before you move servers.   But these things have a way of taking longer than you expect, and I also had to configure an unfamiliar switch (my force10 10gbe setup, which I still don't have configured).

So after we set up the destination co-lo, we headed to the source co-lo (55 s. market) - this means getting tickets to remove servers and arranging it so that the security folks know we are coming and verify our tickets before we shut things down (it always takes a few tries to get the ticket right to leave with hardware).   So anyhow, it was around 4:30AM local time when we started shutting things down.  It's 9:00am now.     I really should have called it off around 1:00am or so.

I mean, the rest of our problems have to do with the fact that these are some of the oldest servers (both hardware- and software-wise) in the fleet.   Well, that, and I'm not exactly at my mental best at this point.

Anyhow, we're back up.  I'm sorry.  

so, uh, yeah.  stupid mistake on my part.   approx. 20 mins. downtime (longer for IPv6)    you are back now

and council today

UPDATE 2013-05-03 1:42 PST : Drive is replaced, the raid is rebuilding.  Estimated completion time is ~ 2013-05-04 10:30 PST.

Why isn't it failing out of the raid? And that's the 'enterprise' one - weird.   anyhow, I failed the drive.   (interestingly, this is one of those ancient servers that still has a 'consumer' drive - it was the 'enterprise', not the 'consumer', that went bad.)

anyhow, I'm off to swap a drive.  

1.00: (BMDMA stat 0x0)
ata1.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: (BMDMA stat 0x0)
ata1.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: (BMDMA stat 0x0)
ata1.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
end_request: I/O error, dev sdc, sector 283579720
SCSI device sdc: 2930277168 512-byte hdwr sectors (1500302 MB)
sdc: Write Protect is off
SCSI device sdc: drive cache: write back
SCSI device sdc: 2930277168 512-byte hdwr sectors (1500302 MB)
sdc: Write Protect is off
SCSI device sdc: drive cache: write back
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: (BMDMA stat 0x0)
ata1.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
SCSI device sdc: 2930277168 512-byte hdwr sectors (1500302 MB)
sdc: Write Protect is off
SCSI device sdc: drive cache: write back
raid1: sdb2: redirecting sector 262614887 to another mirror

and yeah. I'm going to take the marvell card out of Wilson and replace it with an LSI or something.

It looks like the dom0 got hit harder by the crash than the domUs, though I would expect some of you will have to fsck.  You got many of my updates this time, though as the evening wore on, so did my coherency.

serious issues on rehnquist

[root@rehnquist ~]# vi /boot/grub/menu.lst 
Segmentation fault

from dmesg:

mvsas 0000:07:00.0: RXQ_ERR 20005
mvsas 0000:07:00.0: RXQ_ERR 20007
mvsas 0000:07:00.0: RXQ_ERR 20006
mvsas 0000:07:00.0: RXQ_ERR 20005
mvsas 0000:07:00.0: RXQ_ERR 20004
mvsas 0000:07:00.0: RXQ_ERR 20004
mvsas 0000:07:00.0: RXQ_ERR 20005
mvsas 0000:07:00.0: RXQ_ERR 20005
mvsas 0000:07:00.0: RXQ_ERR 20007
mvsas 0000:07:00.0: RXQ_ERR 20007

So... one possible course of action is to take it down and tear out the marvell card.   I'm going to do that now.   

I am just going to swap it with the spare.   

About this Archive

This page is an archive of entries from May 2013 listed from newest to oldest.

April 2013 is the previous archive.

June 2013 is the next archive.

Find recent content on the main index or look in the archives to find all content.