luke: May 2013 Archives

Packet loss issues now fixed

The packet loss some routes experienced between 4:30pm and 8:10pm PST is now fixed.


03:14 <@prgmrcom> OK, so what happened was that I'm getting rid of my shit in 
                  c07 and c08 in room 1435 at 55 s. market.   
03:15 <@prgmrcom> I got another cross connect from egi to yd33 (in room 1435... 
                  gah, c07 and c08 were in 1460, not 1435) 
03:15 <@prgmrcom> on the same network.
03:15 <@prgmrcom> and I told them to make it 100Mbps.
03:15 < Bugged> getting 1% or less loss now
03:15 < refreshingapathy> let me guess, they dropped a 0
03:15 <@prgmrcom> so they set it to 100mbps.   and apparently the switch in c08 
                  was configured to 1000/full or something.    
03:16 < refreshingapathy> ahhh.
03:16 <@prgmrcom> I switched to the other cross connect that didn't use that 
                  link.
03:16 <@prgmrcom> didn't use that switch.
03:16 <@prgmrcom> so now I just need to get the garbage out of c08.  
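For what it's worth, the loss above is the classic symptom of a speed/duplex mismatch (their side forced to 100Mbps, our switch at 1000/full).  The same kind of sanity check on a Linux host NIC looks roughly like this - eth0 is just a placeholder:

ethtool eth0                                        # check the Speed:, Duplex:, and Auto-negotiation: lines
ethtool -s eth0 speed 100 duplex full autoneg off   # if the far side is hard-set, match it explicitly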
srn is working on the actual data recovery right now. 

First?  regardless of the results here, while holmes is down, I'm going to double the ram in the thing (and double ram for all users on holmes)   - holmes users will keep that ram doubling regardless of the results of our data recovery efforts.    (note, when allocations are increased for all users, your allocations won't go down, but your increase will be based on what you are paying for, not based on what you have now... e.g. if I double your ram now, and then I give all users 4x ram, you will get 4x what you are paying for, or double what you will have after I upgrade holmes.) 
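(A concrete, made-up example: say you pay for 1GiB of ram.  The holmes doubling puts you at 2GiB now.  If I later give all users 4x, you end up at 4GiB - 4x the 1GiB you pay for, and double the 2GiB you'll have after the holmes upgrade - not 8GiB.)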

Next, I'm going to explain a bit about how our pxe setup works, and how we will change it.

I mean, at root here?  I fucked up.  I should have checked the pxe before booting, and failing that, I should have been faster about yanking the power when it came up with 'welcome to Centos' instead of 'centos rescue' in the top left corner of the serial console.   I would have been less likely to make this mistake if I was less burnt out.  (I have pxe-booted servers into rescue mode hundreds, perhaps thousands of times with this system, and this is the first time I've made this mistake.)    So yeah.    It's /very/ good that  this was the last of the xen servers to move.  I still have dedicated servers to move, but I can do those one at a time, and usually I can have my co-lo customers come in and help me move their stuff for them. 

First?  the current setup:   Holmes has a mac address of 00:25:90:2c:a3:a0.  So, our dhcp server has an entry like so:

host holmes.prgmr.com {
        hardware ethernet 00:25:90:2c:a3:a0;
        fixed-address 71.19.149.4;
        option host-name "taft.prgmr.com";
        next-server 71.19.158.245;
        filename "pxelinux.0";
}

Our tftp server serves the file pxelinux.0; once loaded, pxelinux downloads (again over TFTP) the per-host config file

 pxelinux.cfg/01-00-25-90-2c-a3-a0

which has the following contents:

SERIAL 0 38400
default centos
label centos
kernel vmlinuz
append initrd=initrd.img serial console=ttyS0,38400n8 ks=http://www.schmalenberger.us/prgmr/holmes-ks.cfg ksdevice=00:25:90:2c:a3:a0
#append initrd=initrd.img serial console=ttyS0,38400n8
#append initrd=initrd.img serial console=ttyS0,38400n8 ksdevice=00:25:90:2c:a3:a0 rescue
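As an aside, that per-host filename is just pxelinux's standard naming: "01-" (the ethernet hardware type) followed by the MAC, lowercased, with the colons swapped for dashes.  Purely as an illustration:

mac="00:25:90:2c:a3:a0"
echo "pxelinux.cfg/01-$(echo "$mac" | tr 'A-Z:' 'a-z-')"
# prints pxelinux.cfg/01-00-25-90-2c-a3-a0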


(That password hash is temporary, and root login is never permitted over ssh anyhow, so I'm okay with this being public.   Really, the big weak (security) point in this system is if I somehow screw up the firewall rules that prevent customers from running their own DHCP servers...  but I'd have to screw up that rule /and/ the malicious customer would have to be running their rogue dhcp server at the moment I pxeboot one of my servers.  In normal operation, the servers boot off disk, not the network.   But it's still a weak point.   I ought to PXE off of an internal, trusted-host-only network.)


Anyhow, by policy, after we install the server, we comment out the append line with ks= (that's the install line) and uncomment the

append initrd=initrd.img serial console=ttyS0,38400n8 rescue ksdevice=00:25:90:2c:a3:a0

line.  

I'm going through with grep right now and verifying that this has been done everywhere.  (It hasn't.  I'm fixing by hand.) 

For now, I've added 

grep ks /srv/tftp/pxelinux.cfg/* |grep -v "#append" |grep -v rescue

to my crontab, so I'll be emailed  if there is a dangerous pxelinux.cfg, but that's a pretty clumsy way of going about it.
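Something slightly less clumsy might be a small check script - just a sketch, using the same path the grep above assumes:

#!/bin/sh
# flag any pxelinux.cfg whose *active* append line still carries ks=
# (i.e. install mode) and isn't a rescue entry
for f in /srv/tftp/pxelinux.cfg/*; do
    if grep -E '^[[:space:]]*append.*ks=' "$f" | grep -qv rescue; then
        echo "WARNING: $f still has an active kickstart install line"
    fi
done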

Next?  I am going to change my standard .ks files.   First, I'm going to have Nick remove all the .ks files he has on his webserver (without a .ks file, nothing is going to happen automatically), and I'm going to hide all the .ks files that I control, too.

Going forward?  I'm going to automate the management of the kickstart files, but that's a project for another day.

I screwed up.

Holmes was having problems after the move;  it wouldn't boot.   I pxe-booted holmes (as is standard practice when a system won't boot) without first verifying that I had removed the .ks file and appended 'rescue' to the kernel in the pxelinux.cfg file.   (Policy is to do that as soon as you install a new system, before it goes into production, but it's not automated, so it doesn't always happen.)

Clearpart was run, and a new / partition was laid down and formatted (but no customer data is on the / partition, save for your public keys).

Clearpart means that the metadata for the LVM partitions was removed, but I stopped the install before new data was written to the LVM, so it's possible we will be able to recover the data.   

Either way, I'm clearly not in any shape to handle root. I hope the problem is insufficient sleep.  I will sleep and report back in the morning.


Note, as of now?  all xen hosts are either in yd33 at 55 s. market or in coresite santa clara 2972 stender.    I still have dedicated servers to do, but that won't impact the VPS customers.
so yeah, srn found a bug;    the NIC offloading stuff has never worked properly for virtual guests... but with the latest RHEL/CentOS kernel it's gone from "you drop a few packets every now and then" to "takes down your guest entirely if you send just one packet"

So yeah, uh,  we'll change the starting image to add

ethtool -K eth0 tso off gso off 

to /etc/rc.local.   Please do the same on your guest.
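If you want to verify the offloads actually got turned off on your guest (eth0 being whatever your interface is called):

ethtool -k eth0 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload'
# both should read "off" after the ethtool -K line above has run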

details from srn:

4 separate domUs have been seeing an instance of this bug - probably more will do so as they upgrade:

http://xen.crc.id.au/bugs/view.php?id=3

This behavior on the dom0 side (disconnecting when it sees a packet that is too large) was introduced in 2.6.18-348.4.1.el5.  It is not present in 2.6.18-348.3.1.el5.  It is still present in 2.6.18-348.6.1.el5 (the latest).

40 of our servers have 2.6.18-348.4.1.el5.

There is a bug fix:

http://lists.xen.org/archives/html/xen-devel/2013-04/msg01328.html

But I don't know what the status of that is WRT centos.  I guess this redhat bug is related:
https://bugzilla.redhat.com/show_bug.cgi?id=957231

But without a redhat account we can't look.

domUs can work around this (apparently with some performance impact) by running

ethtool -K eth0 tso off gso off

Considering we have 40 servers running 4.1 and only 4 people have been affected, is the best thing to do just to send a list out to announce / the blog and throw swatch on the console logs?

I may poke at the centos virt mailing list and ask if they know if there's a timeline for applying the patch to netback I linked to above.


NOTE:  please read the report written by srn.  It's better:

http://blog.prgmr.com/xenophilia/2013/05/unplanned-downtime-in-santa-cl.html

So, a co-lo customer brings in a desktop so cheap that it has a manual
120-240V switch.  My co-lo, of course, is 208V.

Everything made in the last 10 years auto-switches 100-240v. But not this
garbage.

Anyhow, he plugged it in, and this destroyed my PDU.  We went to the
office and grabbed a spare, mounted it, and you should be back up (the
power cord situation in rack 05-11 is now much, much worse, due to plugging
it back in during a panic.)

This is my fault, of course;  I thought this person was pretty competent,
and I had a little room left in 05-11 and nowhere else, so I put him in with
the production stuff.   Clearly, this was a mistake.

Interestingly, this was on a sub-pdu (that was plugged in via a c19->c20;  my main pdu has several c19 outlets as well as a bunch of c13)  - and the sub-pdu works just fine still;  it was the main PDU that completely fried.   I need to learn more about electricity.     

anyhow, you should be back up now.  I'm sorry.

Additional unplanned downtime

so yeah, we have a handful of dell C6100s in the bottom of each of the racks.   These are part of a deal I'm working on where I help unixsurplus lease some of their stuff.  Anyhow, this evening around 8:00pm my time, Miles was swapping out one of the units.  And here is why it's a bad idea to put these things on the bottom of the rack:   the 4 blades all come out the back... and the L6-30 is right there in the way.   Well, the upshot is that power got disconnected for the whole rack.

This outage was exacerbated by me just doing the minimum to get servers up last night.   Some of the servers, for instance, defaulted to non-xen linux in the grub config (we often do this when troubleshooting the serial console; we then manually select the xen kernel, and fix menu.lst after the box is up.  Well, I didn't fix menu.lst when I was configuring these this morning.)
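For reference, checking and fixing that on a box is quick; a rough sketch for grub-legacy (the entry order is whatever it happens to be on the particular server):

grep -nE '^(default|title)' /boot/grub/menu.lst         # grub-legacy counts title entries from 0
sed -i 's/^default=.*/default=0/' /boot/grub/menu.lst   # point default at the xen entry if it is the first title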


So yeah, what are we going to do to fix this problem?   well, first, I think the importance of being careful around power connectors has been impressed upon miles.  but more importantly, now we are moving all the c6100 units up to mid rack height.  
Wow.  That was a lot more than an hour of downtime.   I'm sorry.   Several
things went wrong.  You know all those plans I was talking about with
regards to having everything set and ready to go?   Well, first?
the console server wasn't all the way set up.    Next?  turns out, I had 5
pre-made opengear dongles, not 10 like I needed;  I of course have plenty
of unconfigured db9->rj45 adaptors about, but the instructions I had were
by color (which is stupid, as there aren't standard colors for the
rj45->db9 dongles, and my new ones are different)  and it took me a while
to figure out how to make a proper opengear rj45->db9 dongle.

Then, the rest?  mostly just being stupid and tired.   Like we swapped the
network ports (which I had pre-configured) then stood around like idiots
wondering why the thing boots so slowly.

Anyhow, I've got a 5 year lease on my four 5kw racks here in coresite
santa clara, so for you?  the ordeal is over.  Yes, at some point I'll
want to move you on to better hardware, but that usually comes with better
performance and usually more disk/ram and stuff, and often it can be done
at the user's convenience.  (that's what I told the users who moved, anyhow.)


Anyhow, I think I'm not moving anything tonight.  I will need to clean up the
co-lo, but most of my mistakes here were of the tired and stupid variety.  


I'm... kind of an idiot.   See, I'm up writing this instead of sleeping, when I'm clearly stupid at this point.   but yeah.  Thing is?  the new network consists of one force10 10G switch (4x xfp, 24x cx4) for all four racks, plus one woven trx100 (4 port 10g cx4, 48 port 1g rj45) switch per rack.  I'm using lacp to aggregate the 10g links into one 40G link (it actually works pretty well; because you all have different mac addresses, the traffic is well-balanced.   a common problem with layer two link aggregation is that it usually uses the source and dest mac addresses to figure out which link to send a packet down, so it's easy to get unbalanced traffic... but you all have different mac addresses, so it spreads traffic as well as a much larger network would.)
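As a toy illustration of the hashing (not the switch's actual algorithm, just the src/dst-mac idea):

# xor the last octet of the source and dest macs, modulo the 4 aggregate members
src=a0; dst=3f
echo "this flow rides member $(( (0x$src ^ 0x$dst) % 4 ))"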

Anyhow, uh, yeah.   lemme check my notes.

okay, on the woven?  to put a port on a vlan, you do something like this:

interface 0/3
vlan pvid 30


or so.    Then, on the aggregate, you do something like:

on the woven:
interface 3/1
vlan acceptframe vlanonly
vlan participation exclude 1
vlan participation include 30,33
vlan tagging 30,33
lacp collector max-delay 0
exit


See, I thought that the 'vlan tagging' thing was for access-mode links, to tag incoming packets that don't already have a vlan header with a particular vlan, so I removed (!) it... from a working (!) vlan.  (In fact, 'vlan tagging' is what makes the aggregate carry those vlans tagged.)  So clearly, it was a stupid, stupid, undercaffeinated mistake.

I figured the problem out shortly after making myself a cup of coffee.  
uh, yeah, set your IPs statically for now.  We're getting reports of DHCP being broken in some places but not others.  It doesn't make sense to me right now.  I'm going to set up a dhcrelay server on each subnet in each location, and that should cover it.
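For the curious, the relay setup is nothing fancy - a sketch using ISC's dhcrelay, where the interface and the dhcp server address are placeholders, not our real config:

dhcrelay -i eth1 192.0.2.10   # forward DHCP broadcasts from this subnet's interface to the central dhcp server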

An error in judgement

so yeah, I emailed the users on 15 servers, told them they'd have an hour of downtime "Saturday Evening"

I figured I'd do 5 servers at a go.   small number.   

So yeah, I came up with some spare rails, screwed 'em in, set up the PDU, etc... did all the usual things you do before you move servers.   But these things have a way of taking longer than you expect, and I also had to configure an unfamiliar switch (my force10 10gbe setup, which I still don't have configured)

So after we set up the destination co-lo, we headed to the source co-lo (55 s. market) - this means getting tickets to remove servers and arranging it so that the security folk know we are coming and can verify our tickets before we shut things down (it always takes a few tries to get the ticket right to leave with hardware).   Anyhow, it was around 4:30AM local time when we started shutting things down.  It's 9:00am now.     I really should have called it off around 1:00am or so.

I mean, the rest of our problems have to do with the fact that these are some of the oldest servers (both hardware wise and software wise) in the fleet.   Well, that, and I'm not exactly at my mental best at this point.

Anyhow, we're back up.  I'm sorry.  


so, uh, yeah.  stupid mistake on my part.   approx. 20 mins. of downtime (longer for IPv6).   you are back up now.

and council today

UPDATE 2013-05-03 1:42 PST : Drive is replaced, the raid is rebuilding.  Estimated completion time is ~ 2013-05-04 10:30 PST.
---

Why isn't it failing out of the raid? And that's an 'enterprise' drive - weird.   Anyhow, I failed the drive by hand.   (Interestingly, this is one of those ancient servers that still has a 'consumer' drive - it was the 'enterprise' drive, not the 'consumer' one, that went bad.)

anyhow, I'm off to swap a drive.  
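For the record, the by-hand fail/replace dance looks roughly like this - the array and partition names are placeholders, not necessarily this box's:

mdadm /dev/md1 --fail /dev/sdc2      # kick the flaky member out of the mirror
mdadm /dev/md1 --remove /dev/sdc2
# after physically swapping and partitioning the new disk:
mdadm /dev/md1 --add /dev/sdc2       # starts the rebuild
cat /proc/mdstat                     # watch progress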




ata1.00: (BMDMA stat 0x0)
ata1.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: (BMDMA stat 0x0)
ata1.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: (BMDMA stat 0x0)
ata1.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
end_request: I/O error, dev sdc, sector 283579720
SCSI device sdc: 2930277168 512-byte hdwr sectors (1500302 MB)
sdc: Write Protect is off
SCSI device sdc: drive cache: write back
SCSI device sdc: 2930277168 512-byte hdwr sectors (1500302 MB)
sdc: Write Protect is off
SCSI device sdc: drive cache: write back
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: (BMDMA stat 0x0)
ata1.00: tag 0 cmd 0x25 Emask 0x9 stat 0x51 err 0x40 (media error)
SCSI device sdc: 2930277168 512-byte hdwr sectors (1500302 MB)
sdc: Write Protect is off
SCSI device sdc: drive cache: write back
raid1: sdb2: redirecting sector 262614887 to another mirror




and yeah. I'm going to take the marvell card out of Wilson and replace it with an LSI or something.

It looks like the dom0 got hit harder by the crash than the DomUs, though I would expect some of you will have to fsck.  
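If your guest came up with a dirty or read-only filesystem, something along these lines usually sorts it out (the device name depends on your distro; /dev/xvda1 is only an example):

touch /forcefsck && reboot     # sysvinit-era trick: force a full fsck on the next boot
# or, from single-user/rescue, on an unmounted filesystem:
fsck -y /dev/xvda1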


http://twitter.com/prgmrcom got many of my updates this time, though as the evening wore on, my coherency wore out too.



serious issues on rehnquist

[root@rehnquist ~]# vi /boot/grub/menu.lst 
Segmentation fault



from dmesg:

mvsas 0000:07:00.0: RXQ_ERR 20005
mvsas 0000:07:00.0: RXQ_ERR 20007
mvsas 0000:07:00.0: RXQ_ERR 20006
mvsas 0000:07:00.0: RXQ_ERR 20005
mvsas 0000:07:00.0: RXQ_ERR 20004
mvsas 0000:07:00.0: RXQ_ERR 20004
mvsas 0000:07:00.0: RXQ_ERR 20005
mvsas 0000:07:00.0: RXQ_ERR 20005
mvsas 0000:07:00.0: RXQ_ERR 20007
mvsas 0000:07:00.0: RXQ_ERR 20007


So... one possible course of action is to take it down and tear out the marvell card.   I'm going to do that now.   


I am just going to swap it with the spare.   
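For anyone wondering how we tell which disks hang off the marvell card, roughly:

lspci -nn | grep -i marvell        # find the card's PCI address
ls -l /sys/block/sd*/device        # each sdX links back to its controller's PCI path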
