luke: October 2012 Archives

If we feel alert enough for another go after that, we might move the remaining servers (birds, stables, and lion... birds and stables users were warned, but we probably aren't moving them tonight, just out of fear of doing too much at once).  But we will see.

Here is where we are at: we are eating dinner, then downtime will begin.

update 21:31
tooth downtime begins

update 21:33

pillar downtime begins

update 21:35 kvm node downtime begins.

update 00:57 -  pillar and tooth are back

update 01:37  the KVM hosts are online, the drbds look good, we're bringing the KVM guests up now.

packet loss at our fremont location


(I believe the problem was at that router, but I have no confirmation.)

I think the problem lasted for 10 minutes or so and things are okay now.  I need to go back to sleep to prepare for moving more servers tonight.

Note, due to the moves, a lot of that traffic goes down the link above, then through the VPN, and then to 55 s. market (most likely through this link:)

Unplanned downtime on horn

So I get paged about horn during rush hour.  Nothing.  No serial.  It's on a 2 in 1u, so I can't flip the PDU port. 

I head in and that side of the 2 in 1u is /hot/.  The other side is just fine.  (It feels like there is plenty of air flowing through that side, but I don't have a meter or anything, and I can't de-rack that server without taking down its twin.)

Anyhow, I take another server out of my van and swap disks.  The new server has an LSI SAS controller and Intel NIC, so I'm doing the necessary kernel work to make those work now.  It shouldn't be too much longer.

Update: 21:11

Still down; working on the kernel upgrade.

update 21:37

Kernel upgrade complete, system booted, test domains up, customer domains booting.  I think we are okay. 

update 21:57

and I discovered that I screwed up IPv6.  Fixed.

So I think we're as ready as we are going to be.  We've got all the destination ports set up; all that remains is to shut down the servers, unscrew the rails, and strap the servers down in the back of my van.

22:45: downtime begins

01:55:  confirmed Ingot is back

01:55: confirmed scepter is back

01:57: confirmed stone is back

Table and coral are both still down.
[lsc@bowl ~]$ cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdc2[0] sdb2[1]
      972566976 blocks [2/2] [UU]
md2 : active raid1 sdg2[2] sdf2[3](F) sde2[1] sdd2[4](F) sda2[5](F)
      972566976 blocks [2/1] [_U]
      [>....................]  recovery =  0.8% (8320448/972566976) finish=1388.9min speed=11568K/sec
md0 : active raid1 sdg1[0] sdf1[4](F) sde1[1] sdd1[5](F) sdc1[2] sdb1[3] sda1[6](F)
      2096384 blocks [4/4] [UUUU]
unused devices: <none>
[lsc@bowl ~]$
(bowl is from back when we used LVM to do striping.  I will be happy when we are off all of these servers.)


[lsc@rehnquist ~]$ cat /proc/mdstat
Personalities : [raid1] [raid10]
md1 : active raid10 sdf2[4] sde2[2] sdd2[0] sdc2[5](F) sdb2[3] sda2[6](F)
      955802624 blocks 256K chunks 2 near-copies [4/3] [U_UU]
      [====>................]  recovery = 20.6% (98878656/477901312) finish=141.7min speed=44576K/sec
md0 : active raid1 sdf1[1] sde1[2] sdd1[0] sdc1[4](F) sdb1[3] sda1[5](F)
      10482304 blocks [4/4] [UUUU]
unused devices: <none>

Rehnquist is more modern. 

both are rebuilding, no downtime expected. 
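The degraded arrays in those mdstat listings are the ones where the [n/m] member count doesn't match, like "[2/1] [_U]" on bowl.  A minimal sketch that flags them automatically (only mdstat's standard format is assumed; what you do with the alert is up to you):

```shell
#!/bin/sh
# Print the name of every md array whose "[n/m]" status shows
# fewer active members than the array wants, e.g. "[2/1] [_U]".
degraded() {
    awk '/^md/ { name = $1 }
         /blocks/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^\[[0-9]+\/[0-9]+\]$/) {
                     split(substr($i, 2, length($i) - 2), ab, "/")
                     if (ab[1] != ab[2]) print name
                 }
         }'
}

# On a live box, feed it the real thing.
if [ -r /proc/mdstat ]; then
    degraded < /proc/mdstat
fi
```

Dropped into cron, a non-empty output is the page.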
going down now (20:20 local time)

ugh.  that took way too long.  22:33 local time, and it's coming back up. 

Note, this was my fault.  I thought I was having serial console problems, but I wasn't.  Policy is to set 'power on after power fail' to on (so that we can always turn things on by bouncing the PDU port; prgmr policy is also to have all PDUs be remotely rebootable), but this server was set incorrectly.  I plugged it in and went to work on the serial console, where I expected problems.  It took me a very long time to realize that the box was actually off.

I blame the fact that I didn't schedule anyone else to stand by with me.  I think it's usually best to have one person in the data center and one person in a comfortable chair with good internet access.   This was just me, though.

Also note, I warned the customers with the following message:

We'll be moving hamper to 55 s. market tomorrow night as part of our
plan to move out of Fremont.  This will give us more reliable power
and additionally will actually save us money. Our current Fremont provider has
the cheapest racks around, but limits power so much that it's cheaper to go
with more expensive, higher density racks at 55 s. market.

But I failed to notify the other employees properly.  Guh.  Anyhow, we have a bunch more of these to do this week.  The next ones, I hope, will go more smoothly.

replaced bad disk in ingot

[lsc@ingot ~]$ cat /proc/mdstat
Personalities : [raid1] [raid10]
md1 : active raid10 sde2[4] sdb2[1] sda2[0] sdc2[5](F) sdd2[3]
      955802624 blocks 256K chunks 2 near-copies [4/3] [UU_U]
      [====>................]  recovery = 23.0% (110047232/477901312) finish=342.3min speed=17908K/sec
md0 : active raid1 sde1[2] sdb1[1] sda1[0] sdc1[4](F) sdd1[3]
      10482304 blocks [4/4] [UUUU]
unused devices: <none>

everything looks good.

recent packet loss.

so yeah, uh, this is what I think about the recent packet loss issues.

I mean, the root of the problem (I think) is that we are overwhelming the 1G link between 1435 and 1460 at MPT.  Most of this is not new traffic; most of it is traffic being trucked from 250 stockton to 55 s. market and pushed out.

Here are the two sides of the link in question

compared to our link, here:

So more traffic is going over our cross connect than over the uplink.  This is because of my poor network design.  

As you can see, we aren't pegging it hard or anything; 70% is our 5 minute peak.  My belief (now that the cable tested good) is that we are seeing 'microbursts' wherein traffic is briefly peaking above 1Gbit/sec.

With my switches, HP procurve 2824s,  I don't know how to verify my theory without a MIRROR port.    

I should bring in a server and set up a mirror port on that thing, and measure the throughput every 5 seconds rather than every 5 minutes or something; that would allow me to prove or disprove the microbursts theory. 
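The 5-second sampling could be a plain shell loop over the kernel's byte counters on whatever box ends up on the mirror port.  A sketch, assuming a Linux box and the mirror-port NIC's name passed as the first argument (eth0 or whatever it turns out to be):

```shell
#!/bin/sh
# Convert a byte-counter delta over an interval into megabits/sec.
mbps() {
    # args: old_bytes new_bytes interval_seconds
    echo $(( ($2 - $1) * 8 / $3 / 1000000 ))
}

# When given an interface, sample its rx counter every 5 seconds.
# Readings near 1000 here would support the microbursts theory.
if [ -n "${1:-}" ]; then
    iface=$1
    prev=$(cat /sys/class/net/"$iface"/statistics/rx_bytes)
    while sleep 5; do
        cur=$(cat /sys/class/net/"$iface"/statistics/rx_bytes)
        echo "$(date +%T) rx $(mbps "$prev" "$cur" 5) Mbps"
        prev=$cur
    done
fi
```

Even 5-second averages can hide sub-second bursts, but it's a lot closer to the truth than a 5-minute graph.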

The customer at 250 stockton that accounts for 500 or so Mbps of this traffic is in the process of moving off our network, which should be complete within a week.  That will resolve the issue for us, at least until our own traffic increases about 3x. 

but the problem of this cross connect getting saturated before the uplink will come back when our traffic grows (and as you have noticed, my bandwidth allocations are currently, well, 2007.  I really, really need to start giving customers more bandwidth, which /will/ cause me to use more bandwidth overall.) 

The problem here is that cross connect is part of a  'router on a stick' configuration, without a big enough stick.  I mean, packets from cogent to 1460 (where most of the dom0s are)  go from cogent to 250 stockton, over the transport link from egi to suite 1460 at 55 s. market, up the cross connect from 1460  to 1435, down the same cross connect, and then to the server in question.

So, uh, yeah.  three ways to solve this

1. put a router at 55 s. market in suite 1460, and stop using the router on a stick configuration;  this is probably the cleanest way to do it, but now we've got another router that can break.

2. get a bigger stick.   For like $300-$350 a month (vs the $100-$175 I'm paying for a copper cross connect)  I can get an optical link from 1460 to 1435.  I then need 10G capable switches in both 1460 and 1435.  Figure 2-6 grand, depending on brand, features, and if I go used or not.     (I can also just get a second copper link,  this will cost about the same monthly, but could be done with our current switches.)

3. traffic engineering.  make sure that all traffic at 250 stockton goes in/out Cogent and all traffic at 55 s. market goes in/out (maybe just prepend so hard that unless a link is completely down, traffic goes in/out the local link?)
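For what option 3 might look like: a sketch in Cisco-style syntax of prepending hard on the non-local session, so inbound traffic only takes the remote link when the local one is completely down.  The AS number 65001 and neighbor address 192.0.2.1 are placeholders, not our real ones:

```
! announce our prefixes out the non-local upstream with a heavily
! padded path, so the local link wins unless it is down entirely
route-map HEAVY-PREPEND permit 10
 set as-path prepend 65001 65001 65001 65001
!
router bgp 65001
 neighbor 192.0.2.1 route-map HEAVY-PREPEND out
```

Prepending only steers inbound traffic; keeping outbound traffic on the local link would still need local-preference set per site.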

I do have an extra box that could be used as a router right now.   Several, in fact. 

The UnixSurplus guy has some arista 48 port 1g switches with 4 10GbE sfp+ ports he might give me a sweetheart deal on  (I have optics)  he will also rent 'em to me if I want.

so I guess I'm kinda leaning towards the router.  Cheaper.  Of course, I also don't like those HP switches;  I wouldn't mind getting rid of 'em in favour of the aristas.  I wouldn't mind one bit.  But, I generally am not a fan of router on a stick designs, and yeah, it'd save me a good chunk of change.  


Anyhow, I need a day off, so I'm taking today off, but I thought customers have a right to know what the hell is going on.  Nick is around today and he might set up a router or at least the SPAN port.  If not, I'll at least get a server with a MIRROR port set up on that procurve on Monday.
10/12/12 07:33:34 FFI: port 2-High collision or drop rate.
                  See help.

This is either a bad cable, or "microbursts"  Let's hope it's a bad cable.  

Network going down for around 5 minutes at 9:00PM for test. 

(Yes, I have multiple upstreams now, but due to bad design on my part, all traffic to C07 and C08, where most of you are, goes over this one cross connect.  My fault, sorry.) 

and we are back.

It wasn't the cable.
[root@taney ~]# date
Mon Oct  1 13:34:55 PDT 2012
[root@taney ~]# hwclock
Thu 03 Jan 2002 10:32:22 AM PST  -0.517048 seconds

It's set now, sorry.

So, marshall went down: "Invalid memory configuration for cpu 1" - and it was only seeing half the ram, too.  So I swapped it out with an 8 core E5 with just as much ram.  Much more downtime than there should have been, but marshall users are back now, with newer and stronger hardware. 

oh, also note, the blog is on marshall, so most of the reports went on twitter.

Note, I screwed up the time on the new server.  Don't just check 'date'; also check 'hwclock', and run 'hwclock --systohc' if date is right and hwclock is wrong.  
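A sketch of the check I should have done, run with --check; it assumes GNU date and that the system clock (NTP-synced) is the trustworthy one:

```shell
#!/bin/sh
# Absolute difference between two epoch timestamps, in seconds.
skew() {
    d=$(( $1 - $2 ))
    if [ "$d" -lt 0 ]; then d=$(( 0 - d )); fi
    echo "$d"
}

# With --check, compare the system clock against the RTC and push
# the (correct) system time into the hardware clock if they differ.
if [ "${1:-}" = "--check" ]; then
    sys=$(date +%s)
    # strip the trailing " -0.517048 seconds" drift field from hwclock
    hw=$(date -d "$(hwclock | sed 's/ *-\{0,1\}[0-9.]* seconds$//')" +%s)
    if [ "$(skew "$sys" "$hw")" -gt 5 ]; then
        echo "clocks differ by $(skew "$sys" "$hw")s; running hwclock --systohc"
        hwclock --systohc
    fi
fi
```

Ten years of drift, as on taney above, would trip this immediately.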

looks like this caused a bunch of guests to hang on fsck.  I will go through and manually restart them.

About this Archive

This page is an archive of recent entries written by luke in October 2012.

luke: September 2012 is the previous archive.

luke: November 2012 is the next archive.

Find recent content on the main index or look in the archives to find all content.