luke: February 2011 Archives

network outage at SVTIX again

I think it lasted about 20 minutes. I believe it was an upstream problem (though this is not confirmed at the moment, and boy, won't I be embarrassed if it turns out to be my router crashing). This upstream was pretty good for the first year or so, but they've been getting less reliable. I guess we need to quit talking about it, build our own BGP router, and get a secondary upstream.

I've been holding back on this project just 'cause Nick and I don't have a lot of experience running BGP. We're announcing some swamp one of our customers has as a start, but even that hasn't been without hiccups. Generally speaking, I would think that leaving it to my upstream would result in a more reliable system, but of late that hasn't been the case, so I suppose we need to roll our BGP systems into production.


edit:  here is what our upstream had to say:


"EGIHosting.com - Support" writes:

Hi Luke,
There was an emergency maint that was done on the BACKBONE fiber going to SJ DC.  Everything is done and back up normal now.


--

They've been blaming most of the recent outages on XO, who apparently provides them point-to-point links to whoever they actually buy bandwidth from.

I need to find out who can get me transit at SVTIX without going through those (apparently fragile) XO lines.

Wow. Linux RAID 1+0 seems to rebuild a /whole lot/ faster than our previous setup, which used LVM to stripe across two Linux RAID 1 mirrors.



Personalities : [raid1] [raid10]
md1 : active raid10 sde2[4] sdb2[1] sdc2[2] sda2[0] sdd2[5](F)
      955802624 blocks 256K chunks 2 near-copies [4/3] [UUU_]
      [>....................]  recovery =  4.0% (19262848/477901312) finish=212.7min speed=35926K/sec

md0 : active raid1 sde1[3] sdb1[1] sdc1[2] sda1[0] sdd1[4](F)
      10482304 blocks [4/4] [UUUU]

unused devices: <none>
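
For anyone who wants to sanity-check the rebuild estimate, here is a rough sketch that recomputes the percentage and time remaining from the recovery line above. It assumes the two numbers in parentheses are 1K blocks (done/total) and that the speed is in K per second; the function name `estimate_rebuild` is made up for illustration and is not part of any md tooling.

```python
import re

# Recovery line copied from the /proc/mdstat output above.
RECOVERY_LINE = (
    "[>....................]  recovery =  4.0% "
    "(19262848/477901312) finish=212.7min speed=35926K/sec"
)


def estimate_rebuild(line):
    """Return (percent_done, minutes_left) for an mdstat recovery line.

    Assumes the parenthesised counters are 1K blocks and the speed is
    in K/sec, so blocks divided by speed gives seconds.
    """
    m = re.search(r"\((\d+)/(\d+)\).*speed=(\d+)K/sec", line)
    if m is None:
        raise ValueError("not a recovery line")
    done, total, speed = (int(g) for g in m.groups())
    percent_done = 100.0 * done / total
    minutes_left = (total - done) / speed / 60.0
    return percent_done, minutes_left


percent, minutes = estimate_rebuild(RECOVERY_LINE)
print(f"{percent:.1f}% done, about {minutes:.1f} minutes left")
```

The result lands within a tenth of a minute of the kernel's own finish= figure, so at the current speed the whole rebuild works out to a bit under four hours.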



knife hung hard.

Not sure what the problem is. The serial console was configured correctly, but I see nothing on it. I will have Nick go over it later; he set up some logging that I might have missed.

Anyhow, the server is coming back up right now.  

edit: knife is back up. Complain loudly (preferably to support@prgmr.com) if you are still down.

network outage at SVTIX last night

I'm still figuring out the details.  


Oh, and thanks to everyone who included traceroute output in their complaints. That helps quite a bit. The problem was clearly some sort of networking


update:  my upstream got back to me and said

"Hi Luke-

It appears there was a Fibre line cut which caused the network outage. The connection has since been repaired by XO. Please let us know if you need anything else."

As I'm sure you can all see, I don't see any 'xo' on my route... my belief is that they are leasing a (somewhat unreliable) line from XO to backhaul he.net bandwidth to the SVTIX data center, as nobody in he.net Fremont went down.

About this Archive

This page is an archive of recent entries written by luke in February 2011.
