February 2011 Archives

crock reboot

Crock crashed at 7:06 PST this morning with the same "soft lockup detected on cpu #0" error we've seen before on it and on some of the other dom0s, mostly ones running Xen 3.4, I think. I've rebooted it, and the domUs are starting to come back up.

Update: everybody on crock should be back up now except for one person whose menu.lst file seems to be unreadable. Email support@prgmr.com if you are still having trouble. Thanks.

knife reboot

Knife crashed again and I suspect it was because of the "soft lockup" problem we have seen previously. I don't know for sure because its console server isn't part of the logging program we are using with other servers, so we really need to finish getting that integrated. Everybody on knife should be back up and running now though. Email support@prgmr.com if you are still having trouble. Thanks!
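
Once a server's console output is being captured to a file, checking it for the lockup message could look something like the sketch below. This is only an illustration: the function name and the sample lines are made up for the example, and the pattern is built around the wording quoted in these posts rather than any particular kernel version's exact message.

```python
import re

# Loose, case-insensitive match for messages such as
#   soft lockup detected on cpu #0
# The CPU number is captured so hits can be grouped per CPU.
LOCKUP_RE = re.compile(r"soft lockup.*?cpu\s*#?\s*(\d+)", re.IGNORECASE)


def find_soft_lockups(lines):
    """Return (line_number, cpu, text) for each line that mentions a soft lockup."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        match = LOCKUP_RE.search(line)
        if match:
            hits.append((lineno, int(match.group(1)), line.rstrip("\n")))
    return hits


# Hypothetical console log contents, for illustration only.
sample_log = [
    "some ordinary console chatter\n",
    "soft lockup detected on cpu #0\n",
    "more ordinary console chatter\n",
]

for lineno, cpu, text in find_soft_lockups(sample_log):
    print(f"line {lineno}: cpu {cpu}: {text}")
```

The function takes any iterable of lines, so an open file object for whatever file the console logger writes can be passed in directly.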

network outage at SVTIX again

I think it lasted about 20 minutes. I believe it was an upstream problem (though this is not confirmed at the moment, and boy, won't I be embarrassed if it turns out to be my router crashing). This upstream was pretty good for the first year or so, but they've been getting less reliable. I guess we need to quit talking about it, build our own BGP router, and get a secondary upstream.

I've been holding back on this project just 'cause nick and I don't have a lot of experience running BGP; we're announcing some swamp one of our customers has as a start, but even that hasn't been without hiccups. Generally speaking, I would think that leaving it to my upstream would result in a more reliable system, but of late that hasn't been the case, so I suppose we need to roll our BGP systems into production.

edit:  here is what our upstream had to say:

"EGIHosting.com - Support" writes:

Hi Luke,
There was an emergency maint that was done on the BACKBONE fiber going to SJ DC.  Everything is done and back up normal now.


They've been blaming most of the recent outages on XO, who apparently provides them point-to-point links to whoever they actually buy bandwidth from.

I need to find out who can get me transit at SVTIX without going through those (apparently fragile) XO lines.

hamper reboot

Hamper.prgmr.com just rebooted, and we don't have logging set up for its serial console, so we may not be able to find out why. I'm looking through the logfiles for other clues, though. I'll update here again when everybody on hamper is up and running. We're also planning to set up conserver with logging for the servers at Fremont, like we have at svtix; we should probably do that sooner because of this.

update: everybody should be back up now, except for one person who I emailed. Email support if you still have problems. Thanks!

wow. linux raid 1+0 seems to rebuild a /whole lot/ faster than our previous setup, which was using lvm to stripe across 2 linux raid 1 mirrors.

Personalities : [raid1] [raid10]
md1 : active raid10 sde2[4] sdb2[1] sdc2[2] sda2[0] sdd2[5](F)
      955802624 blocks 256K chunks 2 near-copies [4/3] [UUU_]
      [>....................]  recovery =  4.0% (19262848/477901312) finish=212.7min speed=35926K/sec

md0 : active raid1 sde1[3] sdb1[1] sdc1[2] sda1[0] sdd1[4](F)
      10482304 blocks [4/4] [UUUU]

unused devices: <none>
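
For anyone wondering where the "finish" figure comes from, it appears to be just the remaining blocks divided by the current speed. Here is a rough sketch that redoes the arithmetic from the progress line above. The function name is made up for the example, it assumes the progress line is on a single line as the kernel prints it, and it assumes the two numbers in parentheses are in the same 1K units as the K/sec speed.

```python
import re

# Matches the progress line md prints in /proc/mdstat during a rebuild, e.g.
#   recovery =  4.0% (19262848/477901312) finish=212.7min speed=35926K/sec
PROGRESS_RE = re.compile(
    r"(?:recovery|resync)\s*=\s*[\d.]+%\s*"
    r"\((?P<done>\d+)/(?P<total>\d+)\)\s*"
    r"finish=(?P<finish>[\d.]+)min\s*"
    r"speed=(?P<speed>\d+)K/sec"
)


def rebuild_eta_minutes(mdstat_text):
    """Estimate minutes left in a rebuild, or None if no progress line is found."""
    match = PROGRESS_RE.search(mdstat_text)
    if match is None:
        return None
    done = int(match.group("done"))
    total = int(match.group("total"))
    speed = int(match.group("speed"))  # K/sec
    if speed == 0:
        return None  # stalled rebuild: no meaningful estimate
    return (total - done) / speed / 60.0


# The progress line from the output above.
line = ("      [>....................]  recovery =  4.0% "
        "(19262848/477901312) finish=212.7min speed=35926K/sec")

eta = rebuild_eta_minutes(line)
print(f"about {eta:.1f} minutes left at the current speed")
```

The result lands within a fraction of a minute of the 212.7min that mdstat reports. The speed moves around as the rebuild goes, so the estimate is only as good as the current reading.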

knife hung hard.

not sure what the problem is. The serial console was configured correctly, but I see nothing on it. I will have nick go over it later; he set up some logging that I might have missed.

Anyhow, the server is coming back up right now.  

edit: knife is back up. Complain loudly (preferably to support@prgmr.com) if you are still down.

network outage at SVTIX last night

I'm still figuring out the details.  

Oh, and thanks to everyone who included traceroute output in their complaints. That helps quite a bit. The problem was clearly some sort of networking

update:  my upstream got back to me and said

"Hi Luke-

It appears there was a Fibre line cut which caused the network outage. The connection has since been repaired by XO. Please let us know if you need anything else."

As I'm sure you can all see, I don't see any 'xo' on my route... my belief is that they are leasing a (somewhat unreliable) line from XO to backhaul he.net bandwidth to the svtix data center, as nobody in he.net fremont went down.