June 2013 Archives

Between 8:00 and 8:30 PST today some of you may see kernel messages about xvde reconnecting or xvdf connecting. /dev/xvde is the rescue image and /dev/xvdf contains the distribution archives. Due to a bug in our move script, some VPSes had the wrong rescue image architecture and/or lacked the distro volume entirely.

If after 8:30 PST today you still see the following message:

Error 13: Invalid or unsupported executable format

when booting the rescue image, please try a full shutdown and start of your vps via the management console before contacting support.
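
If you want to check whether your guest logged any of those xvde/xvdf messages, here is a minimal sketch in Python. It only filters lines of text for the two device names mentioned above; it makes no assumption about the exact wording of the kernel messages, and the function name `find_xvd_messages` is made up for this example.

```python
import re

# Matches xvde or xvdf, optionally followed by a partition number
# (e.g. "xvde", "/dev/xvdf1"). Other devices such as xvda are ignored.
XVD_PATTERN = re.compile(r"\bxvd[ef]\d*\b")


def find_xvd_messages(lines):
    """Return the lines that mention xvde or xvdf, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if XVD_PATTERN.search(line)]
```

One way to use it is to pass it an open file or `sys.stdin` from a small script and pipe the output of `dmesg` into that script.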

More downtime on javier

So, it looks like the power cord we used was not sized for the retention bracket. It got jostled. I have no excuse.

Was this the cause of the previous outages? Well, the previous outages didn't correspond to cables getting moved here, and in those outages the server was still powered up when I got here, so... probably not.

It's got a new power cable that is correctly sized for the retention bracket.

javier down again

Five days ago javier went down (http://blog.prgmr.com/xenophilia/2013/06/javier-down.html) and we were unable to find a root cause. In the meantime, Luke prepared a new server.

It went down again today at about 11:05 PST. Luke is on his way right now to swap out the servers. We expect service to be restored within about an hour.

It's back on different hardware. You should be okay now (the RAID is syncing, so performance won't be awesome for a while, but you should be back).

So yeah, outage tonight. The link from 55 S. Market to 250 Stockton needed emergency maintenance. I buy this link from EGI; I just buy a tagged port capped at 100Mbps on both ends. It's not even QinQ; I have a short list of VLANs. EGI buys (I think) a wave from XO. The 'emergency maintenance' alert came down from XO to EGI, who forwarded it to me with 'scheduled maintenance event' as the subject.

Now, CoreSite, the place I buy datacenter space from, alerts me whenever they do /anything/ to their power: "Planned maintenance, pokin' the batteries", something like that. None of those has ever actually impacted my service, so I didn't really read this one carefully. Until my pager went off. Yeah. So XO says it's "emergency maintenance"; if that had been in the header, I'd have read it more closely.

I mean, my fault for not reading it more closely. (As it is, I'm almost out of 250 Stockton, so it doesn't have redundant connectivity; other than moving people ahead of time, there wasn't much I could do.)

As I show up to 250 Stockton, though, I see four PG&E trucks departing, one of them a giant flatbed with a backhoe on it. Related?

Anyhow, we're back. If you had downtime, contact me to arrange to be moved to the Santa Clara data center.

javier down

UPDATE: We were not immediately able to find the reason javier went down. Given that nothing was printed to the console, it is quite possibly a hardware problem. We will be reviewing the hardware monitoring configuration. All guests are up at this time (19:27 PST).

Javier has been down since approximately 18:10 PST. There is no indication of the reason on the console and Xen is not responding. It turns out the remote rebooter capabilities have not been fully set up for javier, so Luke is heading over to the data center right now to take a look.

About this Archive

This page is an archive of entries from June 2013 listed from newest to oldest.
