cattle and girdle down - or "Why read-only friday exists"

As you may know, girdle and cattle are the only boxes that are not in coresite santa clara or coresite san jose. They are the only boxes I don't have 24x7 physical access to. These are in Sacramento.

So yeah. My pager went off. Now, usually when Sacramento goes offline, it's just a network issue, so I go and ping the hosts, and they come up while I am pinging, so I go about my business.

An hour or so later, my provider sent me an email:

"The reboot device that you are connected to did reset when we updated our configuration. On boot, it looks like there are some errors on the box. I have enabled the KVM so that you can take a look at the console."

I get that sick feeling in my stomach, you know, the one you get when you realize that you just ignored an important page.

So, I log in to girdle... it is mostly okay, but the xen packages got screwed up during the last upgrade. srn unfucked them, then spent some time applying her network security patches, which screwed things up because of an old version of the xen networking scripts, but she is figuring it out. Girdle should be back up by the time this blog posts. (Note: people were up on girdle with broken network for a bit, but should be okay now.)

Cattle? Cattle, on the other hand, somehow managed to completely screw itself in the power outage. No ssh, no nothing. I go and hit it from the girdle serial console, and I'm in single-user mode. While I'm trying to figure out what's going on, I reboot it into a non-xen kernel with console=xvc0 (xvc0 is the xen virtual console, so a non-xen kernel puts nothing on the serial line)... which, of course, locks me out.

The upshot?  Girdle is back for tonight, cattle is down until my provider wakes up.

The lessons here are:

1. I need to make sure I have physical access to my stuff.  

I have been talking about moving out of there forever now; I need to finish migrating everyone to new servers in coresite with new IPs.

2. There is a reason for read-only friday. As far as I can tell, my provider spent all day unfucking his broken rebooter, and then left me on a KVM, went home, turned off his pager and went to sleep. This is why you don't start things when you are tired or when you are planning on leaving. Sometimes things take longer than planned; plan for this by giving yourself time.


3. The first time hardware burns you, okay, fine. Maybe you were using it wrong? But the second or third time a rebooter causes an outage? Defenestrate that shit.

Edit: we're back. Turns out my provider was still at the datacenter, just, you know, doing shit rather than answering his phone. He rebooted cattle, and srn fixed it, so everyone should be back or coming back.

