what exactly happened on holmes and how we will prevent that from happening again

srn is working on the actual data recovery right now. 

First? Regardless of how the recovery goes, while holmes is down I'm going to double the RAM in the box (and double the RAM for all users on holmes). Holmes users will keep that doubling no matter how our data recovery efforts turn out. (Note: when allocations are later increased for all users, your allocation won't go down, but your increase will be based on what you are paying for, not on what you have now. E.g., if I double your RAM now and later give all users 4x RAM, you will get 4x what you are paying for, or double what you will have after I upgrade holmes.)
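To make that arithmetic concrete, here is a minimal sketch of the rule as stated above: a multiplier applies to the paid-for amount, and an allocation never shrinks. The function name and the 1024 MiB plan size are made up for illustration.

```python
def new_allocation(paid_mib, current_mib, multiplier):
    """Multiplier applies to the paid-for amount; the allocation never goes down."""
    return max(current_mib, paid_mib * multiplier)

paid = 1024  # hypothetical plan size in MiB

# Doubling for holmes users now.
after_holmes_upgrade = new_allocation(paid, paid, 2)

# A later 4x increase for all users is based on the paid amount, not the doubled one.
after_4x = new_allocation(paid, after_holmes_upgrade, 4)

print(after_holmes_upgrade, after_4x)  # 2048 4096
```

So a holmes user on this hypothetical plan ends up at 4x the paid amount, not 8x.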

Next, I'm going to explain a bit about how our PXE setup works, and how we will change it.

I mean, at root here? I fucked up. I should have checked the PXE config before booting, and failing that, I should have been faster about yanking the power when it came up with 'welcome to Centos' instead of 'centos rescue' in the top left corner of the serial console. I would have been less likely to make this mistake if I were less burnt out. (I have PXE-booted servers into rescue mode hundreds, perhaps thousands of times with this system, and this is the first time I've made this mistake.) So yeah. It's /very/ good that this was the last of the xen servers to move. I still have dedicated servers to move, but I can do those one at a time, and usually I can have my co-lo customers come in and help me move their stuff for them.

First? The current setup: holmes has a MAC address of 00:25:90:2c:a3:a0. So our DHCP server has an entry like so:

host holmes.prgmr.com {
        hardware ethernet 00:25:90:2c:a3:a0;
        option host-name "taft.prgmr.com";
        filename "pxelinux.0";
}

Our TFTP server has the file pxelinux.0, which then downloads, also over TFTP, the file


which has the following contents:

SERIAL 0 38400
default centos
label centos
kernel vmlinuz
append initrd=initrd.img serial console=ttyS0,38400n8 ks=http://www.schmalenberger.us/prgmr/holmes-ks.cfg ksdevice=00:25:90:2c:a3:a0
#append initrd=initrd.img serial console=ttyS0,38400n8
#append initrd=initrd.img serial console=ttyS0,38400n8 ksdevice=00:25:90:2c:a3:a0 rescue

(That password hash is temporary, and at no time is root login permitted over ssh anyhow, so I'm okay with this being public. Really, the big weak (security) point in this system is if I somehow screw up the firewall rules that prevent customers from running their own DHCP servers... but I'd have to screw up that rule /and/ the malicious customer would have to be running the DHCP server at the moment I configured the server to PXE boot. In normal operation, the servers boot off disk, not network. But it's still a weak point. I ought to PXE off of an internal, trusted-host-only network.)

Anyhow, by policy, after we install the server, we comment out the append line containing ks= (that's the install line) and uncomment the rescue line:

append initrd=initrd.img serial console=ttyS0,38400n8 rescue ksdevice=00:25:90:2c:a3:a0
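That comment/uncomment step is mechanical enough to script. Below is a minimal sketch, not the tooling actually in use here: `switch_to_rescue` is a hypothetical helper, and the `example.invalid` URL and `aa:bb:cc:dd:ee:ff` MAC are placeholders. It assumes the per-host file follows the layout shown above, with one active `ks=` append line and one commented-out `rescue` append line.

```python
def switch_to_rescue(config_text):
    """Comment out active kickstart append lines; enable the commented rescue line."""
    out = []
    for line in config_text.splitlines():
        stripped = line.strip()
        commented = stripped.startswith("#")
        body = stripped.lstrip("#").strip()
        words = body.split()
        is_append = bool(words) and words[0].lower() == "append"
        # "ks=..." is the install trigger; "ksdevice=..." does not match this test.
        has_ks = any(w.startswith("ks=") for w in words)
        if is_append and not commented and has_ks:
            out.append("#" + stripped)   # disable the install line
        elif is_append and commented and "rescue" in words:
            out.append(body)             # enable the rescue line
        else:
            out.append(line)             # leave everything else untouched
    return "\n".join(out) + "\n"

# Placeholder config in the same shape as the one above.
example = """default centos
label centos
kernel vmlinuz
append initrd=initrd.img ks=http://example.invalid/host-ks.cfg ksdevice=aa:bb:cc:dd:ee:ff
#append initrd=initrd.img ksdevice=aa:bb:cc:dd:ee:ff rescue
"""
print(switch_to_rescue(example))
```

Running it a second time on its own output changes nothing, since the install line is already commented and the rescue line is already active.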


I'm going through with grep right now and verifying that this has been done everywhere.  (It hasn't.  I'm fixing by hand.) 

For now, I've added 

grep ks /srv/tftp/pxelinux.cfg/* |grep -v "#append" |grep -v rescue

to my crontab, so I'll be emailed  if there is a dangerous pxelinux.cfg, but that's a pretty clumsy way of going about it.
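A slightly less clumsy version of the same check could parse each file instead of chaining greps. This is only a sketch under assumptions: `dangerous_lines` and `scan` are hypothetical names, and it treats any uncommented append line that has a `ks=` argument and no `rescue` argument as dangerous. Unlike the grep pipeline, it does not trip on `ksdevice=` alone.

```python
from pathlib import Path

def dangerous_lines(config_text):
    """Return active append lines that would start a kickstart install."""
    hits = []
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # commented out, so pxelinux ignores it
        words = stripped.split()
        if not words or words[0].lower() != "append":
            continue
        if any(w.startswith("ks=") for w in words) and "rescue" not in words:
            hits.append(stripped)
    return hits

def scan(directory):
    """Map each config file under `directory` to its dangerous lines, if any."""
    found = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            hits = dangerous_lines(path.read_text(errors="replace"))
            if hits:
                found[str(path)] = hits
    return found
```

From cron, printing the result of `scan("/srv/tftp/pxelinux.cfg")` only when it is non-empty would keep the email-on-problem behaviour of the grep version.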

Next? I am going to change my standard .ks files. First, I'm going to have Nick remove all the .ks files he has on his webserver (without a .ks file, nothing is going to happen automatically), and I'm going to hide all the .ks files that I control, too.

Going forward?  I'm going to automate the management of the kickstart files, but that's a project for another day.

This page contains a single entry by luke published on May 27, 2013 1:29 PM.