srn: October 2014 Archives

DDOS followup

The attack was an SSDP reflection attack. We have sent abuse reports to Time Warner Communications, Comcast, China Unicom, and Charter Communications.  These accounted for about half of the unique IPs observed as well as a significant portion of traffic.

High packet loss on cattle/girdle

UPDATE 17:11: this is possibly due to problems on our primary network, investigating.
Separate from the DDOS attack, we are experiencing occasional high packet loss with cattle and girdle.  We have contacted our upstream provider regarding the issue.

Incoming DDOS attack earlier today

We experienced an inbound DDOS attack starting at approximately 15:30 PDT (-0700).  Luke identified the destination and black-holed it around 15:45.  I worked with the customer in question to try to identify any factors that may have contributed to their being targeted or that made the DDOS worse.

girdle back up

The RAID is rebuilding.  Apologies for the extra 40-minute downtime. We accidentally sent a drive of the wrong size, but our service provider had a drive of the correct size on hand and replaced it for us. Then grub was not installed correctly, which meant using a boot CD... in any case we are OK now.

Girdle reboot about 12 hours from now

We will be rebooting girdle on Wednesday, October 22 between 9 and 10 PM PDT (-0700) in order to bring a new drive online.

ext4 boot partition support added

pv-grub has been updated to the version from centos 6.  This supports ext4 and has been verified to work with ext4 from debian wheezy.  Please contact us if you find any issues.

rescue image / distributions updated

Instances on all hosts other than birds, stables, and lion only need to reboot in order to see the changes.

Customers with instances on birds and stables need to log into the management console, do a full shutdown (shutdown -h now), and then start the instance again from the management console.

The three remaining customers on lion need to be moved off - if you have not received an email please contact

The new distribution list is as follows:
arch-2014.09-64 (installer only)
centos7-docker-64 (docker pre-installed, not docker image)
netbsd-6.1.5-32 (installer only)
netbsd-6.1.5-64 (installer only)
ubuntu-trusty-14.04-docker-64 (docker pre-installed, not docker image)

Netboot installers are available from the rescue image for all of the pre-configured distributions, though the memory requirements are much higher to install than to run. centos 5 is also available as an installer. Please refer to the kickstart/preseed files in for examples of how to make sure your install will work on our systems.

We will be updating our documentation over the next several days to reflect the new installation procedures for netbsd and arch linux, as well as a discussion of best practices for configuring a new install.

Unfortunately right now it is difficult to release new rescue images and distributions very quickly. The locally stored block devices are always attached to the guests, even when not in use, and we don't want to dynamically detach them.  Within the next few months we intend to move over to a system without these constraints such that we can distribute new images on a regular schedule and when critical security updates are released.

SSH unreliable on trygve

UPDATE 2014-10-16 14:15 PDT (-0700):  The problem has been temporarily fixed by setting "UsePAM no" in sshd_config; however, no root cause is immediately obvious.  If/when I have more information I will post it here.

Making an ssh connection to trygve may be difficult, but not impossible, right now.  We are investigating the issue. In the meantime, you can run the following at the command line to retry until a connection succeeds:

while ! ssh <user>; do true; done

Once the connection is established there should be no issues.
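
The one-line loop above retries as fast as it can and never gives up. A gentler variant is sketched below, assuming a POSIX shell. The function name retry, the attempt limit, and the delay are all illustrative and not part of our tooling, and user@host is a placeholder for your own login.

```shell
# retry MAX DELAY CMD...: run CMD until it succeeds or MAX attempts are used.
# Returns 0 as soon as CMD succeeds, 1 if every attempt failed.
retry() {
  max=$1; delay=$2; shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$max" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"   # pause between attempts instead of hammering sshd
  done
}

# Hypothetical usage (placeholder login, up to 20 attempts, 2 seconds apart):
# retry 20 2 ssh user@host
```

As with the original loop, an ssh session that ends with a non-zero exit status counts as a failure and triggers another attempt.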
An overnight rebuild failed, apparently due to an issue with, and the new build did not complete early enough to make tonight's maintenance window.

Since the likely customer impact is very low, instead of rescheduling maintenance I will check for instances that rebooted around the time the rescue image and distributions are swapped out and start them manually.  I will send an update when this is complete.

UPDATE:  The file sync is taking much longer than expected - only field has been updated.  For everyone else I am rescheduling to Wednesday 2014-10-15, again from 20:00-22:00 PDT (-0700).
Tonight I will be switching out the distribution and rescue images. I will do that sometime between 20:00 and 22:00 -0700. Guests that attempt to reboot during that time may have to be started from the management console. I will give an update when everything is complete.

I ran vgs across all the servers and saw a warning like this:

Found duplicate PV QeHF8dKmUSAHDDdpyreQUQ0i93hRPDJK: using /dev/sde2 not /dev/md1

It was a scary message but ultimately turned out to be a false alarm.

This script shows whether any LVM physical volumes sit directly on a hard drive partition rather than on an md device:

# For each partition of each sd? disk, list the devices stacked on top of it.
for i in /sys/block/sd?; do
  for j in "$i"/sd*; do
    echo "$j:$(ls "$j/holders")"
  done
done

Any results containing "dm" are Bad. Fortunately there were no servers where this was unexpectedly true.

It turns out that setting

md_component_detection = 1

in /etc/lvm/lvm.conf is not sufficient to avoid scanning the hard drives.

For one thing, in the highly unlikely case that there is a problem with the md component filter, lvm will not fail. For another, if the lvm cache in /etc/lvm/cache/.cache contains hard drives, it appears that they will still be scanned.

To fix this completely, first change the "filter" line in /etc/lvm/lvm.conf to be

filter = [ "a|/dev/md|", "r|.*|" ]

and then remove or move /etc/lvm/cache/.cache.
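
One way to confirm the filter took effect is to list the physical volumes LVM reports and flag any that are not md devices. This is a minimal sketch rather than the check we actually ran. The function name non_md_pvs is made up for illustration, and it assumes every PV on the host is expected to live on /dev/md*.

```shell
# Read PV paths on stdin, one per line; print any that are not md devices.
# Exit status is 1 if at least one non-md PV was seen, 0 otherwise.
non_md_pvs() {
  bad=0
  while read -r pv; do
    case "$pv" in
      "") ;;                      # skip blank lines
      /dev/md*) ;;                # expected: PV sits on an md array
      *) echo "$pv"; bad=1 ;;     # unexpected: PV on a raw device
    esac
  done
  return "$bad"
}

# On a real host (as root):
# pvs --noheadings -o pv_name | non_md_pvs
```

No output and a zero exit status mean every reported PV is on an md device.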

About this Archive

This page is an archive of recent entries written by srn in October 2014.

srn: September 2014 is the previous archive.

srn: November 2014 is the next archive.
