February 2015 Archives

Downtime on nagios and wiki tomorrow

Nagios and wiki will be down sometime within the window of Sat Feb 28 16:00-19:00 PST -0800.
We are making infrastructure changes over the next few days and are contacting customers directly regarding such changes.

If you update your contact information, please also inform support so that you do not miss any major news.

sorry for breaking rack 05-10

00:28 <+srn_prgmr> these are down: beak.prgmr.com
00:28 <+srn_prgmr> coat.prgmr.com
00:28 <+srn_prgmr> dao.prgmr.com
00:28 <+srn_prgmr> dish.prgmr.com
00:28 <+srn_prgmr> fisher.prgmr.com
00:28 <+srn_prgmr> halter.prgmr.com
00:28 <+srn_prgmr> jay.prgmr.com
00:28 <+srn_prgmr> marshall.prgmr.com

but yeah. all of these servers were down because I messed up the switch config in 05-10. I'm sorry.

to prevent this from happening again, I plan on moving all our dedicated server customers to their own switch/router setup. that way, I will only break dedicated server customers when adding/removing new customers (and with only one switch involved, the config will be way less complex, too).

Update on fisher

One move is in progress; two are remaining of the original 50+.

The motivation for bringing the system down was that during a planned move, an md5sum on the source and destination data did not match. After verifying the instance was down and rerunning md5sum twice locally, the checksums still did not match across any of those 4 runs.

The total amount of data md5sum'ed was around 200GB, and a single bit error anywhere in it would have changed the checksum. A mismatch does not indicate mass corruption, but I wanted to minimize the risk that bad data would be marked clean, which is why the power got pulled.

After bringing up the drives in a different chassis, two md5sum runs matched each other and also matched one of the md5sums collected from fisher, which gave us sufficient confidence that the data was OK to move.
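
For the curious, the check described above can be sketched roughly as follows. This is only an illustration: we ran the md5sum command-line tool, not this script, and the function names `md5_of` and `copies_agree` are made up for the example.

```python
import hashlib


def md5_of(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file or block device at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so a very large volume never has to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_agree(source, destination, runs=2):
    """Hash both copies `runs` times; True only if every digest is identical.

    Repeating the hash matters: if the same unchanged data hashes differently
    from one run to the next, the machine doing the hashing is suspect.
    """
    digests = set()
    for _ in range(runs):
        digests.add(md5_of(source))
        digests.add(md5_of(destination))
    return len(digests) == 1
```

A single flipped bit in either copy is enough to make `copies_agree` return False, which is why a mismatch over a large volume says very little about how much data is actually affected.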

We are performing a memory test on the old fisher box and will follow up with other tests to try to understand what happened and see if it can be detected more easily in the future.

Fisher is having a hard time with the server moves, so we are going to bring it down and move its disks over to our backup server. The customers on fisher should hopefully not see too much downtime; we will get you up as soon as possible.

If you received notice that your server was going to move tonight, that wasn't wrong, but it will take a bit longer than first thought.

More updates to come. If you have any questions or concerns, please mail support@prgmr.com.

seashell is down

UPDATE 21:06 PST: All customers should be back up. Total downtime was approximately 1 hour 20 minutes.
UPDATE 20:56 PST -0800: There are no error messages in syslog indicating the reason for the reboot. The drives have been moved to a different chassis with a very similar processor. We will be bringing customers back up shortly.
Luke is on his way to the data center to investigate. 18 customers are affected.

pvgrub updated

This should take care of the following error:

xc: error: panic: xc_dom_bzimageloader.c:699: xc_dom_probe_bzimage_kernel unable to XZ decompress
kernel: Invalid kernel

Please email support if you run into any issues.

discussion of fisher downtime

Last night (3am our time) fisher rebooted itself and at that reboot it was not running xen. I then rebooted fisher into xen because it seemed like the right thing to do at the time.

I was not paged for the original reboot because xen rebooted it automatically and the amount of time the machine was unavailable was less than the alert cutoff. We will be monitoring a canary on fisher in addition to fisher itself, so should this happen again I should get paged.
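
To illustrate why a short outage can slip past an availability check while a canary would catch it, here is a rough sketch. The cutoff values, the function names, and the idea of comparing the canary's uptime between polls are all assumptions made for the example; this is not our actual monitoring configuration.

```python
def host_check_pages(outage_seconds, alert_cutoff_seconds):
    """A plain availability check only pages when the outage outlasts the cutoff."""
    return outage_seconds > alert_cutoff_seconds


def canary_check_pages(previous_uptime, current_uptime):
    """Page if the canary guest is unreachable or was restarted since the last poll.

    Uptimes are in seconds. `current_uptime` is None when the canary did not
    answer the poll at all (for example, the host came back without xen).
    """
    if current_uptime is None:
        return True
    # A guest that stayed up reports a larger uptime on every poll.
    # A smaller value means it was restarted in between.
    return current_uptime < previous_uptime
```

With a hypothetical 5 minute cutoff, a 2 minute reboot gives `host_check_pages(120, 300) == False`, so nobody is paged, while the canary either fails to answer or reports a reset uptime, and `canary_check_pages` returns True.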

There were these errors in the log last night:

Feb  1 00:43:30 fisher kernel: xend[32715] general protection rip:2af0a12bda2e rsp:7fffa1954b90 error:7ffc
[Sun Feb  1 00:43:30 2015] Bad pte = 8100000040800165, process = ???, vm_flags = 100073, vaddr = 3eb7531000

Call Trace:
 [<ffffffff8020cf43>] vm_normal_page+0xe9/0xfd
 [<ffffffff80207860>] unmap_vmas+0x699/0xc68
 [<ffffffff8023b7dc>] exit_mmap+0x8f/0x10b
 [<ffffffff8023de91>] mmput+0x30/0x82
 [<ffffffff8021611d>] do_exit+0x2e5/0x91f
 [<ffffffff8024b11d>] cpuset_exit+0x0/0x88
 [<ffffffff8022bfcd>] get_signal_to_deliver+0x477/0x4aa
[Sun Feb  1 00:43:30 2015] [<ffffffff8025d368>] do_notify_resume+0x9c/0x7ba
 [<ffffffff80297fb3>] force_sig_info+0xae/0xb9
 [<ffffffff80268081>] do_page_fault+0x12c5/0x131b
 [<ffffffff8026fa63>] monotonic_clock+0x35/0x7b
 [<ffffffff80262df5>] thread_return+0x6c/0x113
 [<ffffffff8026076a>] retint_signal+0x5d/0xb7

swap_free: Bad swap offset entry b284022000000
swap_free: Bad swap offset entry 2a18a02a000000
Call Trace:
[Sun Feb  1 01:25:17 2015] <IRQ>  [<ffffffff802299ae>] __kfree_skb+0x9/0x1a
 [<ffffffff8885e36f>] :netbk:net_rx_action+0x7f1/0x8e2
 [<ffffffff88228286>] :e1000e:e1000e_poll+0x2ce/0x38b
 [<ffffffff8029299e>] tasklet_action+0x97/0x13b
 [<ffffffff80212f3a>] __do_softirq+0x8d/0x13b
 [<ffffffff80260da4>] call_softirq+0x1c/0x278
 [<ffffffff8026eb41>] do_softirq+0x31/0x90
 [<ffffffff802608d6>] do_hypervisor_callback+0x1e/0x2c
 <EOI>  [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
[Sun Feb  1 01:25:17 2015] [<ffffffff80270026>] raw_safe_halt+0x87/0xab
 [<ffffffff8026d52b>] xen_idle+0x38/0x4a
 [<ffffffff8024b23c>] cpu_idle+0x97/0xba
 [<ffffffff80764b11>] start_kernel+0x21f/0x224
 [<ffffffff807641e5>] _sinittext+0x1e5/0x1eb

Code: 8b 43 08 85 c0 7f 26 48 c7 c1 17 2e 4a 80 ba 99 00 00 00 48
RIP  [<ffffffff80425173>] skb_release_head_state+0x12/0xf8
 RSP <ffffffff807a1e20>
[Sun Feb  1 01:25:17 2015] <0>Kernel panic - not syncing: Fatal exception
 (XEN) Domain 0 crashed: rebooting machine in 5 seconds.

On further research, it seemed more and more likely that there could be a hardware error. I had not checked the BIOS error log at 3am, and dmidecode was not able to show the error log. The kernel we are running does not have an EDAC module that works with this chipset, so it could not be used to monitor for errors. I did not feel comfortable leaving the server running without checking the logs. We rebooted to look at the BIOS log, and no error was shown there.
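
For reference, on a kernel that does have a working EDAC driver for the memory controller, error counters are exposed through sysfs. The following is a minimal sketch of polling them, assuming the usual `/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count` layout; the function names are made up for the example, and none of this was available on fisher.

```python
import glob
import os


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} error counts from EDAC sysfs.

    Returns an empty dict when no memory controller is registered, which is
    what a kernel without a matching EDAC driver would give.
    """
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(edac_root, "mc[0-9]*"))):
        with open(os.path.join(mc_dir, "ce_count")) as f:
            corrected = int(f.read().strip())
        with open(os.path.join(mc_dir, "ue_count")) as f:
            uncorrected = int(f.read().strip())
        counts[os.path.basename(mc_dir)] = (corrected, uncorrected)
    return counts


def controllers_with_errors(counts):
    """Names of memory controllers reporting any corrected or uncorrected errors."""
    return [name for name, (ce, ue) in counts.items() if ce > 0 or ue > 0]
```

An empty result from `read_edac_counts` means only that nothing is being monitored, not that the memory is healthy.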

Normally we would just move the hard drives to a different chassis if there was any question about the hardware. However, our new standard server has fewer disks than fisher, so this was not possible.

There was a chassis with enough drive bays free. However, it had just had a new RAID controller installed, which had not been configured and which we are not yet familiar with, so I didn't want to muck with it. There was also a live server with enough free drive bays, but it was not located on the same network, so it could only have been used to read data from fisher's drives and not as a host for fisher's guests.

The shortest path to getting everyone back up seemed to be to run memtest and concurrently provision a new server in case it was needed. An hour of memtest did not show any issues. I considered leaving the server down and moving everyone one by one; however, that would have resulted in up to about 7 hours of downtime for the last guest. I could not justify this amount of downtime based on the available evidence, so fisher has been brought back up.

We will still be moving people off of fisher but will try to do so in a fashion that minimizes overall/unexpected downtime. We will contact you before any move.

more downtime on fisher

After investigating, there is a high probability (not 100% certain) that there is a memory problem on fisher. While the probability of any one person experiencing an error is extremely low, I am bringing down all the customers in order to rule out that possibility. We are in the process of identifying alternative space and will send an update when we have done so.

fisher.prgmr.com - Unplanned downtime

Fisher went down unexpectedly, and did not boot into the xen dom0 properly when it came back up. We will post more when we have all of the details.

One month's credit will be applied to all users on fisher later today.

About this Archive

This page is an archive of entries from February 2015 listed from newest to oldest.

December 2014 is the previous archive.

March 2015 is the next archive.
