nick: June 2011 Archives

lozenges crash

Lozenges crashed this morning after we went home from moving some of the other servers to Market Post Tower. It looks like a bug in Xen 4.0.1:
 (XEN) Xen BUG at page_alloc.c:1204
(XEN) ----[ Xen-4.0.1  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82c4801159f2>] free_domheap_pages+0x1f2/0x380
(XEN) RFLAGS: 0000000000010206   CONTEXT: hypervisor
(XEN) rax: 007fffffffffffff   rbx: ffff83007bfd0000   rcx: 0000000000000000
(XEN) rdx: ffff82f600a7f660   rsi: 0000000000000000   rdi: ffff83007bfd0014
(XEN) rbp: ffff830051d13000   rsp: ffff82c48035fa58   r8:  0000000000000000
(XEN) r9:  0000000000000000   r10: ffff83007bfd0018   r11: 0000000000000000
(XEN) r12: 0000000000000001   r13: ffff82f600a7f660   r14: 0000000000000000
(XEN) r15: ffff83007bfd0014   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) cr3: 000000031280c000   cr2: ffff8800369c0358
(XEN) ds: 0000   es: 0000   fs: 0063   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff82c48035fa58:
(XEN)    000000017bfd0000 ffff830051d12f08 ffff830051d13000 ffff83007bfd0000
(XEN)    ffff83007bfd0000 0000000000000000 ffff830000000000 ffff82c48015ce25
(XEN)    0000000002150c70 000000003e2c7d38 0000000000000156 1400000000000001
(XEN)    ffff82f600a3a240 1400000000000001 ffff82c48035ff28 0000000000000000
(XEN)    ffff830000000000 ffff82c48015d179 0000000100000246 ffff82f600a3a240
(XEN)    000000000007ea58 ffff82f600fd4b00 ffff83007bfd0000 ffff83007ea58000
(XEN)    ffff82c48035ff28 ffff82c48015c951 0000000000000052 000000000007ea58
(XEN)    ffff82f600fd4b00 ffff83007bfd0000 ffff83007ea58000 ffff82c48015cbef
(XEN)    0000000100db89c0 0000000000000000 0000000000000156 2400000000000001
(XEN)    ffff82f600fd4b00 2400000000000001 ffff82c48035ff28 0000000000000001
(XEN)    ffff830000000000 ffff82c48015d179 0000000100000242 ffff82f600fd4b00
(XEN)    ffff83007ea59000 ffff82f600fd4b20 0000000000000000 000000000007ea59
(XEN)    ffff83007ea59000 ffff82c48015d987 0000000000000000 ffff83007ea59000
(XEN)    ffff82f600fd4b20 ffff82c48015cd79 0000000100000000 00000000000534b0
(XEN)    ffff83007bfd0000 3400000000000001 ffff82f600fd4b20 3400000000000001
(XEN)    ffff82c48035ff28 0000000000000001 ffff830000000000 ffff82c48015d179
(XEN)    ffff82f600a7bcc0 ffff82f600fd4b20 0000000000000000 ffff82f600fd4e80
(XEN)    000000000007ea74 ffff83007bfd0000 ffff83007ea74000 ffff82c48015d825
(XEN)    0000000000000001 0000000000000140 0000000000000000 ffff82c48015cb41
(XEN)    0000000100a7bcc0 00000000ffffffe0 0000000000000156 4400000000000001
(XEN) Xen call trace:
(XEN)    [<ffff82c4801159f2>] free_domheap_pages+0x1f2/0x380
(XEN)    [<ffff82c48015ce25>] free_page_type+0x4c5/0x670
(XEN)    [<ffff82c48015d179>] __put_page_type+0x1a9/0x290
(XEN)    [<ffff82c48015c951>] put_page_from_l2e+0xe1/0xf0
(XEN)    [<ffff82c48015cbef>] free_page_type+0x28f/0x670
(XEN)    [<ffff82c48015d179>] __put_page_type+0x1a9/0x290
(XEN)    [<ffff82c48015d987>] put_page_from_l3e+0x157/0x170
(XEN)    [<ffff82c48015cd79>] free_page_type+0x419/0x670
(XEN)    [<ffff82c48015d179>] __put_page_type+0x1a9/0x290
(XEN)    [<ffff82c48015d825>] put_page_from_l4e+0xd5/0xe0
(XEN)    [<ffff82c48015cb41>] free_page_type+0x1e1/0x670
(XEN)    [<ffff82c48015d179>] __put_page_type+0x1a9/0x290
(XEN)    [<ffff82c48014be85>] relinquish_memory+0x1e5/0x500
(XEN)    [<ffff82c48014c64d>] domain_relinquish_resources+0x1ad/0x280
(XEN)    [<ffff82c480106250>] domain_kill+0x80/0xf0
(XEN)    [<ffff82c48010430e>] do_domctl+0x1be/0xff0
(XEN)    [<ffff82c48011bc70>] get_cpu_idle_time+0x20/0x30
(XEN)    [<ffff82c4801e5169>] syscall_enter+0xa9/0xae
(XEN)   
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Xen BUG at page_alloc.c:1204
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
So hopefully it's fixed in 4.0.2, which just came out. Meanwhile, it looks like everybody is back up, but we will probably not put new customers on lozenges for a while. Please email support@prgmr.com if you are still having trouble. Thanks.
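
If you want to check which hypervisor version a host is running, xm info reports it. A generic sketch, not actual output from lozenges:

# on the dom0, show the running hypervisor version and changeset
xm info | grep xen_
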
We are planning to move the servers apples, cerberus, bowl and branch 
on Tuesday evening this week (June 21, 2011) to the new data center.
Expect the downtime to start sometime after 9PM PDT, and everything
should be back up by midnight. Branch needs a new disk also, so we are
going to take it down earlier and rebuild the raid with the new disk before
the move. It will be down starting at 7PM PDT, and when it finishes
rebuilding we will start shutting down the other 3 servers.

If that goes well, hopefully we can move the second group of servers from the eliteweb cabinet on Wednesday night (knife, cauldron and council). See http://book.xen.prgmr.com/mediawiki/index.php/EGI_Moving
If you have any questions, please email support@prgmr.com. A response might be delayed if there is a lot of support email to answer from the earlier move, or just a backlog of regular tickets. Thanks!

Update: We're starting later for the first move. Now it will start more like midnight or after.

Update 00:08: We are rebooting and rebuilding branch before the move. Expect branch downtime to start in a few minutes, and downtime for the other servers to start in maybe three hours.

Update 00:35: Branch is down and the RAID is rebuilding. No other servers are rebooting yet.

Update 05:54: Branch is done rebuilding; apples, cerberus, and bowl are down for the move.

Update 07:29: All servers are coming back up, but all network connectivity is down. At this point, my provider seems to think the problem is somewhere in between, and not my network, but it's too early to say.

Update 08:26: Cerberus doesn't want to boot because I removed a bad drive (a spare had already replaced it in the RAID). I put the drive back and it boots. Bowl is also down; I don't know what that problem is yet.

Update 08:38: Bowl and branch are both starting xendomains. The problem with bowl was that an upgrade blew out the menu.lst file; the problem with branch was that we had incorrectly set "power on after power failure" to off.
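
For context, menu.lst is the grub-legacy boot menu; a Xen dom0 entry looks roughly like the sketch below. The kernel and initrd paths here are placeholders, not bowl's actual config:

# /boot/grub/menu.lst (example entry only)
title Xen dom0 (example)
        root (hd0,0)
        kernel /boot/xen.gz dom0_mem=512M
        module /boot/vmlinuz-2.6.18-xen ro root=/dev/md0
        module /boot/initrd-2.6.18-xen.img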

Update 08:51: All users should now be up and running; complain otherwise.




Halter has two failed disks. They are on different mirrors, so there is no data loss, but it really needs to be taken care of. Because halter doesn't have an AHCI SATA chipset, we need to reboot it to detect the new disks. We will check on VPSes that don't boot up by themselves, but if you still have problems, email support@prgmr.com. Thanks! -Nick

We are attempting to fix this with the next kernel revision. If nothing else, we'll be cycling that hardware out fairly soon anyhow.
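
For comparison, on a controller with working AHCI hotplug you can usually pick up a swapped disk without a reboot by rescanning the SATA/SCSI host. A rough sketch, where the host number and device names are only examples:

# ask the kernel to rescan a SATA/SCSI host for new devices
# (host0 is an example; look in /sys/class/scsi_host/ for the right one)
echo "- - -" > /sys/class/scsi_host/host0/scan

# confirm the new disk showed up, then add it back into the mirror
cat /proc/partitions
mdadm /dev/md1 --add /dev/sdc2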

edit at 1:01: guest domains are going down now.

edit at 2:09: drives are replaced and rebuilding. We will bring customers online in approx. an hour.


Personalities : [raid1]
md1 : active raid1 sdc2[2] sdb2[1]
      477901504 blocks [2/1] [_U]
      [==>..................]  recovery = 13.4% (64153152/477901504) finish=65.6min speed=105075K/sec

md2 : active raid1 sda2[2] sdd2[1]
      477901504 blocks [2/1] [_U]
      [==>..................]  recovery = 11.2% (53639488/477901504) finish=75.6min speed=93412K/sec
 
This is one of the old servers that uses two RAID 1 arrays striped together with LVM rather than a single RAID 10. RAID 10 rebuilds /much/ faster under load than the striped RAID 1 setup we used on halter and all of the older servers. Considering that this machine lost two disks in as many weeks, we will keep it down until it's done rebuilding. (The rebuild would take 10x-15x as long if we did it while customers were online.)
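
The kernel also throttles md rebuilds to leave bandwidth for normal I/O, which is part of why rebuilding with customers online is so much slower. The knobs look roughly like this; the 50000 below is an example value, not our actual setting:

# current md rebuild throttling, in KB/s per device
cat /proc/sys/dev/raid/speed_limit_min
cat /proc/sys/dev/raid/speed_limit_max

# with the box otherwise idle, let the rebuild use more bandwidth
# (a 50000 KB/s floor is just an example)
echo 50000 > /proc/sys/dev/raid/speed_limit_min

# watch progress
watch cat /proc/mdstat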


edit at 2:27: ignore my "approx. an hour" comment. Current mdstat output:

                                                                                
Personalities : [raid1]
md1 : active raid1 sdc2[2] sdb2[1]
      477901504 blocks [2/1] [_U]
      [======>..............]  recovery = 34.2% (163703680/477901504) finish=93.9min speed=55756K/sec

md2 : active raid1 sda2[2] sdd2[1]
      477901504 blocks [2/1] [_U]
      [======>..............]  recovery = 31.0% (148422272/477901504) finish=98.2min speed=55877K/sec


edit at 4:43:  the disks are finally done rebuilding and customers are coming back up.  

edit at 4:48:  halter is back, all users on halter are back.  

We need to give all of you credit.
