Jewel is down again.

Please note: I'm adding new updates at the top. Jewel is up right now; read down for the history.

edit at 20:33: we've disabled hyperthreading, and the server is now on the new kernel and the e1000e ethernet adapter. It should be up and stable. The raid is still rebuilding, so expect less-than-stellar performance, but I think we are in okay shape for now.

We'll be moving people off this server as we bring more capacity up; email us if you want to move to the front of the line.

edit at 18:13: a new crash:

[  163.452490] physdev match: using --physdev-out in the OUTPUT, FORWARD and POSTROUTING chains for non-bridged traffic is not supported anymore.
[  163.452944] physdev match: using --physdev-out in the OUTPUT, FORWARD and POSTROUTING chains for non-bridged traffic is not supported anymore.
(XEN) ----[ Xen-4.0.1  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    2
(XEN) RIP:    e008:[<ffff82c480167617>] do_nmi_stats+0x27/0x110
(XEN) RFLAGS: 0000000000010602   CONTEXT: hypervisor
(XEN) rax: 0000000000000001   rbx: 0000000000000027   rcx: 0000000000000000
(XEN) rdx: ffff8304b9340000   rsi: ffff8304b9340000   rdi: ffff830459777c20
(XEN) rbp: ffff8304b9340000   rsp: ffff83043ff27cd8   r8:  ffff8300bf42c000
(XEN) r9:  ffff8304b9340000   r10: 1480000000000002   r11: 0000000000000246
(XEN) r12: ffff8304b9340000   r13: ffff8304b9340000   r14: 0000000000000000
(XEN) r15: 00000008a5b57027   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 0000000843101000   cr2: 00007f1c17d0aa60
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff83043ff27cd8:
(XEN)    0000000000000000 ffff82c48015bcd8 0000000000000000 ffff82c48022a400
(XEN)    0000000200000000 ffff8300bf42c000 ffff8304b9340000 0000000000000000
(XEN)    ffffffffffffffff 0000000000000027 ffff8304b9340000 00000000008430ff
(XEN)    ffff83084300b248 000000000084300b 00000008a5b57027 ffff82c48016465b
(XEN)    ffffffffffffffff ffff82c48015f903 0000000000000000 ffff8300bf42c000
(XEN)    0000000000000000 000000000084300b ffff8304b9340000 00000008a5b57027
(XEN)    000000000084300b 0000000000000000 0000000000000000 00000001ffffffff
(XEN)    ffff8304b9340000 000000000084300b 0000000000000000 00000000008430ff
(XEN)    ffff83084300b248 000000000084300b 00000008a5b57027 ffff82c480165d36
(XEN)    0000000000000002 ffff8300bf42c030 0000000000000282 0000000000000020
(XEN)    0000000000000000 0000000000000282 ffff8300bf42c000 ffff8300bf2fa000
(XEN)    00007ff000000001 0000000000000000 000002004f1028e8 ffff82f610860160
(XEN)    000000498011b816 ffff8304b9340000 ffff8304b9340000 ffff8300bf42c000
(XEN)    000000084300b067 ffff82c48025a100 ffff82c480145405 ffff8300bf42c030
(XEN)    0000000000097490 0000004980145405 000000084300b248 00000008a5b57027
(XEN)    ffff83043ff27f28 0000000000000002 0000000000000000 ffff83043ff27f28
(XEN)    0000000180372980 0000000000000001 ffff82c48025a080 ffff8300bf42c000
(XEN)    000000000096ff90 00000000008430ff 00000000000000f0 0000000000000aa1
(XEN)    0000000000097000 ffff82c4801e5169 0000000000097000 0000000000000aa1
(XEN)    00000000000000f0 00000000008430ff 000000000096ff90 000000001e1ff000
(XEN) Xen call trace:
(XEN)    [<ffff82c480167617>] do_nmi_stats+0x27/0x110
(XEN)    [<ffff82c48015bcd8>] get_page+0x28/0xf0
(XEN)    [<ffff82c48016465b>] mod_l1_entry+0x37b/0x9c0
(XEN)    [<ffff82c48015f903>] get_page_and_type_from_pagenr+0x93/0xf0
(XEN)    [<ffff82c480165d36>] do_mmu_update+0x9f6/0x1a70
(XEN)    [<ffff82c480145405>] reprogram_timer+0x55/0x90
(XEN)    [<ffff82c4801e5169>] syscall_enter+0xa9/0xae
(XEN)    
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 2:
(XEN) FATAL TRAP: vector = 6 (invalid opcode)
(XEN) ****************************************
(XEN) 


I've seen similar problems related to Intel hardware virtualization, so I disabled all hardware virtualization, and I also disabled hyperthreading. Booting again.


edit at 17:00: the kernel was upgraded some time ago, but the system still crashes when we start guests. We can't get it to crash without starting guests, so we're grasping at straws here. Now that we have the good kernel in place, we're going to use the onboard e1000e ethernet rather than the usb adapter. Nick is en route to the data center.

original post:

We're going to update the kernel to the latest version. The old one we were using was built for our AMD MCP55 systems, and it's running on a modern Intel server right now.


(XEN) ----[ Xen-3.4.1  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    2
(XEN) RIP:    e008:[<ffff828c801207d0>] compat_xen_version+0x3e0/0x420
[Sat Aug  6 14:14:12 2011](XEN) RFLAGS: 0000000000010082   CONTEXT: hypervisor
(XEN) rax: 00000000000003b8   rbx: ffff8308df024ce0   rcx: 0000000000000004
(XEN) rdx: 00002072659edae4   rsi: ffff830c3fdc7e88   rdi: ffff828c802855a0
(XEN) rbp: ffff8308df024cb0   rsp: ffff830c3fdc7ec8   r8:  0000000000012252
(XEN) r9:  0000000000000004   r10: 0000000000000005   r11: 0000000000000000
(XEN) r12: ffff8308df024d70   r13: 00002072659ec3c7   r14: 0000000000000000
(XEN) r15: ffff828c80221100   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 00000009030ae000   cr2: 00000000000000d4
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff830c3fdc7ec8:
[Sat Aug  6 14:14:12 2011](XEN)    ffff82ec80173b66 ffff828c8025f900 000000008025e900 ffff830c3fdc7f28
(XEN)    ffff828c8025e900 ffff828c8021f5b0 00002072659ec3c7 0000000000000000
(XEN)    ffff828c80138ed7 0000000000002000 ffff8300bf23c000 ffff8300bf0e4000
(XEN)    0000000001301c00 ffffffff8057d160 ffffffff8057c520 ffffffffffffffff
(XEN)    0000000000631918 0000000000000000 0000000000000246 0000000000631918
(XEN)    ffff880001939ee0 ffffffff805cbe48 0000000000000000 ffffffff802083aa
(XEN)    ffffffff80553f28 0000000000000000 0000000000000001 0000010000000000
(XEN)    ffffffff802083aa 000000000000e033 0000000000000246 ffffffff80553f10
(XEN)    000000000000e02b 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000002 ffff8300bf23c000
[Sat Aug  6 14:14:12 2011](XEN) Xen call trace:
(XEN)    [<ffff828c801207d0>] compat_xen_version+0x3e0/0x420
(XEN)    [<ffff828c80138ed7>] idle_loop+0x47/0xa0
(XEN)    
(XEN) Pagetable walk from 00000000000000d4:
(XEN)  L4[0x000] = 00000009030e8067 000000000001f2cc
(XEN)  L3[0x000] = 00000009030e7067 000000000001f2cd
(XEN)  L2[0x000] = 0000000000000000 ffffffffffffffff 
(XEN) 
(XEN) ****************************************
[Sat Aug  6 14:14:12 2011](XEN) Panic on CPU 2:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: 00000000000000d4
(XEN) ****************************************
(XEN) 
(XEN) Reboot in five seconds...
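
When comparing panics like the two in this post, it can help to pull just the call trace out of the saved console log. Here is a minimal sketch in Python, assuming the console output is plain text in the same format as the dumps in this post. The function name xen_call_trace is made up for illustration and is not part of any Xen tooling.

```python
import re

# Matches call-trace frames such as:
#   (XEN)    [<ffff82c480167617>] do_nmi_stats+0x27/0x110
# A console timestamp like "[Sat Aug  6 14:14:12 2011]" may precede "(XEN)".
FRAME = re.compile(
    r"\(XEN\)\s+\[<([0-9a-f]+)>\]\s+(\S+?)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def xen_call_trace(console_text):
    """Return (address, symbol, offset) tuples for each 'Xen call trace:' frame."""
    frames = []
    in_trace = False
    for line in console_text.splitlines():
        if "Xen call trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        match = FRAME.search(line)
        if match:
            address, symbol, offset = match.group(1), match.group(2), match.group(3)
            frames.append((address, symbol, int(offset, 16)))
        else:
            # First non-frame line ends the trace section.
            in_trace = False
    return frames
```

Feeding it each dump and comparing the symbol lists makes it quick to see that the two panics die in different places (compat_xen_version in the first, do_nmi_stats in the second).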


This page contains a single entry by luke published on August 6, 2011 3:00 PM.
