luke: June 2011 Archives

crock hung approx. 12 hours ago

I rebooted it just now and it's back.    I will take steps to prevent myself from sleeping through outages like this.

I believe the crash was related to the "BUG: soft lockup" problem that has been plaguing us for some time.   I am testing a new kernel that I hope will solve this problem.

 http://xenbits.xen.org/people/mayoung/testing/SRPMS/

edit: please note, this is unrelated to the wiki.xen.prgmr.com outage

crock hung approx. 12 hours ago

I rebooted it just now and it's back.   Sorry.   We will be giving people a credit of 25% of a month for this outage.

server move tonight

Affected customers should have gotten email; this is our status update.  Downtime is about to begin for knife, council, and cauldron.

edit at 00:57: customers are coming back up now.   If you wish to log in to your console server and manually start your server, you can do so now; otherwise your box should be started within the next 20 minutes or so.

edit at 01:09: knife is completely up

edit at 01:14: council and cauldron are both up.  

also note, we won't be moving any more servers for at least 48 hours.   

network outage this morning

right now, my upstream believes the problem is not my equipment;  they gave me a 30 minute ETA, but it sounds like it's too early to tell what the problem is.   


As of this moment, we believe the problem is unrelated to the server move, but like I said, this is very early in the process and it's possible that's wrong.  I will update in 30 minutes.

update: 08:06  we appear to be back.   I will update when we know more about what the actual problem was.  

update: 8:25:  bowl and cerberus are still down, working on it...  bowl and cerberus are related to last night's move.


update: 9:57

my upstream says:
"Appologies,
There have been multiple links affected by this fiber cut,
It is still in progress of being repaired, however the primarly link to SVTIX should be working now.

We are still waiting for details on the cause of the disruption.
"


update at 2011-06-25:
" Sorry for the delay, the lates news we have is "I can tell you right now that there was a fiber cut due to construction excavating to put in a new water main. ", and we already request a RFO for this outage, and it take 7-10 business day for them to process.
"

cauldron crash

cauldron crashed three hours ago.  It's coming back up now.  Investigation to follow.


BUG: soft lockup detected on CPU#0!

[Wed Jun  1 15:55:45 2011]Call Trace:
 <IRQ> [<ffffffff8025758a>] softlockup_tick+0xce/0xe0
 [<ffffffff8020df48>] timer_interrupt+0x3a0/0x3fa
 [<ffffffff80257874>] handle_IRQ_event+0x4e/0x96
 [<ffffffff80257960>] __do_IRQ+0xa4/0x105
 [<ffffffff8020bd5c>] do_IRQ+0x44/0x4d
 [<ffffffff8034c980>] evtchn_do_upcall+0x19e/0x250
 [<ffffffff80209d8e>] do_hypervisor_callback+0x1e/0x2c
 <EOI> [<ffffffff803581ea>] show_rd_sect+0x0/0x68
 [<ffffffff802ebbfc>] __read_lock_failed+0x8/0x14
[Wed Jun  1 15:55:45 2011] [<ffffffff80343f3e>] get_device+0x17/0x20
 [<ffffffff803fc3fd>] .text.lock.spinlock+0x53/0x8a
 [<ffffffff80358211>] show_rd_sect+0x27/0x68
 [<ffffffff802bc351>] sysfs_read_file+0xa5/0x12e
 [<ffffffff8027e3f5>] vfs_read+0xcb/0x171
 [<ffffffff8027e7d4>] sys_read+0x45/0x6e
 [<ffffffff802097b2>] tracesys+0xab/0xb5


edit:  all users on cauldron are up and back online.  

About this Archive

This page is an archive of recent entries written by luke in June 2011.
