Holmes crash due to out of SW-IOMMU space

We brought everyone back up after Sunday's woes, and shortly afterwards holmes crashed in much the same way its partner taft did in February (see http://blog.prgmr.com/xenophilia/2013/02/taft-rebooted-itself.html).

The domains are coming back now, or should already be up. This downtime was probably on the order of half an hour.

We followed the advice in this email (http://old-list-archives.xenproject.org/archives/html/xen-devel/2007-09/msg00140.html) and doubled the size of the SW-IOMMU from 64 MB to 128 MB by adding swiotlb=128 to the dom0 command line.

Before:
May 28 23:31:04 holmes kernel: Software IO TLB enabled:
May 28 23:31:04 holmes kernel:  Aperture:     64 megabytes

After:
May 29 01:13:27 holmes kernel: Software IO TLB enabled:
May 29 01:13:27 holmes kernel:  Aperture:     128 megabytes
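
To confirm after a reboot that the new setting took effect, the aperture line can be pulled out of the boot log. The following is only a minimal sketch: the function name `aperture_mb` is made up for illustration, and it assumes the kernel prints the aperture in exactly the format quoted above.

```python
import re

# Matches the aperture line as it appears in the boot messages quoted above.
APERTURE_RE = re.compile(r"Aperture:\s+(\d+)\s+megabytes")

def aperture_mb(log_text):
    """Return every SW-IOMMU aperture size (in MB) found in log_text, in order."""
    return [int(m.group(1)) for m in APERTURE_RE.finditer(log_text)]

before = (
    "May 28 23:31:04 holmes kernel: Software IO TLB enabled:\n"
    "May 28 23:31:04 holmes kernel:  Aperture:     64 megabytes\n"
)
after = (
    "May 29 01:13:27 holmes kernel: Software IO TLB enabled:\n"
    "May 29 01:13:27 holmes kernel:  Aperture:     128 megabytes\n"
)

old_size = aperture_mb(before)[0]
new_size = aperture_mb(after)[0]
print(f"aperture went from {old_size} MB to {new_size} MB")
```

In practice the same function could be fed the contents of the syslog file instead of the two inline strings.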

This was the crash:

PCI-DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:04:00.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:04:00.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:04:00.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:04:00.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:04:00.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:04:00.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:04:00.0
[Wed May 29 08:29:09 2013]PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:04:00.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:04:00.0
PCI-DMA: Out of SW-IOMMU space for 16384 bytes at device 0000:04:00.0
Unable to handle kernel paging request at ffff880040000010 RIP:
 [<ffffffff880e6ffd>] :mpt2sas:_scsih_qcmd+0x44a/0x6e9
PGD 141f067 PUD 1621067 PMD 1622067 PTE 0
Oops: 0000 [1] SMP
last sysfs file: /block/md0/md/metadata_version
CPU 0
Modules linked in: ebt_arp ebt_ip ebtable_filter ebtables xt_physdev netloop netbk blktap blkbk ipt_MASQUERADE iptable_nat ip_nat bridge lockd sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i libcxgbi cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi ac parport_pc lp parport i2c_i801 i2c_core sr_mod cdrom joydev sg pcspkr e1000e serial_core tpm_tis tpm tpm_bios serio_raw dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage ahci libata raid10 shpchp mpt2sas scsi_transport_sas sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 464, comm: md0_raid1 Not tainted 2.6.18-348.6.1.el5xen #1
RIP: e030:[<ffffffff880e6ffd>]  [<ffffffff880e6ffd>] :mpt2sas:_scsih_qcmd+0x44a/0x6e9
RSP: e02b:ffffffff807a1d20  EFLAGS: 00010002
RAX: 0000000000000008 RBX: 0000000000000003 RCX: ffffffff880d7057
RDX: 2020202020202045 RSI: 0000000034202020 RDI: ffff88003e71fed8
RBP: ffff880040000000 R08: ffff880006658000 R09: ffff880006658000
R10: 0000001796adc000 R11: 0000000000000000 R12: ffff8800220e6e40
R13: ffff88003e7222f8 R14: ffff88003e9704f8 R15: ffff88003e71fee0
FS:  00002b3ead0801e0(0000) GS:ffffffff80639000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
[Wed May 29 08:29:09 2013]Process md0_raid1 (pid: 464, threadinfo ffff88003e1a0000, task ffff88003ee6c080)
Stack:  ffff88003f721418  ffff88003efe1000  00000bc18085b280  ffff88003ea5e0c0
 ffff88003ea5e080  94000000d5000000  fffffff4fffe3680  0000000300000bc1
 140000000000000f  ffff8800220e6e40
Call Trace:
 <IRQ>  [<ffffffff88084dbb>] :scsi_mod:scsi_dispatch_cmd+0x2ac/0x366
 [<ffffffff8808a506>] :scsi_mod:scsi_request_fn+0x2c7/0x39e
 [<ffffffff8025ebf5>] blk_run_queue+0x41/0x73
 [<ffffffff880893cd>] :scsi_mod:scsi_next_command+0x2d/0x39
 [<ffffffff8808954e>] :scsi_mod:scsi_end_request+0xbf/0xcd
[Wed May 29 08:29:09 2013] [<ffffffff880896da>] :scsi_mod:scsi_io_completion+0x17e/0x339
 [<ffffffff80264929>] _spin_lock_irqsave+0x9/0x14
 [<ffffffff880b67ce>] :sd_mod:sd_rw_intr+0x21e/0x258
 [<ffffffff88089956>] :scsi_mod:scsi_device_unbusy+0x67/0x81
 [<ffffffff80239541>] blk_done_softirq+0x67/0x75
 [<ffffffff80212f3a>] __do_softirq+0x8d/0x13b
 [<ffffffff80260da4>] call_softirq+0x1c/0x278
 [<ffffffff8026eb41>] do_softirq+0x31/0x90
 [<ffffffff802608d6>] do_hypervisor_callback+0x1e/0x2c
 <EOI>  [<ffffffff8034526a>] cfq_latter_request+0x0/0x1e
 [<ffffffff8021cb71>] generic_make_request+0xb6/0x228
 [<ffffffff88075bda>] :raid1:flush_pending_writes+0x6d/0x8e
 [<ffffffff88076bf4>] :raid1:raid1d+0x39/0xbc3
 [<ffffffff8026082b>] error_exit+0x0/0x6e
 [<ffffffff8026082b>] error_exit+0x0/0x6e
 [<ffffffff8029edfd>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8029edfd>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8026365a>] schedule_timeout+0x1e/0xad
 [<ffffffff80264929>] _spin_lock_irqsave+0x9/0x14
 [<ffffffff8029edfd>] keventd_create_kthread+0x0/0xc4
 [<ffffffff804153c7>] md_thread+0xf8/0x10e
 [<ffffffff8029f015>] autoremove_wake_function+0x0/0x2e
 [<ffffffff804152cf>] md_thread+0x0/0x10e
 [<ffffffff80233eb3>] kthread+0xfe/0x132
 [<ffffffff80260b2c>] child_rip+0xa/0x12
 [<ffffffff8029edfd>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80233db5>] kthread+0x0/0x132
 [<ffffffff80260b22>] child_rip+0x0/0x12
Code: 48 8b 55 10 49 8b 8e c8 03 00 00 75 06 8b 74 24 2c eb 04 8b
RIP  [<ffffffff880e6ffd>] :mpt2sas:_scsih_qcmd+0x44a/0x6e9
 RSP <ffffffff807a1d20>
CR2: ffff880040000010
 <0>Kernel panic - not syncing: Fatal exception
 (XEN) Domain 0 crashed: rebooting machine in 5 seconds.
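
When watching for a repeat of this failure, it can help to tally the "Out of SW-IOMMU space" messages per device. The sketch below is illustrative only: the function name `summarize_failures` is hypothetical, the sample is three lines copied from the log above, and the pattern assumes the message format shown there.

```python
import re
from collections import defaultdict

# Matches the failure message format seen in the crash log above.
FAIL_RE = re.compile(r"Out of SW-IOMMU space for (\d+) bytes at device (\S+)")

def summarize_failures(log_text):
    """Map each PCI device to (number of failed mappings, total bytes requested)."""
    totals = defaultdict(lambda: [0, 0])
    for m in FAIL_RE.finditer(log_text):
        size, device = int(m.group(1)), m.group(2)
        totals[device][0] += 1
        totals[device][1] += size
    return {device: tuple(pair) for device, pair in totals.items()}

# Three lines copied from the console log, used here as sample input.
sample = (
    "PCI-DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:04:00.0\n"
    "PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:04:00.0\n"
    "PCI-DMA: Out of SW-IOMMU space for 16384 bytes at device 0000:04:00.0\n"
)

for device, (count, total) in summarize_failures(sample).items():
    print(f"{device}: {count} failed mappings, {total} bytes requested")
```

A message that the console logger has split mid-word across two lines will not match the pattern, so wrapped lines should be rejoined before counting.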

About this Entry

This page contains a single entry by srn published on May 29, 2013 1:08 AM.
