Re: AIO/DIO lockup/crash

From: Andrew Morton
Date: Mon Apr 28 2008 - 12:11:10 EST


On Mon, 28 Apr 2008 14:29:42 +0200 Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

> Hi guys,
>
> I'm getting this (and various variations thereof - like crashing in the
> PI chain code on -rt) when running aio-dio-invalidate-failure for a few
> hours.
>
> (dual core opteron - single spindle - ext3)
>
> Is this a known issue?
>
> I'll run the same on current -git overnight to see if it went away :-)
>
>
> [ 1796.238953] BUG: soft lockup - CPU#1 stuck for 11s! [aio-dio-invalid:3037]
> [ 1796.245794] CPU 1:
> [ 1796.247802] Modules linked in: autofs4 binfmt_misc ext2 psmouse evbug evdev i2c_piix4 i2c_core pcspkr thermal processor button sr_mod cdrom sg shpchp pci_hotplug sd_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd usbcore
> [ 1796.267532] Pid: 3037, comm: aio-dio-invalid Not tainted 2.6.24.4 #194
> [ 1796.274023] RIP: 0010:[<ffffffff804a7993>] [<ffffffff804a7993>] _spin_lock_irqsave+0x63/0x90
> [ 1796.282517] RSP: 0018:ffff81007fba7ce0 EFLAGS: 00000246
> [ 1796.287800] RAX: 0000000000000000 RBX: ffff81007fba7cf0 RCX: 0000000000001000
> [ 1796.294895] RDX: 0000000000000213 RSI: ffff810067dbc740 RDI: 0000000000000001
> [ 1796.301993] RBP: ffff81007fba7c60 R08: 0000000000000101 R09: 000000000169aa28
> [ 1796.309090] R10: 000000000169aa28 R11: 0000000000000003 R12: ffffffff8020d0c6
> [ 1796.316187] R13: ffff81007fba7c60 R14: ffff81007eaddc00 R15: ffff81007eaddf24
> [ 1796.323283] FS: 00002b489f45db00(0000) GS:ffff81007fb6cac0(0000) knlGS:0000000000000000
> [ 1796.331330] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 1796.337043] CR2: 00000000008c7f1c CR3: 0000000068610000 CR4: 00000000000006e0
> [ 1796.344140] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1796.351237] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 1796.358334]
> [ 1796.358334] Call Trace:
> [ 1796.362244] <IRQ> [<ffffffff802dee4a>] dio_bio_end_aio+0x3a/0xe0
> [ 1796.368405] [<ffffffff802dac79>] bio_endio+0x19/0x40
> [ 1796.373430] [<ffffffff8034fe8e>] req_bio_endio+0x4e/0xa0
> [ 1796.378800] [<ffffffff80350084>] __end_that_request_first+0x1a4/0x3c0
> [ 1796.385292] [<ffffffff803502a9>] end_that_request_chunk+0x9/0x10
> [ 1796.391354] [<ffffffff803e95fb>] scsi_end_request+0x3b/0x110
> [ 1796.397069] [<ffffffff803e99d5>] scsi_io_completion+0xa5/0x3b0
> [ 1796.402958] [<ffffffff804a7e06>] _spin_unlock_irqrestore+0x16/0x40
> [ 1796.409192] [<ffffffff803e3479>] scsi_finish_command+0x99/0xf0
> [ 1796.415079] [<ffffffff803ea515>] scsi_softirq_done+0x115/0x150
> [ 1796.420967] [<ffffffff803536db>] blk_done_softirq+0x6b/0x80
> [ 1796.426598] [<ffffffff802458c4>] __do_softirq+0x64/0xd0
> [ 1796.431883] [<ffffffff8020d61c>] call_softirq+0x1c/0x30
> [ 1796.437166] [<ffffffff8020efbd>] do_softirq+0x3d/0x90
> [ 1796.442276] [<ffffffff802457d8>] irq_exit+0x88/0xa0
> [ 1796.447213] [<ffffffff8020f095>] do_IRQ+0x85/0x100
> [ 1796.452064] [<ffffffff8020c971>] ret_from_intr+0x0/0xa
> [ 1796.457258] <EOI> [<ffffffff804a799e>] _spin_lock_irqsave+0x6e/0x90
> [ 1796.463678] [<ffffffff804a796e>] _spin_lock_irqsave+0x3e/0x90
> [ 1796.469479] [<ffffffff802ddded>] dio_bio_submit+0x2d/0x90
> [ 1796.474935] [<ffffffff802ddeee>] dio_send_cur_page+0x9e/0xa0
> [ 1796.480648] [<ffffffff802ddf2e>] submit_page_section+0x3e/0x130
> [ 1796.486623] [<ffffffff802deb39>] __blockdev_direct_IO+0x979/0xc50
> [ 1796.492783] [<ffffffff8806591f>] :ext3:ext3_direct_IO+0xaf/0x1c0
> [ 1796.498847] [<ffffffff88063ad0>] :ext3:ext3_get_block+0x0/0x110
> [ 1796.504825] [<ffffffff802851ba>] generic_file_direct_IO+0xba/0x160
> [ 1796.511059] [<ffffffff802852cf>] generic_file_direct_write+0x6f/0x130
> [ 1796.517551] [<ffffffff80285e13>] __generic_file_aio_write_nolock+0x383/0x440
> [ 1796.524650] [<ffffffff80285f34>] generic_file_aio_write+0x64/0xd0
> [ 1796.530802] [<ffffffff88060a26>] :ext3:ext3_file_write+0x26/0xc0
> [ 1796.536865] [<ffffffff88060a00>] :ext3:ext3_file_write+0x0/0xc0
> [ 1796.542841] [<ffffffff802cce4f>] aio_rw_vect_retry+0x6f/0x180
> [ 1796.548642] [<ffffffff802ccde0>] aio_rw_vect_retry+0x0/0x180
> [ 1796.554355] [<ffffffff802cda19>] aio_run_iocb+0x49/0x110
> [ 1796.559725] [<ffffffff802ce663>] io_submit_one+0x1d3/0x3f0
> [ 1796.565268] [<ffffffff802cf22e>] sys_io_submit+0xde/0x140
> [ 1796.570725] [<ffffffff8020c5dc>] tracesys+0xdc/0xe1

erk, that's dio->bio_lock, isn't it?

That lock is super-simple and hasn't changed in quite some time. If there
has been major memory wreckage and we're simply grabbing at a "lock" in
random memory then I'd expect the bug to manifest in different ways on
different runs?

I assume you have lots of runtime debugging options enabled.
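
Options of the kind meant here, in a 2.6.24-era .config, would be along
these lines. This is an illustrative selection and an assumption, not the
reporter's actual configuration:

```
# Illustrative only: assumed debugging options, not taken from the report.
CONFIG_DETECT_SOFTLOCKUP=y   # source of the "BUG: soft lockup" message
CONFIG_DEBUG_SPINLOCK=y      # catches uninitialised or corrupted spinlocks
CONFIG_PROVE_LOCKING=y       # lockdep: reports lock-ordering violations
CONFIG_DEBUG_SLAB=y          # poisons freed slab objects (SLAB allocator)
CONFIG_DEBUG_LIST=y          # sanity-checks linked-list manipulation
```

With slab poisoning and spinlock debugging on, a lock taken in freed or
overwritten memory would more likely show up as an explicit complaint than
as a silent spin.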
