Re: dma engine bugs

From: Chris Mason
Date: Thu Jan 17 2013 - 22:14:59 EST


On Thu, Jan 17, 2013 at 07:53:18PM -0700, Dan Williams wrote:
> On Thu, Jan 17, 2013 at 6:38 AM, Chris Mason <chris.mason@xxxxxxxxxxxx> wrote:
> > [ Sorry resend with the right address for Dan ]
> >
> > Hi Dan,
> >
> > I'm doing some benchmarking on MD raid5/6 on 4 fusionio cards in an HP
> > DL380p. I'm doing 128K random writes on a 4 drive raid6 with a 64K
> > stripe size per drive. I have 4 fio processes sending down the aio/dio,
> > and a high queue depth (8192).
> >
> > When I bump up the MD raid stripe cache size, I'm running into
> > soft lockups in the async memcopy code:
> >
> > [34336.959645] BUG: soft lockup - CPU#6 stuck for 22s! [fio:38296]
> > [34336.959648] BUG: soft lockup - CPU#9 stuck for 22s! [md0_raid6:5172]
> > [34336.959704] Modules linked in: raid456 async_raid6_recov async_pq
> > async_xor async_memcpy async_tx iomemory_vsl(O) binfmt_misc
> > cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq
> > mperf loop dm_mod coretemp kvm_intel kvm ghash_clmulni_intel sr_mod
> > cdrom aesni_intel ablk_helper cryptd lrw aes_x86_64 ata_generic xts
> > gf128mul ioatdma sb_edac gpio_ich ata_piix hid_generic dca edac_core
> > lpc_ich microcode serio_raw mfd_core hpilo hpwdt button container tg3 sg
> > acpi_power_meter usbhid mgag200 ttm drm_kms_helper drm i2c_algo_bit
> > sysimgblt sysfillrect syscopyarea uhci_hcd crc32c_intel ehci_hcd hpsa
> > processor thermal_sys scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc
> > scsi_dh_alua scsi_dh btrfs raid6_pq zlib_deflate xor libcrc32c asix
> > usbnet usbcore usb_common
> >
> > [34336.959709] CPU 9
> > [34336.959709] Pid: 5172, comm: md0_raid6 Tainted: G W O 3.7.1-1-default #2 HP ProLiant DL380p Gen8
> > [34336.959720] RIP: 0010:[<ffffffff815381ad>] [<ffffffff815381ad>] _raw_spin_unlock_irqrestore+0xd/0x20
> > [34336.959721] RSP: 0018:ffff8807af6db858 EFLAGS: 00000292
> > [34336.959722] RAX: 0000000000001000 RBX: ffff8810176fd000 RCX: 0000000000000292
> > [34336.959723] RDX: 0000000000001000 RSI: 0000000000000292 RDI: 0000000000000292
> > [34336.959724] RBP: ffff8807af6db858 R08: ffff881017e40440 R09: ffff880f554fabc0
> > [34336.959725] R10: 0000000000002000 R11: 0000000000000000 R12: ffff881017e40460
> > [34336.959726] R13: 0000000000000040 R14: 0000000000000001 R15: ffff881017e40480
> > [34336.959728] FS: 0000000000000000(0000) GS:ffff88103f660000(0000) knlGS:0000000000000000
> > [34336.959729] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [34336.959730] CR2: 00000000035cf458 CR3: 0000000001a0b000 CR4: 00000000000407e0
> > [34336.959731] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [34336.959733] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [34336.959734] Process md0_raid6 (pid: 5172, threadinfo ffff8807af6da000, task ffff88077d7725c0)
> > [34336.959735] Stack:
> > [34336.959738] ffff8807af6db898 ffffffff8114f287 ffff8807af6db8b8 0000000000000000
> > [34336.959740] 0000000000000000 00000000005bd84a ffff881015f2fa18 ffff881017632a38
> > [34336.959742] ffff8807af6db8e8 ffffffffa057adf4 0000000000000000 ffff881015f2fa18
> > [34336.959743] Call Trace:
> > [34336.959750] [<ffffffff8114f287>] dma_pool_alloc+0x67/0x270
> > [34336.959758] [<ffffffffa057adf4>] ioat2_alloc_ring_ent+0x34/0xc0 [ioatdma]
> > [34336.959761] [<ffffffffa057afc5>] reshape_ring+0x145/0x370 [ioatdma]
> > [34336.959764] [<ffffffff8153841d>] ? _raw_spin_lock_bh+0x2d/0x40
> > [34336.959767] [<ffffffffa057b2d9>] ioat2_check_space_lock+0xe9/0x240 [ioatdma]
> > [34336.959768] [<ffffffff81538381>] ? _raw_spin_unlock_bh+0x11/0x20
> > [34336.959771] [<ffffffffa057b48c>] ioat2_dma_prep_memcpy_lock+0x5c/0x280 [ioatdma]
> > [34336.959773] [<ffffffffa03102df>] ? do_async_gen_syndrome+0x29f/0x3d0 [async_pq]
> > [34336.959775] [<ffffffff81538381>] ? _raw_spin_unlock_bh+0x11/0x20
> > [34336.959790] [<ffffffffa057ac22>] ? ioat2_tx_submit_unlock+0x92/0x100 [ioatdma]
> > [34336.959792] [<ffffffffa02f8207>] async_memcpy+0x207/0x1000 [async_memcpy]
> > [34336.959795] [<ffffffffa031f67d>] async_copy_data+0x9d/0x150 [raid456]
> > [34336.959797] [<ffffffffa03206ba>] __raid_run_ops+0x4ca/0x990 [raid456]
> > [34336.959802] [<ffffffff811b7c42>] ? __aio_put_req+0x102/0x150
> > [34336.959805] [<ffffffffa031c7ae>] ? handle_stripe_dirtying+0x30e/0x440 [raid456]
> > [34336.959807] [<ffffffffa03217a8>] handle_stripe+0x528/0x10b0 [raid456]
> > [34336.959810] [<ffffffffa03226f0>] handle_active_stripes+0x1e0/0x270 [raid456]
> > [34336.959814] [<ffffffff81293bb3>] ? blk_flush_plug_list+0xb3/0x220
> > [34336.959817] [<ffffffffa03229a0>] raid5d+0x220/0x3c0 [raid456]
> > [34336.959822] [<ffffffff81413b0e>] md_thread+0x12e/0x160
> > [34336.959828] [<ffffffff8106bfa0>] ? wake_up_bit+0x40/0x40
> > [34336.959829] [<ffffffff814139e0>] ? md_rdev_init+0x110/0x110
> > [34336.959831] [<ffffffff8106b806>] kthread+0xc6/0xd0
> > [34336.959834] [<ffffffff8106b740>] ? kthread_freezable_should_stop+0x70/0x70
> > [34336.959849] [<ffffffff8154047c>] ret_from_fork+0x7c/0xb0
> > [34336.959851] [<ffffffff8106b740>] ? kthread_freezable_should_stop+0x70/0x70
> >
> > Since I'm running on fast cards, I assumed MD was just hammering on this
> > path so much that MD needed a cond_resched(). But now that I've
> > sprinkled cond_resched() pixie dust everywhere, I'm still seeing exactly the
> > same trace, and the lockups keep flowing forever, even after I've
> > stopped all new IO.
> >
> > Looking at ioat2_check_space_lock(), it is looping when the ring
> > allocation fails. We're trying to grow our ring with atomic allocations
> > and not giving up the CPU?
> >
> > I'm not sure what the right answer is for a patch. If it is safe for
> > the callers, we could add the cond_resched(), but we might still fail to
> > grow the ring.
>
> We're already doing a cond_resched() once per-stripe batch so that
> should be sufficient.

Sorry, where? I think we're looping in ioat2_check_space_lock(), hitting
the goto retry over and over again. It doesn't quite make sense, because
that would mean either a reshape is pending or reshape_ring() keeps
returning true.
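
If it helps, the shape of the loop I think we're stuck in is roughly the
following. This is only my paraphrase of the pattern in dma_v2.c, not the
literal code; ring_has_space(), another_cpu_is_reshaping() and bigger_order
are made-up stand-ins for the real checks:

 retry:
        spin_lock_bh(&ioat->prep_lock);
        if (ring_has_space(ioat, num_descs))
                return 0;       /* caller proceeds with prep_lock held */
        spin_unlock_bh(&ioat->prep_lock);

        /* another cpu is mid-reshape: busy-wait on it */
        if (another_cpu_is_reshaping(ioat))
                goto retry;

        /* try to grow the ring; the new descriptors come from atomic
         * dma_pool_alloc() calls, so this can keep failing under load */
        if (reshape_ring(ioat, bigger_order))
                goto retry;

        return -ENOMEM;         /* supposed to make the prep call fail */

Neither goto gives up the cpu, so if we keep taking one of those branches
we spin here forever, which would match the soft lockups above.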

>
> >
> > Would it be better to fallback to synchronous operations if we can't get
> > into the ring?
>
> Falling back is supposed to be the default in the memcpy path. We try
> to expand the ring and retry the slot allocation if the ring gets
> bigger, otherwise fail the allocation and do it synchronously. Is it
> always the async_memcpy path, does async_gen_syndrome get in on the
> action? I will take a closer look at the code tonight.

Always the async_memcpy. As far as I can tell, we never leave
ioat2_check_space_lock() once it triggers.
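
For reference, my understanding of the fallback Dan describes is roughly
this (a simplified sketch of the async_memcpy flow, not the exact code;
dest_buf/src_buf stand in for the kmapped pages):

        tx = device->device_prep_dma_memcpy(chan, dma_dest, dma_src,
                                            len, flags);
        if (tx) {
                /* hand the copy off to the dma engine */
                async_tx_submit(chan, tx, submit);
        } else {
                /* prep failed: run the copy synchronously on the cpu */
                memcpy(dest_buf, src_buf, len);
                async_tx_sync_epilog(submit);
        }

The trouble is that prep never gets a chance to return NULL here, because
ioat2_dma_prep_memcpy_lock() is stuck inside ioat2_check_space_lock() and
never comes back with a failure.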

-chris
