Re: [PATCH] nvme-rdma: complete requests from ->timeout

From: Sagi Grimberg
Date: Fri Dec 07 2018 - 15:05:43 EST



> Could you please take a look at this bug and review the code?

> We are seeing more instances of this bug and have found that
> reconnect_work can hang as well, as can be seen from the stack trace
> below.

> Workqueue: nvme-wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
> Call Trace:
> __schedule+0x2ab/0x880
> schedule+0x36/0x80
> schedule_timeout+0x161/0x300
> ? __next_timer_interrupt+0xe0/0xe0
> io_schedule_timeout+0x1e/0x50
> wait_for_completion_io_timeout+0x130/0x1a0
> ? wake_up_q+0x80/0x80
> blk_execute_rq+0x6e/0xa0
> __nvme_submit_sync_cmd+0x6e/0xe0
> nvmf_connect_admin_queue+0x128/0x190 [nvme_fabrics]
> ? wait_for_completion_interruptible_timeout+0x157/0x1b0
> nvme_rdma_start_queue+0x5e/0x90 [nvme_rdma]
> nvme_rdma_setup_ctrl+0x1b4/0x730 [nvme_rdma]
> nvme_rdma_reconnect_ctrl_work+0x27/0x70 [nvme_rdma]
> process_one_work+0x179/0x390
> worker_thread+0x4f/0x3e0
> kthread+0x105/0x140
> ? max_active_store+0x80/0x80
> ? kthread_bind+0x20/0x20

> This bug is reproduced by setting the MTU of the RoCE interface to 568
> while running I/O traffic.

I think that with the latest changes from Keith we can no longer rely
on blk-mq to barrier racing completions. We will probably need
to barrier ourselves in nvme-rdma...
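
To make that concrete, below is a rough sketch of one way such a barrier
could look: an atomic per-request flag so that exactly one of the normal
completion path and the ->timeout path completes the request. Note that
the flags field and the NVME_RDMA_REQ_COMPLETED bit are invented for
this sketch and do not exist in the driver today; this is only the shape
of the idea, not a proposed patch.

/* Sketch against drivers/nvme/host/rdma.c (as of ~4.20). */

enum {
	NVME_RDMA_REQ_COMPLETED = 0,	/* hypothetical flag bit */
};

struct nvme_rdma_request {
	/* ...existing fields... */
	unsigned long flags;		/* hypothetical, for this sketch */
};

/* Normal completion path, e.g. from the recv CQE handler. */
static void nvme_rdma_end_request_once(struct nvme_rdma_request *req,
		__le16 status, union nvme_result result)
{
	struct request *rq = blk_mq_rq_from_pdu(req);

	/* Whoever sets the bit first owns the completion. */
	if (test_and_set_bit(NVME_RDMA_REQ_COMPLETED, &req->flags))
		return;
	nvme_end_request(rq, status, result);
}

/*
 * ->timeout path: complete the request ourselves instead of relying
 * on blk-mq to fence a racing normal completion.
 */
static enum blk_eh_timer_return nvme_rdma_timeout(struct request *rq,
		bool reserved)
{
	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);

	if (test_and_set_bit(NVME_RDMA_REQ_COMPLETED, &req->flags))
		return BLK_EH_DONE;	/* normal completion won the race */

	nvme_req(rq)->status = NVME_SC_ABORT_REQ;
	blk_mq_complete_request(rq);
	return BLK_EH_DONE;
}

A real version would also have to cover the error recovery and queue
teardown paths and clear the flag when a request is reused, so take this
only as an illustration of the synchronization we'd need.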