Re: [Nbd] NBD: Disconnect connection/kill NBD server cause kernel bug even kernel hang

From: Sheng Yang
Date: Mon Sep 21 2015 - 18:13:20 EST


Thank you Wouter!

(Add back lkml since I got another kernel issue...)

In fact I just hit another issue with nbd-server/client on the same
machine. The following commands hang after a while:

dd if=/dev/zero of=./test

When process hanging, dd/nbd-server are no long showing in high cpu
utilization. Ctrl-C or any operation on the NBD device mount point
won't work and hang immediately.

I enabled kernel lockup detection and found this:

[ 240.127132] INFO: task kworker/u8:1:28 blocked for more than 120 seconds.
[ 240.129350] Not tainted 4.3.0-rc2+ #13
[ 240.131392] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 240.135290] kworker/u8:1 D ffff88013fc95f58 12160 28 2 0x00000000
[ 240.138358] Workqueue: writeback wb_workfn (flush-43:0)
[ 240.140886] ffff88013a46b5d8 0000000000000082 ffffffff8106bd3a
ffffffff00000000
[ 240.144799] ffff8800bafe3d80 ffff88013a460000 ffff88013a46c000
ffff88013fc95f40
[ 240.148712] 7fffffffffffffff ffff880139524e30 ffff880139524e60
ffff88013a46b5f0
[ 240.152568] Call Trace:
[ 240.153915] [<ffffffff8106bd3a>] ? __queue_delayed_work+0x9a/0x190
[ 240.156202] [<ffffffff8191734e>] schedule+0x2e/0x70
[ 240.158188] [<ffffffff8191b41c>] schedule_timeout+0x1bc/0x2b0
[ 240.160373] [<ffffffff810f677a>] ? __delayacct_blkio_start+0x1a/0x30
[ 240.162714] [<ffffffff810f677a>] ? __delayacct_blkio_start+0x1a/0x30
[ 240.164964] [<ffffffff810446de>] ? kvm_clock_get_cycles+0x1e/0x20
[ 240.167126] [<ffffffff810be5b0>] ? ktime_get+0x90/0x110
[ 240.169074] [<ffffffff810f677a>] ? __delayacct_blkio_start+0x1a/0x30
[ 240.171301] [<ffffffff8191677f>] io_schedule_timeout+0x9f/0x110
[ 240.173409] [<ffffffff813193d5>] get_request+0x425/0x7d0
[ 240.175385] [<ffffffff81318fb0>] ?
trace_event_raw_event_block_get_rq+0xe0/0xe0
[ 240.178256] [<ffffffff813201c8>] ? ioc_lookup_icq+0x88/0xd0
[ 240.180206] [<ffffffff81092990>] ? wait_woken+0x80/0x80
[ 240.182038] [<ffffffff8131ceeb>] blk_queue_bio+0xcb/0x2f0
[ 240.183948] [<ffffffff8131870f>] generic_make_request+0xaf/0xf0
[ 240.185932] [<ffffffff813187b2>] submit_bio+0x62/0x140
[ 240.187778] [<ffffffff8113583b>] ? __test_set_page_writeback+0x15b/0x240
[ 240.189837] [<ffffffff81216699>] ext4_io_submit+0x39/0x50
[ 240.191648] [<ffffffff8121685a>] ext4_bio_write_page+0x18a/0x3b0
[ 240.193548] [<ffffffff8120da18>] mpage_submit_page+0x58/0x80
[ 240.195405] [<ffffffff8120dc75>] mpage_map_and_submit_buffers+0x145/0x260
[ 240.197481] [<ffffffff81212d5a>] ext4_writepages+0x5ba/0xcb0
[ 240.199335] [<ffffffff813d5291>] ? vp_notify+0x11/0x20
[ 240.201590] [<ffffffff811367cc>] do_writepages+0x1c/0x30
[ 240.203935] [<ffffffff811b9820>] __writeback_single_inode+0x50/0x300
[ 240.206589] [<ffffffff8191c012>] ? _raw_spin_unlock+0x22/0x30
[ 240.209054] [<ffffffff811ba108>] writeback_sb_inodes+0x298/0x520
[ 240.211589] [<ffffffff811ba414>] __writeback_inodes_wb+0x84/0xb0
[ 240.214146] [<ffffffff811ba730>] wb_writeback+0x250/0x2c0
[ 240.216427] [<ffffffff811bae43>] wb_workfn+0x213/0x390
[ 240.218507] [<ffffffff8106d113>] process_one_work+0x1c3/0x4c0
[ 240.220757] [<ffffffff8106d0a1>] ? process_one_work+0x151/0x4c0
[ 240.223061] [<ffffffff8106d736>] worker_thread+0x46/0x440
[ 240.225225] [<ffffffff8106d6f0>] ? rescuer_thread+0x2e0/0x2e0
[ 240.227508] [<ffffffff81072aff>] kthread+0xdf/0x100
[ 240.229433] [<ffffffff81072a20>] ? kthread_create_on_node+0x200/0x200
[ 240.231809] [<ffffffff8191cb9f>] ret_from_fork+0x3f/0x70
[ 240.233888] [<ffffffff81072a20>] ? kthread_create_on_node+0x200/0x200
[ 240.236301] 4 locks held by kworker/u8:1/28:
[ 240.238074] #0: ("writeback"){......}, at: [<ffffffff8106d0a1>]
process_one_work+0x151/0x4c0
[ 240.241916] #1: ((&(&wb->dwork)->work)){......}, at:
[<ffffffff8106d0a1>] process_one_work+0x151/0x4c0
[ 240.245960] #2: (&type->s_umount_key#31){......}, at:
[<ffffffff8118e9e6>] trylock_super+0x16/0x50
[ 240.249995] #3: (jbd2_handle){......}, at: [<ffffffff81256f72>]
start_this_handle+0x112/0x440
[ 240.253729] INFO: task jbd2/nbd0-8:2235 blocked for more than 120 seconds.
[ 240.256081] Not tainted 4.3.0-rc2+ #13
[ 240.257692] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 240.260691] jbd2/nbd0-8 D ffff88013fd95f58 13488 2235 2 0x00000000
[ 240.263250] ffff8800ba29bc58 0000000000000086 ffff8800ba29bd40
0000000000000000
[ 240.266402] ffff8801394c9480 ffff8800ba156680 ffff8800ba29c000
ffff8800ba29bd58
[ 240.269519] ffff8800ba156680 ffff8800ba29bd40 ffff8800bb4cea00
ffff8800ba29bc70
[ 240.272677] Call Trace:
[ 240.273805] [<ffffffff8191734e>] schedule+0x2e/0x70
[ 240.275565] [<ffffffff81259d35>]
jbd2_journal_commit_transaction+0x245/0x1760
[ 240.278277] [<ffffffff8108001c>] ? build_sched_domain+0x24c/0x2e0
[ 240.280335] [<ffffffff81092990>] ? wait_woken+0x80/0x80
[ 240.282200] [<ffffffff810b644a>] ? try_to_del_timer_sync+0x4a/0x60
[ 240.284285] [<ffffffff8125e780>] kjournald2+0xc0/0x260
[ 240.286059] [<ffffffff81092990>] ? wait_woken+0x80/0x80
[ 240.287865] [<ffffffff8125e6c0>] ? commit_timeout+0x10/0x10
[ 240.289761] [<ffffffff81072aff>] kthread+0xdf/0x100
[ 240.291485] [<ffffffff81072a20>] ? kthread_create_on_node+0x200/0x200
[ 240.293585] [<ffffffff8191cb9f>] ret_from_fork+0x3f/0x70
[ 240.295404] [<ffffffff81072a20>] ? kthread_create_on_node+0x200/0x200
[ 240.297513] no locks held by jbd2/nbd0-8/2235.
[ 240.299098] INFO: task dd:2255 blocked for more than 120 seconds.
[ 240.301078] Not tainted 4.3.0-rc2+ #13
[ 240.302586] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 240.305422] dd D ffff88013fd95f58 13240 2255 2254 0x00000004
[ 240.307864] ffff880139b3ba38 0000000000000086 0000000000000035
0000000000000000
[ 240.310966] ffff88013ab68000 ffff8801394c9480 ffff880139b3c000
0000000000000000
[ 240.314027] ffff8800bacc8900 0000000000000035 0000000000000000
ffff880139b3ba50
[ 240.317130] Call Trace:
[ 240.318222] [<ffffffff8191734e>] schedule+0x2e/0x70
[ 240.319979] [<ffffffff81256b0f>] wait_transaction_locked+0x7f/0xb0
[ 240.322010] [<ffffffff81092990>] ? wait_woken+0x80/0x80
[ 240.323872] [<ffffffff81256d82>] add_transaction_credits+0x212/0x2a0
[ 240.325980] [<ffffffff81256f65>] start_this_handle+0x105/0x440
[ 240.327882] [<ffffffff81079cf5>] ? finish_task_switch+0xa5/0x1b0
[ 240.329948] [<ffffffff81256a5b>] ? new_handle+0x1b/0x50
[ 240.331771] [<ffffffff812574ea>] jbd2__journal_start+0xba/0x190
[ 240.333716] [<ffffffff8121416c>] ? ext4_da_write_begin+0x10c/0x2d0
[ 240.335806] [<ffffffff8123f13c>] __ext4_journal_start_sb+0x6c/0xd0
[ 240.337835] [<ffffffff8121416c>] ext4_da_write_begin+0x10c/0x2d0
[ 240.339875] [<ffffffff8123f1dd>] ? __ext4_journal_stop+0x3d/0xb0
[ 240.341940] [<ffffffff81128f62>] generic_perform_write+0xa2/0x1e0
[ 240.343930] [<ffffffff811a66d0>] ? file_update_time+0xa0/0xe0
[ 240.345849] [<ffffffff8112b13b>] __generic_file_write_iter+0x18b/0x1e0
[ 240.347949] [<ffffffff81209d60>] ext4_file_write_iter+0x210/0x420
[ 240.349988] [<ffffffff8118e26c>] ? __sb_start_write+0xec/0x100
[ 240.351963] [<ffffffff8118b195>] __vfs_write+0xa5/0xe0
[ 240.353733] [<ffffffff8118b7e4>] vfs_write+0xa4/0x190
[ 240.355512] [<ffffffff8118b71f>] ? vfs_read+0x10f/0x130
[ 240.357310] [<ffffffff8118c4a4>] SyS_write+0x44/0xa0
[ 240.359014] [<ffffffff8191c832>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 240.361142] 2 locks held by dd/2255:
[ 240.362552] #0: (sb_writers#5){......}, at: [<ffffffff8118e26c>]
__sb_start_write+0xec/0x100
[ 240.366119] #1: (&sb->s_type->i_mutex_key#9){......}, at:
[<ffffffff81209cdf>] ext4_file_write_iter+0x18f/0x420

--Sheng

On Sun, Sep 20, 2015 at 11:52 PM, Wouter Verhelst <w@xxxxxxx> wrote:
> Hi,
>
> [linux-kernel CC dropped, since this is irrelevant for them]
>
> On Wed, Sep 16, 2015 at 06:18:50PM -0700, Sheng Yang wrote:
>> Hi, Markus,
>>
>> I've experienced kernel bug and system hang regarding using
>> nbd-client/server on the same machine. Kernel verison is v4.2.
>>
>> I know http://nbd.sourceforge.net/ said:
>> "Please note that read-write nbd with client and server on the same
>> machine is a bad idea: expect deadlock within seconds". But it's
>> mentioned at http://thread.gmane.org/gmane.linux.kernel/869784 that
>> NBD should be able to handle client/server on the same machine.
>
> Yes, that bug has been fixed. I need to figure out how I can update
> those pages again (sourceforge changed the procedure a few times), and
> remove that note.
>
> (the rest is kernel, so for Markus, not me)
>
> --
> It is easy to love a country that is famous for chocolate and beer
>
> -- Barack Obama, speaking in Brussels, Belgium, 2014-03-26
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/