Deadlocks with transparent huge pages and userspace fs daemons

From: Dave Hansen
Date: Wed Nov 03 2010 - 16:43:39 EST


Hey Miklos,

When testing with a transparent huge page kernel:

http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/andrea/aa.git;a=summary

some IBM testers ran into deadlocks. It appears that khugepaged is
trying to migrate one of a filesystem daemon's pages while it holds the
daemon's mmap_sem for write; meanwhile, the daemon is stuck in a page
fault waiting on that same mmap_sem, so it can never service the I/O
that would unlock the page.

I think I've reproduced this issue in a slightly different form with
FUSE. In my case, I think the FUSE process actually deadlocks on itself
instead of with khugepaged as in the IBM tester example that got me
looking at this.

Andrea put it this way:
> As long as page faults are needed to execute the I/O I doubt it's safe. But
> I'll definitely change khugepaged not to allocate memory. If nothing else
> because I don't want khugepaged to make easier to trigger issues like this. But
> it's hard for me to consider this a bug of khugepaged from a theoretical
> standpoint.

I tend to agree. khugepaged makes the likelihood of these things
happening much higher, but I don't think it fundamentally creates the
issue.

Should we do something like make page compaction always non-blocking on
lock_page()? Should we teach the VM about FUSE daemons somehow?
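
To make the cycle concrete, here's a toy userspace model of the
ordering (a pthread rwlock stands in for mmap_sem, a mutex for the page
lock; the thread names and the whole setup are made up, this is not
kernel code). With a plain lock on the "compaction" side the two
threads stall exactly the way the traces below do; with trylock-and-skip
they don't:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t mmap_sem = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;

/* stand-in for khugepaged: takes "mmap_sem" for write, then wants the
 * page lock */
static void *collapse_thread(void *arg)
{
        pthread_rwlock_wrlock(&mmap_sem);
        sleep(1);               /* give the daemon time to lock the page */
        if (pthread_mutex_trylock(&page_lock)) {
                /* non-blocking variant: skip the locked page.  Change
                 * this to pthread_mutex_lock() and both threads stall
                 * forever. */
                printf("page is locked, skipping it\n");
        } else {
                pthread_mutex_unlock(&page_lock);
        }
        pthread_rwlock_unlock(&mmap_sem);
        return NULL;
}

/* stand-in for the fs daemon: holds the page lock (I/O in flight),
 * then faults and needs "mmap_sem" for read */
static void *daemon_thread(void *arg)
{
        pthread_mutex_lock(&page_lock);
        sleep(1);               /* give khugepaged time to take mmap_sem */
        pthread_rwlock_rdlock(&mmap_sem);       /* the page fault */
        pthread_rwlock_unlock(&mmap_sem);
        pthread_mutex_unlock(&page_lock);
        return NULL;
}

int main(void)
{
        pthread_t a, b;
        pthread_create(&a, NULL, collapse_thread, NULL);
        pthread_create(&b, NULL, daemon_thread, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}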

INFO: task unionfs:3527 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
unionfs D ffff88007d356ec0 0 3527 3478 0x00000000
ffff88007b0db9a8 0000000000000082 ffffea00000650c8 ffff88007d356c70
ffff88007d1286a0 000000000000000d 0000000000000000 0000000000000301
ffff88007b0db978 ffffffff81098f70 ffff88007b0dba58 ffff880001db1f40
Call Trace:
[<ffffffff81098f70>] ? vma_prio_tree_next+0x3c/0x52
[<ffffffff813eb183>] io_schedule+0x38/0x4d
[<ffffffff8108683a>] sync_page+0x44/0x48
[<ffffffff813eb5e7>] __wait_on_bit_lock+0x42/0x8a
[<ffffffff810867f6>] ? sync_page+0x0/0x48
[<ffffffff810867e2>] __lock_page+0x64/0x6b
[<ffffffff810467bb>] ? wake_bit_function+0x0/0x2a
[<ffffffff810bce62>] migrate_pages+0x1df/0x66b
[<ffffffff810b8b33>] ? compaction_alloc+0x0/0x2b9
[<ffffffff8108fa2c>] ? ____pagevec_lru_add+0x13c/0x14f
[<ffffffff810b85e5>] compact_zone+0x331/0x54d
[<ffffffff810b89e4>] compact_zone_order+0xaa/0xb9
[<ffffffff810b8acd>] try_to_compact_pages+0xda/0x140
[<ffffffff8108c3f0>] __alloc_pages_nodemask+0x3a6/0x74b
[<ffffffff810b5db5>] alloc_pages_vma+0x110/0x13d
[<ffffffff810c6d6d>] do_huge_pmd_anonymous_page+0xc0/0x287
[<ffffffff810a0ed7>] handle_mm_fault+0x15c/0x201
[<ffffffff813efa5c>] do_page_fault+0x304/0x422
[<ffffffff810a5e8a>] ? do_brk+0x282/0x2c8
[<ffffffff813ed40f>] page_fault+0x1f/0x30

I had to make some changes to the transparent huge page code to get this
to happen. First, I made the scanning *REALLY* aggressive:

echo 1 > /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
echo 1 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
echo 65536 > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

Then, I hacked migrate_pages()'s call of unmap_and_move() to always pass
'force', so that it tries to lock_page() unconditionally. That's just
to make this race more common. I also created some large malloc()'d
memory areas in the unionfs daemon and touched them constantly to cause
lots of page faults.
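
The touch loop was roughly this shape (a rough sketch, not the actual
test program; the region size and per-page stride here are arbitrary):

#include <stdlib.h>

#define REGION_SIZE     (256UL << 20)   /* 256MB of anonymous memory */
#define PAGE_SIZE       4096UL

int main(void)
{
        char *buf = malloc(REGION_SIZE);

        if (!buf)
                return 1;

        /* touch one byte per page, forever, so the address space keeps
         * faulting and keeps giving khugepaged candidate ranges */
        for (;;)
                for (unsigned long off = 0; off < REGION_SIZE; off += PAGE_SIZE)
                        buf[off]++;
}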

Other relevant tasks:

INFO: task mmap-and-touch:3584 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mmap-and-touc D ffff88007bd71510 0 3584 3542 0x00000000
ffff88007a591b88 0000000000000086 ffff88007bd57400 ffff88007bd712c0
ffff88007d01cd70 ffffffff00000004 ffff88007d22e578 ffff88005e5b7440
ffff88007a591b58 0000000181182a8c ffff88007a591b88 ffff880001c91f40
Call Trace:
[<ffffffff813eb183>] io_schedule+0x38/0x4d
[<ffffffff8108683a>] sync_page+0x44/0x48
[<ffffffff813eb5e7>] __wait_on_bit_lock+0x42/0x8a
[<ffffffff810867f6>] ? sync_page+0x0/0x48
[<ffffffff810867e2>] __lock_page+0x64/0x6b
[<ffffffff810467bb>] ? wake_bit_function+0x0/0x2a
[<ffffffff810868a1>] find_lock_page+0x39/0x5d
[<ffffffff81087f60>] filemap_fault+0x1a6/0x30e
[<ffffffff8109e5e0>] __do_fault+0x50/0x432
[<ffffffff8109f636>] handle_pte_fault+0x2db/0x717
[<ffffffff8108b67c>] ? __free_pages+0x1b/0x24
[<ffffffff810a0d6c>] ? __pte_alloc+0x112/0x121
[<ffffffff810a0f64>] handle_mm_fault+0x1e9/0x201
[<ffffffff813efa5c>] do_page_fault+0x304/0x422
[<ffffffff810cc83d>] ? sys_newfstat+0x29/0x34
[<ffffffff813ed40f>] page_fault+0x1f/0x30
INFO: task memknobs:3599 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
memknobs D ffff88007d305b20 0 3599 3573 0x00000000
ffff88005e4539a8 0000000000000086 ffff88005e453978 ffff88007d3058d0
ffff88007dbb60d0 ffffea0000000002 000000003963d000 ffff88007a4c11e8
ffffea000033aa10 000000017b1e69e0 ffff88005e453988 ffff880001c51f40
Call Trace:
[<ffffffff813eb183>] io_schedule+0x38/0x4d
[<ffffffff8108683a>] sync_page+0x44/0x48
[<ffffffff813eb5e7>] __wait_on_bit_lock+0x42/0x8a
[<ffffffff810867f6>] ? sync_page+0x0/0x48
[<ffffffff810867e2>] __lock_page+0x64/0x6b
[<ffffffff810467bb>] ? wake_bit_function+0x0/0x2a
[<ffffffff810bce62>] migrate_pages+0x1df/0x66b
[<ffffffff810b8b33>] ? compaction_alloc+0x0/0x2b9
[<ffffffff8108fa2c>] ? ____pagevec_lru_add+0x13c/0x14f
[<ffffffff810b85e5>] compact_zone+0x331/0x54d
[<ffffffff810b89e4>] compact_zone_order+0xaa/0xb9
[<ffffffff810b8acd>] try_to_compact_pages+0xda/0x140
[<ffffffff8108c3f0>] __alloc_pages_nodemask+0x3a6/0x74b
[<ffffffff810b5db5>] alloc_pages_vma+0x110/0x13d
[<ffffffff810c6d6d>] do_huge_pmd_anonymous_page+0xc0/0x287
[<ffffffff810a0ed7>] handle_mm_fault+0x15c/0x201
[<ffffffff813efa5c>] do_page_fault+0x304/0x422
[<ffffffff81020e5e>] ? __dequeue_entity+0x2e/0x33
[<ffffffff81000e25>] ? __switch_to+0x22a/0x23c
[<ffffffff81020e7b>] ? set_next_entity+0x18/0x36
[<ffffffff81022e83>] ? finish_task_switch+0x3c/0x81
[<ffffffff813eb0a5>] ? schedule+0x6f4/0x79a
[<ffffffff813ed40f>] page_fault+0x1f/0x30
INFO: task khugepaged:515 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
khugepaged D ffff88007d1e8360 0 515 2 0x00000000
ffff88007cad5d00 0000000000000046 ffff88007cad5cc0 ffff88007d1e8110
ffff88007d0986e0 0000000000000008 ffff88007cad5ce0 ffffffff81037e33
00000000ffffffff 000000017cad5d50 00000001000dd090 0000000000000002
Call Trace:
[<ffffffff81037e33>] ? lock_timer_base+0x26/0x4a
[<ffffffff813ec8af>] rwsem_down_failed_common+0xcc/0xfe
[<ffffffff813ec8f4>] rwsem_down_write_failed+0x13/0x15
[<ffffffff811ccef3>] call_rwsem_down_write_failed+0x13/0x20
[<ffffffff813ec09b>] ? down_write+0x20/0x22
[<ffffffff810c6174>] khugepaged+0xee0/0xf5f
[<ffffffff81046783>] ? autoremove_wake_function+0x0/0x38
[<ffffffff810c5294>] ? khugepaged+0x0/0xf5f
[<ffffffff810462ce>] kthread+0x81/0x89
[<ffffffff81002cf4>] kernel_thread_helper+0x4/0x10
[<ffffffff8104624d>] ? kthread+0x0/0x89
[<ffffffff81002cf0>] ? kernel_thread_helper+0x0/0x10


Original stack trace from GPFS deadlock:

> khugepaged D ffff88007c823080 0 52 2 0x00000000
> ffff8800378c98f0 0000000000000046 0000000000000000 001a7949f3208ca4
> ffffffffffffff10 ffff880079efc670 000000002b6c79c0 00000001169be651
> ffff88003780c638 ffff8800378c9fd8 0000000000010518 ffff88003780c638
> Call Trace:
> [<ffffffff8110c060>] ? sync_page+0x0/0x50
> [<ffffffff814c8a23>] io_schedule+0x73/0xc0
> [<ffffffff8110c09d>] sync_page+0x3d/0x50
> [<ffffffff814c914a>] __wait_on_bit_lock+0x5a/0xc0
> [<ffffffff8110c037>] __lock_page+0x67/0x70
> [<ffffffff81091ce0>] ? wake_bit_function+0x0/0x50
> [<ffffffff81122461>] ? lru_cache_add_lru+0x21/0x40
> [<ffffffff8115b730>] lock_page+0x30/0x40
> [<ffffffff8115bdad>] migrate_pages+0x59d/0x5d0
> [<ffffffff81152470>] ? compaction_alloc+0x0/0x370
> [<ffffffff81151f1c>] compact_zone+0x4ac/0x5e0
> [<ffffffff8111cd1c>] ? get_page_from_freelist+0x15c/0x820
> [<ffffffff811522ce>] compact_zone_order+0x7e/0xb0
> [<ffffffff81152409>] try_to_compact_pages+0x109/0x170
> [<ffffffff8111e62c>] __alloc_pages_nodemask+0x55c/0x810
> [<ffffffff81150374>] alloc_pages_vma+0x84/0x110
> [<ffffffff8116530f>] khugepaged+0xa4f/0x1190
> [<ffffffff81091ca0>] ? autoremove_wake_function+0x0/0x40
> [<ffffffff811648c0>] ? khugepaged+0x0/0x1190
> [<ffffffff81091936>] kthread+0x96/0xa0
> [<ffffffff810141ca>] child_rip+0xa/0x20
> [<ffffffff810918a0>] ? kthread+0x0/0xa0
> [<ffffffff810141c0>] ? child_rip+0x0/0x20
>
>
> mmfsd D ffff88007c823680 0 4453 4118 0x00000080
> ffff88001ad1ddf0 0000000000000082 0000000000000000 0000000000000000
> 0000000000000000 ffff880037fcee40 ffff880079d40ab0 00000001169be9c1
> ffff8800782b7ad8 ffff88001ad1dfd8 0000000000010518 ffff8800782b7ad8
> Call Trace:
> [<ffffffff814c8286>] ? thread_return+0x4e/0x778
> [<ffffffff81095da3>] ? __hrtimer_start_range_ns+0x1a3/0x430
> [<ffffffff814ca6b5>] rwsem_down_failed_common+0x95/0x1d0
> [<ffffffff814ca846>] rwsem_down_read_failed+0x26/0x30
> [<ffffffff81264224>] call_rwsem_down_read_failed+0x14/0x30
> [<ffffffff814c9d44>] ? down_read+0x24/0x30
> [<ffffffff814cd6fa>] do_page_fault+0x34a/0x3a0
> [<ffffffff814caf45>] page_fault+0x25/0x30


-- Dave
