RE: XFS memory allocation deadlock in 2.6.38

From: Sean Noonan
Date: Wed Mar 23 2011 - 15:39:16 EST


I believe this patch fixes the behavior:
diff --git a/mm/memory.c b/mm/memory.c
index e48945a..740d5ab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3461,7 +3461,9 @@ int make_pages_present(unsigned long addr, unsigned long end)
* to break COW, except for shared mappings because these don't COW
* and we would not want to dirty them for nothing.
*/
- write = (vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE;
+ write = (vma->vm_flags & VM_WRITE) != 0;
+ if (write && ((vma->vm_flags & VM_SHARED) !=0) && (vma->vm_file == NULL))
+ write = 0;
BUG_ON(addr >= end);
BUG_ON(end > vma->vm_end);
len = DIV_ROUND_UP(end, PAGE_SIZE) - addr/PAGE_SIZE;


This was traced to the following commit:
5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272 is the first bad commit
commit 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272
Author: Michel Lespinasse <walken@xxxxxxxxxx>
Date: Thu Jan 13 15:46:09 2011 -0800

mlock: avoid dirtying pages and triggering writeback

When faulting in pages for mlock(), we want to break COW for anonymous or
file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is no
need to write-fault into VM_SHARED vmas since shared file pages can be
mlocked first and dirtied later, when/if they actually get written to.
Skipping the write fault is desirable, as we don't want to unnecessarily
cause these pages to be dirtied and queued for writeback.

Signed-off-by: Michel Lespinasse <walken@xxxxxxxxxx>
Cc: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxx>
Cc: Kosaki Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Nick Piggin <npiggin@xxxxxxxxx>
Cc: Theodore Tso <tytso@xxxxxxxxxx>
Cc: Michael Rubin <mrubin@xxxxxxxxxx>
Cc: Suleiman Souhlal <suleiman@xxxxxxxxxx>
Cc: Dave Chinner <david@xxxxxxxxxxxxx>
Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>

:040000 040000 604eede2f45b7e5276ce9725b715ed15a868861d 3c175eadf4cf33d4f78d4d455c9a04f3df2c199e M mm


-----Original Message-----
From: Sean Noonan
Sent: Monday, March 21, 2011 12:20
To: 'linux-kernel@xxxxxxxxxxxxxxx'
Cc: Trammell Hudson; Martin Bligh; Stephen Degler; Christos Zoulas
Subject: XFS memory allocation deadlock in 2.6.38

This message was originally posted to the XFS mailing list, but received no responses. Thus, I am sending it to LKML on the advice of Martin.

Using the attached program, we are able to reproduce this bug reliably.
$ make vmtest
$ ./vmtest /xfs/hugefile.dat $(( 16 * 1024 * 1024 * 1024 )) # vmtest <path_to_file> <size_in_bytes>
/xfs/hugefile.dat: mapped 17179869184 bytes in 33822066943 ticks
749660: avg 13339 max 234667 ticks
371945: avg 26885 max 281616 ticks
---
At this point, we see the following on the console:
[593492.694806] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[593506.724367] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[593524.837717] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[593556.742386] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)

This is the same message presented in
http://oss.sgi.com/bugzilla/show_bug.cgi?id=410

We started testing with 2.6.38-rc7 and have seen this bug through to the .0 release. This does not appear to be present in 2.6.33, but we have not done testing in between. We have tested with ext4 and do not encounter this bug.
CONFIG_XFS_FS=y
CONFIG_XFS_QUOTA=y
CONFIG_XFS_POSIX_ACL=y
CONFIG_XFS_RT=y
# CONFIG_XFS_DEBUG is not set
# CONFIG_VXFS_FS is not set

Here is the stack from the process:
[<ffffffff81357553>] call_rwsem_down_write_failed+0x13/0x20
[<ffffffff812ddf1e>] xfs_ilock+0x7e/0x110
[<ffffffff8130132f>] __xfs_get_blocks+0x8f/0x4e0
[<ffffffff813017b1>] xfs_get_blocks+0x11/0x20
[<ffffffff8114ba3e>] __block_write_begin+0x1ee/0x5b0
[<ffffffff8114be9d>] block_page_mkwrite+0x9d/0xf0
[<ffffffff81307e05>] xfs_vm_page_mkwrite+0x15/0x20
[<ffffffff810f2ddb>] do_wp_page+0x54b/0x820
[<ffffffff810f347c>] handle_pte_fault+0x3cc/0x820
[<ffffffff810f5145>] handle_mm_fault+0x175/0x2f0
[<ffffffff8102e399>] do_page_fault+0x159/0x470
[<ffffffff816cf6cf>] page_fault+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

# uname -a
Linux testhost 2.6.38 #2 SMP PREEMPT Fri Mar 18 15:00:59 GMT 2011 x86_64 GNU/Linux

Please let me know if additional information is required.

Thanks!

Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/