Re: Patch "mm: hugetlbfs: close race during teardown of hugetlbfsshared page tables" has been added to the 3.4-stable tree

From: Paul Gortmaker
Date: Tue Aug 14 2012 - 11:38:29 EST

Next message: Jan Engelhardt: "Re: [PATCH] scripts/package/builddeb: upgrade to current practices"
Previous message: Marcelo Tosatti: "Re: [PATCH v5 00/12] KVM: introduce readonly memslot"
Next in thread: Mel Gorman: "Re: Patch "mm: hugetlbfs: close race during teardown of hugetlbfsshared page tables" has been added to the 3.4-stable tree"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, Aug 13, 2012 at 2:14 PM, <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
>
> This is a note to let you know that I've just added the patch titled
>
> mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
>
> to the 3.4-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary
>
> The filename of the patch is:
> mm-hugetlbfs-close-race-during-teardown-of-hugetlbfs-shared-page-tables.patch
> and it can be found in the queue-3.4 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable@xxxxxxxxxxxxxxx> know about it.

The original was:

include/linux/hugetlb.h | 11 +++++++++++
mm/hugetlb.c | 28 ++++++++++++++++++++++++++--
mm/memory.c | 2 +-
3 files changed, 38 insertions(+), 3 deletions(-)

while the patch below is:

mm/hugetlb.c | 25 +++++++++++++++++++++++--
1 file changed, 23 insertions(+), 2 deletions(-)

It wasn't obvious to me if this was intentional, so I thought I'd
mention it.

Paul.
--

>
>
> From d833352a4338dc31295ed832a30c9ccff5c7a183 Mon Sep 17 00:00:00 2001
> From: Mel Gorman <mgorman@xxxxxxx>
> Date: Tue, 31 Jul 2012 16:46:20 -0700
> Subject: mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
>
> From: Mel Gorman <mgorman@xxxxxxx>
>
> commit d833352a4338dc31295ed832a30c9ccff5c7a183 upstream.
>
> If a process creates a large hugetlbfs mapping that is eligible for page
> table sharing and forks heavily with children some of whom fault and
> others which destroy the mapping then it is possible for page tables to
> get corrupted. Some teardowns of the mapping encounter a "bad pmd" and
> output a message to the kernel log. The final teardown will trigger a
> BUG_ON in mm/filemap.c.
>
> This was reproduced in 3.4 but is known to have existed for a long time
> and goes back at least as far as 2.6.37. It was probably was introduced
> in 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
> look like this;
>
> [ ..........] Lots of bad pmd messages followed by this
> [ 127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
> [ 127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
> [ 127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
> [ 127.186778] ------------[ cut here ]------------
> [ 127.186781] kernel BUG at mm/filemap.c:134!
> [ 127.186782] invalid opcode: 0000 [#1] SMP
> [ 127.186783] CPU 7
> [ 127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
> [ 127.186801]
> [ 127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
> [ 127.186804] RIP: 0010:[<ffffffff810ed6ce>] [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
> [ 127.186809] RSP: 0000:ffff8804144b5c08 EFLAGS: 00010002
> [ 127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
> [ 127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
> [ 127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
> [ 127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
> [ 127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
> [ 127.186815] FS: 00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
> [ 127.186816] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
> [ 127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
> [ 127.186821] Stack:
> [ 127.186822] ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
> [ 127.186824] ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
> [ 127.186825] ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
> [ 127.186827] Call Trace:
> [ 127.186829] [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
> [ 127.186832] [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
> [ 127.186834] [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
> [ 127.186837] [<ffffffff811655c7>] evict+0xa7/0x1b0
> [ 127.186839] [<ffffffff811657a3>] iput_final+0xd3/0x1f0
> [ 127.186840] [<ffffffff811658f9>] iput+0x39/0x50
> [ 127.186842] [<ffffffff81162708>] d_kill+0xf8/0x130
> [ 127.186843] [<ffffffff81162812>] dput+0xd2/0x1a0
> [ 127.186845] [<ffffffff8114e2d0>] __fput+0x170/0x230
> [ 127.186848] [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
> [ 127.186849] [<ffffffff8114e3ad>] fput+0x1d/0x30
> [ 127.186851] [<ffffffff81117db7>] remove_vma+0x37/0x80
> [ 127.186853] [<ffffffff81119182>] do_munmap+0x2d2/0x360
> [ 127.186855] [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
> [ 127.186857] [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
> [ 127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
> [ 127.186868] RIP [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
> [ 127.186870] RSP <ffff8804144b5c08>
> [ 127.186871] ---[ end trace 7cbac5d1db69f426 ]---
>
> The bug is a race and not always easy to reproduce. To reproduce it I was
> doing the following on a single socket I7-based machine with 16G of RAM.
>
> $ hugeadm --pool-pages-max DEFAULT:13G
> $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
> $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
> $ for i in `seq 1 9000`; do ./hugetlbfs-test; done
>
> On my particular machine, it usually triggers within 10 minutes but
> enabling debug options can change the timing such that it never hits.
> Once the bug is triggered, the machine is in trouble and needs to be
> rebooted. The machine will respond but processes accessing proc like "ps
> aux" will hang due to the BUG_ON. shutdown will also hang and needs a
> hard reset or a sysrq-b.
>
> The basic problem is a race between page table sharing and teardown. For
> the most part page table sharing depends on i_mmap_mutex. In some cases,
> it is also taking the mm->page_table_lock for the PTE updates but with
> shared page tables, it is the i_mmap_mutex that is more important.
>
> Unfortunately it appears to be also insufficient. Consider the following
> situation
>
> Process A Process B
> --------- ---------
> hugetlb_fault shmdt
> LockWrite(mmap_sem)
> do_munmap
> unmap_region
> unmap_vmas
> unmap_single_vma
> unmap_hugepage_range
> Lock(i_mmap_mutex)
> Lock(mm->page_table_lock)
> huge_pmd_unshare/unmap tables <--- (1)
> Unlock(mm->page_table_lock)
> Unlock(i_mmap_mutex)
> huge_pte_alloc ...
> Lock(i_mmap_mutex) ...
> vma_prio_walk, find svma, spte ...
> Lock(mm->page_table_lock) ...
> share spte ...
> Unlock(mm->page_table_lock) ...
> Unlock(i_mmap_mutex) ...
> hugetlb_no_page <--- (2)
> free_pgtables
> unlink_file_vma
> hugetlb_free_pgd_range
> remove_vma_list
>
> In this scenario, it is possible for Process A to share page tables with
> Process B that is trying to tear them down. The i_mmap_mutex on its own
> does not prevent Process A walking Process B's page tables. At (1) above,
> the page tables are not shared yet so it unmaps the PMDs. Process A sets
> up page table sharing and at (2) faults a new entry. Process B then trips
> up on it in free_pgtables.
>
> This patch fixes the problem by adding a new function
> __unmap_hugepage_range_final that is only called when the VMA is about to
> be destroyed. This function clears VM_MAYSHARE during
> unmap_hugepage_range() under the i_mmap_mutex. This makes the VMA
> ineligible for sharing and avoids the race. Superficially this looks like
> it would then be vunerable to truncate and madvise issues but hugetlbfs
> has its own truncate handlers so does not use unmap_mapping_range() and
> does not support madvise(DONTNEED).
>
> This should be treated as a -stable candidate if it is merged.
>
> Test program is as follows. The test case was mostly written by Michal
> Hocko with a few minor changes to reproduce this bug.
>
> ==== CUT HERE ====
>
> static size_t huge_page_size = (2UL << 20);
> static size_t nr_huge_page_A = 512;
> static size_t nr_huge_page_B = 5632;
>
> unsigned int get_random(unsigned int max)
> {
> struct timeval tv;
>
> gettimeofday(&tv, NULL);
> srandom(tv.tv_usec);
> return random() % max;
> }
>
> static void play(void *addr, size_t size)
> {
> unsigned char *start = addr,
> *end = start + size,
> *a;
> start += get_random(size/2);
>
> /* we could itterate on huge pages but let's give it more time. */
> for (a = start; a < end; a += 4096)
> *a = 0;
> }
>
> int main(int argc, char **argv)
> {
> key_t key = IPC_PRIVATE;
> size_t sizeA = nr_huge_page_A * huge_page_size;
> size_t sizeB = nr_huge_page_B * huge_page_size;
> int shmidA, shmidB;
> void *addrA = NULL, *addrB = NULL;
> int nr_children = 300, n = 0;
>
> if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
> perror("shmget:");
> return 1;
> }
>
> if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
> perror("shmat");
> return 1;
> }
> if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
> perror("shmget:");
> return 1;
> }
>
> if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
> perror("shmat");
> return 1;
> }
>
> fork_child:
> switch(fork()) {
> case 0:
> switch (n%3) {
> case 0:
> play(addrA, sizeA);
> break;
> case 1:
> play(addrB, sizeB);
> break;
> case 2:
> break;
> }
> break;
> case -1:
> perror("fork:");
> break;
> default:
> if (++n < nr_children)
> goto fork_child;
> play(addrA, sizeA);
> break;
> }
> shmdt(addrA);
> shmdt(addrB);
> do {
> wait(NULL);
> } while (--n > 0);
> shmctl(shmidA, IPC_RMID, NULL);
> shmctl(shmidB, IPC_RMID, NULL);
> return 0;
> }
>
> [akpm@xxxxxxxxxxxxxxxxxxxx: name the declaration's args, fix CONFIG_HUGETLBFS=n build]
> Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
> Reviewed-by: Michal Hocko <mhocko@xxxxxxx>
> Signed-off-by: Mel Gorman <mgorman@xxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
>
>
> ---
> mm/hugetlb.c | 25 +++++++++++++++++++++++--
> 1 file changed, 23 insertions(+), 2 deletions(-)
>
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2392,6 +2392,22 @@ void unmap_hugepage_range(struct vm_area
> {
> mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
> __unmap_hugepage_range(vma, start, end, ref_page);
> + /*
> + * Clear this flag so that x86's huge_pmd_share page_table_shareable
> + * test will fail on a vma being torn down, and not grab a page table
> + * on its way out. We're lucky that the flag has such an appropriate
> + * name, and can in fact be safely cleared here. We could clear it
> + * before the __unmap_hugepage_range above, but all that's necessary
> + * is to clear it before releasing the i_mmap_mutex below.
> + *
> + * This works because in the contexts this is called, the VMA is
> + * going to be destroyed. It is not vunerable to madvise(DONTNEED)
> + * because madvise is not supported on hugetlbfs. The same applies
> + * for direct IO. unmap_hugepage_range() is only being called just
> + * before free_pgtables() so clearing VM_MAYSHARE will not cause
> + * surprises later.
> + */
> + vma->vm_flags &= ~VM_MAYSHARE;
> mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
> }
>
> @@ -2958,9 +2974,14 @@ void hugetlb_change_protection(struct vm
> }
> }
> spin_unlock(&mm->page_table_lock);
> - mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
> -
> + /*
> + * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
> + * may have cleared our pud entry and done put_page on the page table:
> + * once we release i_mmap_mutex, another task can do the final put_page
> + * and that page table be reused and filled with junk.
> + */
> flush_tlb_range(vma, start, end);
> + mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
> }
>
> int hugetlb_reserve_pages(struct inode *inode,
>
>
> Patches currently in stable-queue which might be from mgorman@xxxxxxx are
>
> queue-3.4/mm-hugetlbfs-close-race-during-teardown-of-hugetlbfs-shared-page-tables.patch
> queue-3.4/mm-fix-wrong-argument-of-migrate_huge_pages-in-soft_offline_huge_page.patch
> --
> To unsubscribe from this list: send the line "unsubscribe stable-commits" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Jan Engelhardt: "Re: [PATCH] scripts/package/builddeb: upgrade to current practices"
Previous message: Marcelo Tosatti: "Re: [PATCH v5 00/12] KVM: introduce readonly memslot"
Next in thread: Mel Gorman: "Re: Patch "mm: hugetlbfs: close race during teardown of hugetlbfsshared page tables" has been added to the 3.4-stable tree"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]