[ 22/85] mm: hugetlbfs: correctly populate shared pmd

From: Greg Kroah-Hartman
Date: Wed Sep 12 2012 - 19:39:01 EST

From: Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx>

3.4-stable review patch. If anyone has any objections, please let me know.


From: Michal Hocko <mhocko@xxxxxxx>

commit eb48c071464757414538c68a6033c8f8c15196f8 upstream.

Each page mapped in a process's address space must be correctly
accounted for in _mapcount. Normally the rules for this are
straightforward but hugetlbfs page table sharing is different. The page
table pages at the PMD level are reference counted while the mapcount
remains the same.

If this accounting is wrong, it causes bugs like this one reported by
Larry Woodman:

kernel BUG at mm/filemap.c:135!
invalid opcode: 0000 [#1] SMP
CPU 22
Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
Pid: 18001, comm: mpitest Tainted: G W 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
RIP: 0010:[<ffffffff8112cfed>] [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170
Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
Call Trace:

During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
shared page tables with the check dst_pte == src_pte. The logic is if
the PMD page is the same, they must be shared. This assumes that the
sharing is between the parent and child. However, if the sharing is
with a different process entirely then this check fails as in this

src_pte----------> data page

For this situation to occur, it must be possible for Parent and Other to
have faulted and failed to share page tables with each other. This is
possible due to the following style of race.

copy_hugetlb_page_range copy_hugetlb_page_range
src_pte == huge_pte_offset src_pte == huge_pte_offset
!src_pte so no sharing !src_pte so no sharing

(time passes)

hugetlb_fault hugetlb_fault
huge_pte_alloc huge_pte_alloc
huge_pmd_share huge_pmd_share
find nothing, no sharing
find nothing, no sharing
pmd_alloc pmd_alloc

These two processes are not poing to the same data page but are not
sharing page tables because the opportunity was missed. When either
process later forks, the src_pte == dst pte is potentially insufficient.
As the check falls through, the wrong PTE information is copied in
(harmless but wrong) and the mapcount is bumped for a page mapped by a
shared page table leading to the BUG_ON.

This patch addresses the issue by moving pmd_alloc into huge_pmd_share
which guarantees that the shared pud is populated in the same critical
section as pmd. This also means that huge_pte_offset test in
huge_pmd_share is serialized correctly now which in turn means that the
success of the sharing will be higher as the racing tasks see the pud
and pmd populated together.

Race identified and changelog written mostly by Mel Gorman.

{akpm@xxxxxxxxxxxxxxxxxxxx: attempt to make the huge_pmd_share() comment comprehensible, clean up coding style]
Reported-by: Larry Woodman <lwoodman@xxxxxxxxxx>
Tested-by: Larry Woodman <lwoodman@xxxxxxxxxx>
Reviewed-by: Mel Gorman <mgorman@xxxxxxx>
Signed-off-by: Michal Hocko <mhocko@xxxxxxx>
Reviewed-by: Rik van Riel <riel@xxxxxxxxxx>
Cc: David Gibson <david@xxxxxxxxxxxxxxxxxxxxx>
Cc: Ken Chen <kenchen@xxxxxxxxxx>
Cc: Cong Wang <xiyou.wangcong@xxxxxxxxx>
Cc: Hillf Danton <dhillf@xxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>

arch/x86/mm/hugetlbpage.c | 21 ++++++++++++++++-----
1 file changed, 16 insertions(+), 5 deletions(-)

--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -56,9 +56,16 @@ static int vma_shareable(struct vm_area_

- * search for a shareable pmd page for hugetlb.
+ * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
+ * and returns the corresponding pte. While this is not necessary for the
+ * !shared pmd case because we can allocate the pmd later as well, it makes the
+ * code much cleaner. pmd allocation is essential for the shared case because
+ * pud has to be populated inside the same i_mmap_mutex section - otherwise
+ * racing tasks could either miss the sharing (see huge_pte_offset) or select a
+ * bad pmd for sharing.
-static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
+static pte_t *
+huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
struct vm_area_struct *vma = find_vma(mm, addr);
struct address_space *mapping = vma->vm_file->f_mapping;
@@ -68,9 +75,10 @@ static void huge_pmd_share(struct mm_str
struct vm_area_struct *svma;
unsigned long saddr;
pte_t *spte = NULL;
+ pte_t *pte;

if (!vma_shareable(vma, addr))
- return;
+ return (pte_t *)pmd_alloc(mm, pud, addr);

vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
@@ -97,7 +105,9 @@ static void huge_pmd_share(struct mm_str
+ pte = (pte_t *)pmd_alloc(mm, pud, addr);
+ return pte;

@@ -142,8 +152,9 @@ pte_t *huge_pte_alloc(struct mm_struct *
} else {
if (pud_none(*pud))
- huge_pmd_share(mm, addr, pud);
- pte = (pte_t *) pmd_alloc(mm, pud, addr);
+ pte = huge_pmd_share(mm, addr, pud);
+ else
+ pte = (pte_t *)pmd_alloc(mm, pud, addr);
BUG_ON(pte && !pte_none(*pte) && !pte_huge(*pte));

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/