Re: [ 08/82] mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition

From: Konrad Rzeszutek Wilk
Date: Thu Jun 07 2012 - 14:00:07 EST


On Thu, Jun 07, 2012 at 01:03:44PM +0900, Greg KH wrote:
> 3.4-stable review patch. If anyone has any objections, please let me know.

It breaks 32-bit Linux running under Amazon EC2. Please
don't apply it to any 3.x kernels until we figure out a
fix for this.


>
> ------------------
>
> From: Andrea Arcangeli <aarcange@xxxxxxxxxx>
>
> commit 26c191788f18129af0eb32a358cdaea0c7479626 upstream.
>
> When holding the mmap_sem for reading, pte_offset_map_lock should
> only run on a pmd_t that has been read atomically from the pmdp
> pointer; otherwise we may read only half of it, leading to this crash.
>
> PID: 11679  TASK: f06e8000  CPU: 3  COMMAND: "do_race_2_panic"
>  #0 [f06a9dd8] crash_kexec at c049b5ec
>  #1 [f06a9e2c] oops_end at c083d1c2
>  #2 [f06a9e40] no_context at c0433ded
>  #3 [f06a9e64] bad_area_nosemaphore at c043401a
>  #4 [f06a9e6c] __do_page_fault at c0434493
>  #5 [f06a9eec] do_page_fault at c083eb45
>  #6 [f06a9f04] error_code (via page_fault) at c083c5d5
>     EAX: 01fb470c  EBX: fff35000  ECX: 00000003  EDX: 00000100  EBP: 00000000
>     DS:  007b      ESI: 9e201000  ES:  007b      EDI: 01fb4700  GS:  00e0
>     CS:  0060      EIP: c083bc14  ERR: ffffffff  EFLAGS: 00010246
>  #7 [f06a9f38] _spin_lock at c083bc14
>  #8 [f06a9f44] sys_mincore at c0507b7d
>  #9 [f06a9fb0] system_call at c083becd
>                    start          len
>     EAX: ffffffda  EBX: 9e200000  ECX: 00001000  EDX: 6228537f
>     DS:  007b      ESI: 00000000  ES:  007b      EDI: 003d0f00
>     SS:  007b      ESP: 62285354  EBP: 62285388  GS:  0033
>     CS:  0073      EIP: 00291416  ERR: 000000da  EFLAGS: 00000286
>
> This should be a longstanding bug affecting x86 32bit PAE without THP.
> Only archs with 64bit large pmd_t and 32bit unsigned long should be
> affected.
>
> With THP enabled, the barrier() in pmd_none_or_trans_huge_or_clear_bad()
> would partly hide the bug when the pmd transitions from none to stable,
> by forcing a re-read of the *pmd in pte_offset_map_lock, but a new set
> of problems arises because the pmd can then transition freely among the
> none, pmd_trans_huge and pmd_trans_stable states. So making the barrier
> in pmd_none_or_trans_huge_or_clear_bad() unconditional isn't a good
> idea; it would be a flaky solution.
>
> This is fully fixed by introducing pmd_read_atomic(), which reads the
> two halves of the pmd in a safe order (low half first) when THP is
> disabled, and reads the whole pmd atomically with cmpxchg8b when THP
> is enabled.
>
> Luckily this new race condition only triggers in the places that must
> already be covered by pmd_none_or_trans_huge_or_clear_bad(), so the fix
> is localized there, even though the bug itself is not related to THP.
>
> NOTE: this can only trigger on x86 32bit systems with PAE enabled and
> more than 4G of RAM; otherwise the high part of the pmd is zero at all
> times, so a torn read can never return a wrong value, which in turn
> hides the SMP race.
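>
> (An illustrative aside, not part of the commit: with PAE a pmd entry
> is a 64-bit value whose high 32 bits carry physical address bits
> 32-63 of the PTE page it points to. A hypothetical C sketch, with
> made-up entry values, of why the high half stays zero below 4G:
>
> 	#include <stdint.h>
> 	#include <stdio.h>
>
> 	int main(void)
> 	{
> 		uint64_t pmd_below_4g = 0x00000000fda38067ULL; /* PTE page below 4G */
> 		uint64_t pmd_above_4g = 0x00000003fda38067ULL; /* PTE page above 4G */
>
> 		/*
> 		 * A torn read yields {stale high | fresh low}; if the
> 		 * high half is always zero, the torn value equals the
> 		 * fresh value and the race is invisible.
> 		 */
> 		printf("%08x vs %08x\n",
> 		       (uint32_t)(pmd_below_4g >> 32),
> 		       (uint32_t)(pmd_above_4g >> 32));
> 		return 0;
> 	}
>
> It prints "00000000 vs 00000003".)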
>
> This bug was discovered and fully debugged by Ulrich, quote:
>
> ----
> [..]
> pmd_none_or_trans_huge_or_clear_bad() loads the pmd into edx:eax.
>
> static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
> {
> 	/* depend on compiler for an atomic pmd read */
> 	pmd_t pmdval = *pmd;
>
> // edi = pmd pointer
> 0xc0507a74 <sys_mincore+548>: mov 0x8(%esp),%edi
> ...
> // edx = PTE page table high address
> 0xc0507a84 <sys_mincore+564>: mov 0x4(%edi),%edx
> ...
> // eax = PTE page table low address
> 0xc0507a8e <sys_mincore+574>: mov (%edi),%eax
>
> [..]
>
> Please note that the PMD is not read atomically. These are two "mov"
> instructions where the high order bits of the PMD entry are fetched
> first. Hence, the above machine code is prone to the following race.
>
> - The PMD entry {high|low} is 0x0000000000000000.
> The "mov" at 0xc0507a84 loads 0x00000000 into edx.
>
> - A page fault (on another CPU) sneaks in between the two "mov"
> instructions and instantiates the PMD.
>
> - The PMD entry {high|low} is now 0x00000003fda38067.
> The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
> ----
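>
> The torn read is easy to reproduce in userspace. Below is a minimal
> sketch (purely illustrative, not kernel code; names and values are
> made up): one thread publishes and clears a 64-bit "pmd" with atomic
> 64-bit stores, standing in for pmd_populate() and MADV_DONTNEED,
> while another reads it as two 32-bit halves, high word first, exactly
> like the compiled sys_mincore code above:
>
> 	/* build with: cc -O2 -fno-strict-aliasing -pthread torn.c */
> 	#include <pthread.h>
> 	#include <stdint.h>
> 	#include <stdio.h>
>
> 	#define POPULATED 0x00000003fda38067ULL
>
> 	static uint64_t pmd;	/* the shared "pmd entry" */
>
> 	static void *writer(void *arg)
> 	{
> 		(void)arg;
> 		for (;;) {
> 			/* like pmd_populate(): one atomic 64-bit store */
> 			__atomic_store_n(&pmd, POPULATED, __ATOMIC_RELAXED);
> 			/* like MADV_DONTNEED: clear it again */
> 			__atomic_store_n(&pmd, 0ULL, __ATOMIC_RELAXED);
> 		}
> 		return NULL;
> 	}
>
> 	int main(void)
> 	{
> 		/* little-endian: half[0] is the low word, half[1] the high */
> 		volatile uint32_t *half = (volatile uint32_t *)&pmd;
> 		pthread_t t;
>
> 		pthread_create(&t, NULL, writer, NULL);
> 		for (;;) {
> 			uint32_t hi = half[1];	/* mov 0x4(%edi),%edx */
> 			uint32_t lo = half[0];	/* mov (%edi),%eax    */
> 			uint64_t v = ((uint64_t)hi << 32) | lo;
>
> 			if (v != 0 && v != POPULATED) {
> 				printf("torn pmd: %016llx\n",
> 				       (unsigned long long)v);
> 				return 1;
> 			}
> 		}
> 	}
>
> On a multicore box this prints a torn value such as 00000000fda38067
> almost immediately.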
>
> Reported-by: Ulrich Obergfell <uobergfe@xxxxxxxxxx>
> Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxx>
> Cc: Hugh Dickins <hughd@xxxxxxxxxx>
> Cc: Larry Woodman <lwoodman@xxxxxxxxxx>
> Cc: Petr Matousek <pmatouse@xxxxxxxxxx>
> Cc: Rik van Riel <riel@xxxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
>
> ---
>  arch/x86/include/asm/pgtable-3level.h |   50 ++++++++++++++++++++++++++++++++
>  include/asm-generic/pgtable.h         |   22 +++++++++++++-
>  2 files changed, 70 insertions(+), 2 deletions(-)
>
> --- a/arch/x86/include/asm/pgtable-3level.h
> +++ b/arch/x86/include/asm/pgtable-3level.h
> @@ -31,6 +31,56 @@ static inline void native_set_pte(pte_t
>  	ptep->pte_low = pte.pte_low;
>  }
>
> +#define pmd_read_atomic pmd_read_atomic
> +/*
> + * pte_offset_map_lock on 32bit PAE kernels was reading the pmd_t with
> + * a "*pmdp" dereference done by gcc. Problem is, in certain places
> + * where pte_offset_map_lock is called, concurrent page faults are
> + * allowed if the mmap_sem is held for reading. An example is mincore
> + * vs page faults vs MADV_DONTNEED. On the page fault side
> + * pmd_populate rightfully does a set_64bit, but if we're reading the
> + * pmd_t with a "*pmdp" on the mincore side, an SMP race can happen
> + * because gcc will not read the 64bit pmd atomically. To fix this,
> + * all places running pte_offset_map_lock() while holding the
> + * mmap_sem in read mode shall read the pmdp pointer using this
> + * function, to know if the pmd is null or not, and in turn to know if
> + * they can run pte_offset_map_lock or pmd_trans_huge or other pmd
> + * operations.
> + *
> + * Without THP, if the mmap_sem is held for reading, the pmd can only
> + * transition from null to not null while pmd_read_atomic runs, so
> + * there's no need to literally read it atomically.
> + *
> + * With THP, if the mmap_sem is held for reading, the pmd can become
> + * THP or null or point to a pte (and in turn become "stable") at any
> + * time under pmd_read_atomic, so it's mandatory to read it atomically
> + * with cmpxchg8b.
> + */
> +#ifndef CONFIG_TRANSPARENT_HUGEPAGE
> +static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> +{
> +	pmdval_t ret;
> +	u32 *tmp = (u32 *)pmdp;
> +
> +	ret = (pmdval_t) (*tmp);
> +	if (ret) {
> +		/*
> +		 * If the low part is null, we must not read the high part
> +		 * or we can end up with a partial pmd.
> +		 */
> +		smp_rmb();
> +		ret |= ((pmdval_t)*(tmp + 1)) << 32;
> +	}
> +
> +	return (pmd_t) { ret };
> +}
> +#else /* CONFIG_TRANSPARENT_HUGEPAGE */
> +static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> +{
> +	return (pmd_t) { atomic64_read((atomic64_t *)pmdp) };
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
>  static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
>  {
>  	set_64bit((unsigned long long *)(ptep), native_pte_val(pte));
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -446,6 +446,18 @@ static inline int pmd_write(pmd_t pmd)
>  #endif /* __HAVE_ARCH_PMD_WRITE */
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> +#ifndef pmd_read_atomic
> +static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> +{
> +	/*
> +	 * Depend on compiler for an atomic pmd read. NOTE: this is
> +	 * only going to work if the pmdval_t isn't larger than
> +	 * an unsigned long.
> +	 */
> +	return *pmdp;
> +}
> +#endif
> +
>  /*
>   * This function is meant to be used by sites walking pagetables with
>   * the mmap_sem held in read mode to protect against MADV_DONTNEED and
> @@ -459,11 +471,17 @@ static inline int pmd_write(pmd_t pmd)
>   * undefined so behaving as if the pmd was none is safe (because it
>   * can return none anyway). The compiler level barrier() is critically
>   * important to compute the two checks atomically on the same pmdval.
> + *
> + * For 32bit kernels with a 64bit large pmd_t this automatically takes
> + * care of reading the pmd atomically to avoid SMP race conditions
> + * against pmd_populate() when the mmap_sem is held for reading by the
> + * caller (a special atomic read, not done by "gcc" as in the generic
> + * version above, is also needed when THP is disabled because the page
> + * fault can populate the pmd from under us).
>   */
>  static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
>  {
> -	/* depend on compiler for an atomic pmd read */
> -	pmd_t pmdval = *pmd;
> +	pmd_t pmdval = pmd_read_atomic(pmd);
>  	/*
>  	 * The barrier will stabilize the pmdval in a register or on
>  	 * the stack so that it will stop changing under the code.
>
>
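> In the same userspace terms as the earlier sketch, the !THP
> pmd_read_atomic() above boils down to this read sequence (again only
> a sketch with invented names; the acquire fence stands in for
> smp_rmb() and half[0] is the low word on little-endian):
>
> 	#include <stdint.h>
>
> 	uint64_t read_low_then_high(const volatile uint64_t *pmdp)
> 	{
> 		const volatile uint32_t *half =
> 			(const volatile uint32_t *)pmdp;
> 		uint64_t ret = half[0];		/* low word first */
>
> 		if (ret) {
> 			/*
> 			 * The low word was non-zero, so the entry was
> 			 * already fully populated (set_64bit writes both
> 			 * halves at once) and, none -> not-none being the
> 			 * only transition allowed here, the high word is
> 			 * stable and safe to read after a read barrier.
> 			 */
> 			__atomic_thread_fence(__ATOMIC_ACQUIRE);
> 			ret |= (uint64_t)half[1] << 32;
> 		}
> 		return ret;	/* either 0 ("none") or the full entry */
> 	}
>
> If the low word is still zero, the function reports "none", which is
> always a safe answer while the pmd can only go from none to populated.
>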
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/