Re: [RFC] page-table walkers vs memory order

From: Paul E. McKenney
Date: Mon Jul 23 2012 - 15:28:09 EST


On Mon, Jul 23, 2012 at 07:34:30PM +0200, Peter Zijlstra wrote:
>
> While staring at mm/huge_memory.c I found a very under-commented
> smp_wmb() in __split_huge_page_map(). It turns out that it's copied from
> __{pte,pmd,pud}_alloc() but lost the useful comment (or a reference
> thereto).
>
> Now, afaict we're not good, as per that comment. Paul has since
> convinced some of us that compiler writers are pure evil and out to get
> us.

I have seen the glint in their eyes when they discuss optimization
techniques that you would not want your children to know about!

> Therefore we should do what rcu_dereference() does and use
> ACCESS_ONCE()/barrier() followed by smp_read_barrier_depends() every time
> we dereference a page-table pointer.
>
>
> In particular it looks like things like
> mm/memcontrol.c:mem_cgroup_count_precharge(), which use
> walk_page_range() under down_read(&mm->mmap_sem) and can thus be
> concurrent with __{pte,pmd,pud}_alloc() from faults (and possibly
> itself), are quite broken on Alpha and subtly broken for those of us with
> 'creative' compilers.

Looks good to me, though given my ignorance of mm, not sure that counts
for much.

Thanx, Paul

> Should I go do a more extensive audit of page-table walkers, or are we
> happy with the status quo?
>
> ---
>  arch/x86/mm/gup.c |    6 +++---
>  mm/pagewalk.c     |   24 ++++++++++++++++++++++++
>  2 files changed, 27 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> index dd74e46..4958fb1 100644
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -150,7 +150,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>
>  	pmdp = pmd_offset(&pud, addr);
>  	do {
> -		pmd_t pmd = *pmdp;
> +		pmd_t pmd = ACCESS_ONCE(*pmdp);

Here, ACCESS_ONCE() suffices because this is x86-specific code.
Core code would need to worry about Alpha, and would thus also need
smp_read_barrier_depends(), as you in fact do further down.
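
Roughly, the two forms being contrasted (a sketch, not something to apply):

	/* x86-only fast GUP: the compiler is the only thing to defeat. */
	pmd_t pmd = ACCESS_ONCE(*pmdp);

	/* Generic walker: Alpha additionally needs the dependency barrier
	 * before chasing anything reached through the just-loaded entry. */
	pmd_t pmd = ACCESS_ONCE(*pmdp);
	smp_read_barrier_depends();
	ptep = pte_offset_map(&pmd, addr);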

>  		next = pmd_addr_end(addr, end);
>  		/*
> @@ -220,7 +220,7 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
>
>  	pudp = pud_offset(&pgd, addr);
>  	do {
> -		pud_t pud = *pudp;
> +		pud_t pud = ACCESS_ONCE(*pudp);
>
>  		next = pud_addr_end(addr, end);
>  		if (pud_none(pud))
> @@ -280,7 +280,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  	local_irq_save(flags);
>  	pgdp = pgd_offset(mm, addr);
>  	do {
> -		pgd_t pgd = *pgdp;
> +		pgd_t pgd = ACCESS_ONCE(*pgdp);
>
>  		next = pgd_addr_end(addr, end);
>  		if (pgd_none(pgd))
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 6c118d0..2ba2a74 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -10,6 +10,14 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	int err = 0;
>
>  	pte = pte_offset_map(pmd, addr);
> +	/*
> +	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
> +	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
> +	 * actual dereference of p[gum]d, but that's hidden deep within the
> +	 * bowels of {pte,pmd,pud}_offset.
> +	 */
> +	barrier();
> +	smp_read_barrier_depends();
>  	for (;;) {
>  		err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
>  		if (err)
> @@ -32,6 +40,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  	int err = 0;
>
>  	pmd = pmd_offset(pud, addr);
> +	/*
> +	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
> +	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
> +	 * actual dereference of p[gum]d, but that's hidden deep within the
> +	 * bowels of {pte,pmd,pud}_offset.
> +	 */
> +	barrier();
> +	smp_read_barrier_depends();
>  	do {
>  again:
>  		next = pmd_addr_end(addr, end);
> @@ -77,6 +93,14 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
>  	int err = 0;
>
>  	pud = pud_offset(pgd, addr);
> +	/*
> +	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
> +	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
> +	 * actual dereference of p[gum]d, but that's hidden deep within the
> +	 * bowels of {pte,pmd,pud}_offset.
> +	 */
> +	barrier();
> +	smp_read_barrier_depends();
>  	do {
>  		next = pud_addr_end(addr, end);
>  		if (pud_none_or_clear_bad(pud)) {
>
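
If this pattern ends up in more walkers, it might be worth bundling the load
and the barrier the way rcu_dereference() does, along these lines (the name
is made up, and this is only a sketch):

	/*
	 * One ordered load of a page-table entry: ACCESS_ONCE() stops the
	 * compiler from refetching or tearing the load, and
	 * smp_read_barrier_depends() orders the dependent accesses that
	 * follow it (needed on Alpha, a no-op everywhere else).
	 */
	#define page_table_deref(p)				\
	({							\
		typeof(*(p)) ____v = ACCESS_ONCE(*(p));		\
		smp_read_barrier_depends();			\
		____v;						\
	})

That would also address the "ideally we'd use ACCESS_ONCE() on the actual
dereference" part of your comment, at the cost of open-coding what the
{pte,pmd,pud}_offset() helpers currently hide.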
