Re: [RFC] futex: prevent endless loop on s390x with emulated hugepages

From: Martin Schwidefsky
Date: Tue Oct 13 2015 - 11:13:07 EST


On Tue, 13 Oct 2015 15:51:22 +0200
Vlastimil Babka <vbabka@xxxxxxx> wrote:

> On 10/13/2015 01:48 PM, Vlastimil Babka wrote:
> > On 09/28/2015 01:49 PM, Martin Schwidefsky wrote:
> >> On Thu, 24 Sep 2015 17:05:48 +0200
> >> Vlastimil Babka <vbabka@xxxxxxx> wrote:
> >
> > [...]
> >
> >>> However, __get_user_pages_fast() is still broken. The get_user_pages_fast()
> >>> wrapper will hide this in the common case. The other user of the __ variant
> >>> is kvm, which is mentioned as the reason for removal of emulated hugepages.
> >>> The call of page_cache_get_speculative() looks also broken in this scenario
> >>> on debug builds because of VM_BUG_ON_PAGE(PageTail(page), page). With
> >>> CONFIG_TINY_RCU enabled, there's plain atomic_inc(&page->_count) which also
> >>> probably shouldn't happen for a tail page...
> >>
> >> It boils down to __get_user_pages_fast being broken for emulated large pages,
> >> doesn't it? My preferred fix would be to get __get_user_pages_fast to work
> >> in this case.
> >
> > I agree, but didn't know enough of the architecture to attempt such fix
> > :) Thanks!
> >
> >> For 3.12 a patch would look like this (needs more testing
> >> though):
> >
> > FWIW it works for me in the particular LTP test, but as you said, it
> > needs more testing and breaking stable would suck.
>
> I'm trying to break the patch on 3.12 with trinity, let's see...
> Tried also to review it, although it's unlikely I'll catch some
> s390x-specific gotchas. For example, can't say what the effect of
> _SEGMENT_ENTRY_CO removal will be - before, the bit was set for
> non-emulated hugepages, and now the same bit is set for emulated ones?
> Or if pmd_bad() was also broken before, and now isn't? But otherwise the
> change seems OK, besides some nitpick below.

_SEGMENT_ENTRY_CO is the segment change-override bit. It allows the
machine to skip storage-key updates for the dirty bit. Linux uses
storage keys only for KVM, which does not allow any kind of large page
to be present. The latest PoP (Principles of Operation) removes the
change-override bit again; it never had an effect. As the bit is
ignored, I can reuse it as the software large page bit.
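
To illustrate the idea of reusing an ignored hardware bit as a software
marker, here is a minimal user-space sketch. It is not the actual patch:
the bit value and all names (seg_entry_t, SEG_ENTRY_HW_IGNORED,
SEG_ENTRY_SWLARGE and the helpers) are hypothetical and do not match the
real s390 definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout, for illustration only (not the s390 values). */
#define SEG_ENTRY_HW_IGNORED UINT64_C(0x0100)     /* bit the hardware ignores */
#define SEG_ENTRY_SWLARGE    SEG_ENTRY_HW_IGNORED /* reused as a software flag */

typedef struct { uint64_t val; } seg_entry_t;

/* Mark an entry as a software (emulated) large page. */
static inline seg_entry_t seg_mk_swlarge(seg_entry_t e)
{
	e.val |= SEG_ENTRY_SWLARGE;
	return e;
}

/* Test whether the software large page flag is set. */
static inline bool seg_is_swlarge(seg_entry_t e)
{
	return (e.val & SEG_ENTRY_SWLARGE) != 0;
}

/* Strip the software flag to recover the remaining entry bits. */
static inline uint64_t seg_strip_swlarge(seg_entry_t e)
{
	return e.val & ~SEG_ENTRY_SWLARGE;
}
```

Because the hardware never interprets the bit, setting or clearing it
only changes what software sees when it inspects the entry.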

> >> @@ -103,7 +104,7 @@ static inline int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr,
> >> unsigned long end, int write, struct page **pages, int *nr)
> >> {
> >> unsigned long next;
> >> - pmd_t *pmdp, pmd;
> >> + pmd_t *pmdp, pmd, pmd_orig;
> >>
> >> pmdp = (pmd_t *) pudp;
> >> #ifdef CONFIG_64BIT
> >> @@ -112,7 +113,7 @@ static inline int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr,
> >> pmdp += pmd_index(addr);
> >> #endif
> >> do {
> >> - pmd = *pmdp;
> >> + pmd = pmd_orig = *pmdp;
> >> barrier();
> >> next = pmd_addr_end(addr, end);
> >> /*
> >> @@ -127,8 +128,9 @@ static inline int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr,
> >> if (pmd_none(pmd) || pmd_trans_splitting(pmd))
> >> return 0;
> >> if (unlikely(pmd_large(pmd))) {
> >> - if (!gup_huge_pmd(pmdp, pmd, addr, next,
> >> - write, pages, nr))
> >> + if (!gup_huge_pmd(pmdp, pmd_orig,
> >> + pmd_swlarge_deref(pmd),
> >> + addr, next, write, pages, nr))
> >> return 0;
> >> } else if (!gup_pte_range(pmdp, pmd, addr, next,
> >> write, pages, nr))
>
> The "pmd" variable isn't changed anywhere in this loop after the initial
> assignment, so the extra "pmd_orig" variable isn't needed.

That is true, I will remove the pmd_orig variable.
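
For reference, the pattern in that loop is "read the shared entry once
into a local, then only look at the local". A minimal user-space sketch
of why a single copy is sufficient, with hypothetical names (entry_t,
ENTRY_LARGE, count_large) that are not kernel code and a GCC/Clang style
compiler barrier assumed:

```c
#include <stdint.h>

typedef struct { uint64_t val; } entry_t;

#define ENTRY_LARGE UINT64_C(0x1)	/* hypothetical flag bit */

/* Compiler barrier (GCC/Clang syntax assumed). */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Count the entries that are flagged large, reading each slot once. */
static int count_large(const volatile entry_t *table, int n)
{
	int large = 0;

	for (int i = 0; i < n; i++) {
		entry_t e;

		e.val = table[i].val;	/* single read of the shared slot */
		barrier();
		/* All checks below use the local copy, which never changes. */
		if (e.val == 0)
			continue;
		if (e.val & ENTRY_LARGE)
			large++;
	}
	return large;
}
```

Since the local copy is never modified after the initial read, a second
variable holding the same value adds nothing.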

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.
