Re: [PATCH] x86/mm: fix cpu stuck issue in __change_page_attr_set_clr

From: Yang, Bin
Date: Mon Jul 09 2018 - 22:18:53 EST


On Wed, 2018-07-04 at 16:01 +0200, Thomas Gleixner wrote:
> On Wed, 4 Jul 2018, Yang, Bin wrote:
> > e820 table:
> > =================
> >
> > [ 0.000000] BIOS-e820: [mem 0x0000000000000000-
> > 0x000000000009fbff]
> > usable
> > [ 0.000000] BIOS-e820: [mem 0x000000000009fc00-
> > 0x000000000009ffff]
> > reserved
> > [ 0.000000] BIOS-e820: [mem 0x00000000000f0000-
> > 0x00000000000fffff]
> > reserved
> > [ 0.000000] BIOS-e820: [mem 0x0000000000100000-
> > 0x00000000bffdffff]
> > usable
> > [ 0.000000] BIOS-e820: [mem 0x00000000bffe0000-
> > 0x00000000bfffffff]
> > reserved
> > [ 0.000000] BIOS-e820: [mem 0x00000000feffc000-
> > 0x00000000feffffff]
> > reserved
> > [ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-
> > 0x00000000ffffffff]
> > reserved
> > [ 0.000000] BIOS-e820: [mem 0x0000000100000000-
> > 0x000000013fffffff]
> > usable
> >
> > call chain:
> > ======================
> >
> > ...
> > => free_init_pages(what="initrd" or "unused kernel",
> > begin=ffff9b26b....000, end=ffff9b26c....000); begin and end
> > addresses
> > are random. The begin/end value above is just for reference.
> >
> > => set_memory_rw()
> > => change_page_attr_set()
> > => change_page_attr_set_clr()
> > => __change_page_attr_set_clr(); cpa->numpages is 512 on my board
> > if
> > what=="unused kernel"
> > => __change_page_attr()
> > => try_preserve_large_page(); address=ffff9b26bfacf000, pfn=80000,
> > level=3; and the check loop count is 262144, exit loop after 861
> > usecs
> > on my board
>
> You are talking about the static_protections() check, right?
>
> > the actual problem
> > ===================
> > sometimes, free_init_pages returns after hundreds of secounds. The
> > major impact is kernel boot time.
>
> That's the symptom you are observing. The problem is in the
> try_to_preserve_large_page() logic.
>
> The address range from the example above is:
>
> 0xffff9b26b0000000 - 0xffff9b26c0000000
>
> i.e. 256 MB.
>
> So the first call with address 0xffff9b26b0000000 will try to
> preserve the
> 1GB page, but it's obvious that if pgrot changes that this has to
> split the
> 1GB page.
>
> The current code is stupid in that regard simply because it does the
> static_protection() check loop _before_ checking:
>
> 1) Whether the new and the old pgprot are the same
>
> 2) Whether the address range which needs to be changed is aligned
> to and
> covers the full large mapping
>
> So it does the static_protections() loop before #1 and #2 and checks
> the
> full 1GB page wise, which makes it loop 262144 times.
>
> With your magic 'cache' logic this will still loop exactly 262144
> times at
> least on the first invocation because there is no valid information
> in that
> 'cache'. So I really have no idea how your patch makes any difference
> unless it is breaking stuff left and right in very subtle ways.
>
> If there is a second invocation with the same pgprot on that very
> same
> range, then I can see it somehow having that effect by chance, but
> not by
> design.
>
> But this is all voodoo programming and there is a very obvious and
> simple
> solution for this:
>
> The check for pgprot_val(new_prot) == pgprot_val(old_port) can
> definitely
> be done _before_ the check loop. The check loop is pointless in
> that
> case, really. If there is a mismatch anywhere then it existed
> already and
> instead of yelling loudly the checking would just discover it,
> enforce
> the split and that would in the worst case preserve the old wrong
> mapping
> for those pages.
>
> What we want there is a debug mechanism which catches such cases,
> but is
> not effective on production kernels and runs once or twice during
> boot.
>
> The range check whether the address is aligned to the large page
> and
> covers the full large page (1G or 2M) is also obvious to do
> _before_ that
> check, because if the requested range does not fit and has a
> different
> pgprot_val() then it will decide to split after the check anyway.
>
> The check loop itself is stupid as well. Instead of looping in 4K
> steps
> the thing can be rewritten to check for overlapping ranges and then
> check
> explicitely for those. If there is no overlap in the first place
> the
> whole loop thing can be avoided, but that's a pure optimization and
> needs
> more thought than the straight forward and obvious solution to the
> problem at hand.
>
> Untested patch just moving the quick checks to the obvious right
> place
> below.
>
> Thanks,
>
> tglx
>
> 8<-----------------
> --- a/arch/x86/mm/pageattr.c
> +++ b/arch/x86/mm/pageattr.c
> @@ -628,47 +628,61 @@ try_preserve_large_page(pte_t *kpte, uns
> new_prot = static_protections(req_prot, address, pfn);
>
> /*
> - * We need to check the full range, whether
> - * static_protection() requires a different pgprot for one
> of
> - * the pages in the range we try to preserve:
> + * If there are no changes, return. cpa->numpages has been
> updated
> + * above.
> + *
> + * There is no need to do any of the checks below because
> the
> + * existing pgprot of the large mapping has to be correct.
> If it's
> + * not then this is a bug in some other code and needs to be
> fixed
> + * there and not silently papered over by the
> static_protections()
> + * magic and then 'fixed' up in the wrong way.
> */
> - addr = address & pmask;
> - pfn = old_pfn;
> - for (i = 0; i < (psize >> PAGE_SHIFT); i++, addr +=
> PAGE_SIZE, pfn++) {
> - pgprot_t chk_prot = static_protections(req_prot,
> addr, pfn);
> -
> - if (pgprot_val(chk_prot) != pgprot_val(new_prot))
> - goto out_unlock;
> + if (pgprot_val(new_prot) == pgprot_val(old_prot)) {
> + do_split = 0;
> + goto out_unlock;
> }
>
> /*
> - * If there are no changes, return. maxpages has been
> updated
> - * above:
> + * If the requested address range is not aligned to the
> start of
> + * the large page or does not cover the full range, split it
> up.
> + * No matter what the static_protections() check below does,
> it
> + * would anyway result in a split after doing all the check
> work
> + * for nothing.
> */
> - if (pgprot_val(new_prot) == pgprot_val(old_prot)) {
> - do_split = 0;
> + addr = address & pmask;
> + numpages = psize >> PAGE_SHIFT;
> + if (address != addr || cpa->numpages != numpages)
> goto out_unlock;
> - }
>
> /*
> - * We need to change the attributes. Check, whether we can
> - * change the large page in one go. We request a split, when
> - * the address is not aligned and the number of pages is
> - * smaller than the number of pages in the large page. Note
> - * that we limited the number of possible pages already to
> - * the number of pages in the large page.
> + * Check the full range, whether static_protection()
> requires a
> + * different pgprot for one of the pages in the existing
> large
> + * mapping.
> + *
> + * FIXME:
> + * 1) Make this yell loudly if something tries to set a full
> + * large page to something which is not allowed.

I am trying to work out a patch based on your sample code and
comments.
I do not understand here why it needs to yell loudly if set a full
large page to something which is not allowed. It can split the large
page to 4K-pages finally. And static_protection() will also be called
for 4K-page change. Why not just add warning if 4K-page change is not
allowed?

> + * 2) Add a check for catching a case where the existing
> mapping
> + * is wrong already.
> + * 3) Instead of looping over the whole thing, find the
> overlapping
> + * ranges and check them explicitely instead of looping
> over a
> + * full 1G mapping in 4K steps (256k iterations) for
> figuring
> + * that out.
> */
> - if (address == (address & pmask) && cpa->numpages == (psize
> >> PAGE_SHIFT)) {
> - /*
> - * The address is aligned and the number of pages
> - * covers the full page.
> - */
> - new_pte = pfn_pte(old_pfn, new_prot);
> - __set_pmd_pte(kpte, address, new_pte);
> - cpa->flags |= CPA_FLUSHTLB;
> - do_split = 0;
> + pfn = old_pfn;
> + for (i = 0; i < numpages; i++, addr += PAGE_SIZE, pfn++) {
> + pgprot_t chk_prot = static_protections(req_prot,
> addr, pfn);
> +
> + if (pgprot_val(chk_prot) != pgprot_val(new_prot))
> + goto out_unlock;
> }
>
> + /* All checks passed. Just change the large mapping entry */
> + new_pte = pfn_pte(old_pfn, new_prot);
> + __set_pmd_pte(kpte, address, new_pte);
> + cpa->flags |= CPA_FLUSHTLB;
> + do_split = 0;
> +
> out_unlock:
> spin_unlock(&pgd_lock);
>
>
>
>
>