Re: pipe/page fault oddness.

From: Mel Gorman
Date: Thu Oct 02 2014 - 08:45:47 EST


On Wed, Oct 01, 2014 at 09:18:25AM -0700, Linus Torvalds wrote:
> On Wed, Oct 1, 2014 at 9:01 AM, Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > We need to get rid of it, and just make it the same as pte_protnone().
> > And then the real protnone is in the vma flags, and if you actually
> > ever get to a pte that is marked protnone, you know it's a numa page.
>
> So I'd really suggest we do exactly that. Get rid of "pte_numa()"
> entirely, get rid of "_PAGE_[BIT_]NUMA" entirely, and instead add a
> "pte_protnone()" helper to check for the "protnone" case (which on x86
> is testing the _PAGE_PROTNONE bit, and on most other architectures is
> just testing that the page has no access rights).
>

Do not interpret the following as being against the idea of taking the
pte_protnone approach. This is intended to give background.

At the time the changes were made to the _PAGE_NUMA bits it was acknowledged
that a full move to prot_none was an option but it was not the preferred
solution at the time. It replaced one set of corner cases with another and
the last time like this time, there was considerable time pressure. The
VMA would be required to distinguish between a NUMA hinting fault and a
real prot_none bit. In most cases, we have the VMA now with the exception
of GUP. GUP would have to unconditionally go into the slow path to do the
VMA lookup. That is not likely to be a big of a problem but it was a concern.

In early implementations based on prot_none there were some VMA-based
protection checks that had higher overhead. At the time, there were severe
problems with overhead due to NUMA balancing and adding more was not
desirable. This has been addressed since with changes in multiple other
areas so it's much less of a concern now than it was. In the current shape,
these probably is not as much a problem as long as any check on pte_numa
was first guarded by a VMA check. One way of handling the corner cases
where would be to pass in the VMA where available and have a VM_BUG_ON that
fires if its a PROT_NONE VMA. That would catch problems during debugging
without adding overhead in the !debug case.

Going back to the start, the PTE bit was used as the approach due to
concerns that a pte_protnone helper would not work on all architectures,
ppc64 in particular. There was no PROT_NONE bit there and instead prot_none
protections rely on PAGE_USER not being set so it's inaccessible from
userspace. There was discussion at the time that this could conceivably be
broken from some sub-architectures but I don't recall the details. Looking
at the current shape and your patch, it's conceivable that the pte_protnone
could be implemented as a _PAGE_PRESENT && !_PAGE_USER check as long
as it was guarded by a VMA check which x86 requires anyway. Not sure
if that would work for PMDs as I'm not familiar with with ppc64 to tell
offhand. Alternatively, ppc64 would potentially use the bit currently used
for _PAGE_NUMA as a _PROT_NONE bit.

I finished a small series that cleans up a number of issues discovered
recently due to Sasha's testing. It passed light testing and NUMA balancing
works but I cannot comment if they help Dave or Sasha's bugs as I never
managed to reproduce them. I'll post it shortly after I sent this mail.

Again, I'm not opposed to the pte_protnone issue as many of the concerns
I had before have been addressed since or rendered redundant. However,
I won't be able to digest and/or finish your patch in a reasonable time
frame. I'm only partially in work at the moment due to sick (going back
to bed after this mail) and out for a good chunk of next week and the week
after. Minor comments from the patch though

1. ppc64 needs work. Added Aneesh to the cc so he's at least aware
2. That check in pte_offset_kernel can probably go away
3. Ideally, VM_BUG_ON checks should be added to the pte_protnone check to
ensure the VMA checks have already completed to avoid confusion between
NUMA hints and real prot_none protections
4. GUP on prot_none now always hits the slow path but can't think of a
case where that really matters.
5. SWP_OFFSET_SHIFT should be readjusted back if _PAGE_BIT_NUMA is
removed.
6. It probably should at least be rebased on top of "mm: remove
misleading ARCH_USES_NUMA_PROT_NONE" simply on the grounds that
it cleans up some cruft
7. At the risk of pissing you off, pte_protnone_numa might be cleared
at indicating why protnone is being checked at all

> Then we throw away "pte_mknuma()" and "pte_mknonnuma()" entirely,
> because they are brainless sh*t, and we just use
>
> ptent = ptep_modify_prot_start(mm, addr, pte);
> ptent = pte_modify(ptent, newprot);
> ptep_modify_prot_commit(mm, addr, pte, ptent);
>
> reliably instead (where for the mknuma case "newprot" is PROT_NONE,
> and for mknonnuma() it is vma->vm_page_prot. Yes, that means that you
> have to pass in the vma to those functions, but that just makes sense
> anyway.
>
> And if that means that we lose the numa flag on mprotect etc, nobody sane cares.
>

Losing a NUMA hinting fault due to operations like mprotect is not a concern.

--
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/