Re: [PATCH 14/15] mm: numa: Flush TLB if NUMA hinting faults race with PTE scan update

From: Mel Gorman
Date: Fri Dec 06 2013 - 04:24:17 EST


On Thu, Dec 05, 2013 at 03:05:19PM -0500, Rik van Riel wrote:
> On 12/05/2013 02:54 PM, Mel Gorman wrote:
>
> >I think that's a better fit and a neater fix. Thanks! I think it barriers
> >more than it needs to (definite cost vs maybe cost), the flush can be
> >deferred until we are definitely trying to migrate and the pte case is
> >not guaranteed to be flushed before migration due to pte_mknonnuma causing
> >a flush in ptep_clear_flush to be avoided later. Mashing the two patches
> >together yields this.
>
> I think this would fix the numa migrate case.
>

Good. So far I have not been seeing any problems with it at least.

> However, I believe the same issue is also present in
> mprotect(..., PROT_NONE) vs. compaction, for programs
> that trap SIGSEGV for garbage collection purposes.
>

I'm not 100% convinced we need to be concerned with races with
mprotect(PROT_NONE) and a parallel reference to that area from userspace. I
would consider it to be a buggy application if two threads were not
co-ordinating the protection of a region and referencing it. I would also
expect garbage collectors to be managing smart pointers and using reference
counting to copy between heap generations (or similar mechanisms) instead
of trapping SIGSEGV.
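
For reference, the SIGSEGV-trapping pattern Rik describes looks roughly like
the sketch below. It is a minimal single-threaded illustration and is not
taken from any real collector: the names guarded_page, on_segv and
run_guard_demo are made up for this example. It also calls mprotect() from
the signal handler, which works on Linux in practice but is not on the POSIX
list of async-signal-safe functions.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* All names below are hypothetical and exist only for this illustration. */
static char *guarded_page;
static size_t guarded_len;
static volatile sig_atomic_t fault_count;

/* On a fault inside the guarded page, lift the protection and count it. */
static void on_segv(int sig, siginfo_t *info, void *ctx)
{
	char *addr = (char *)info->si_addr;

	(void)sig;
	(void)ctx;
	if (addr >= guarded_page && addr < guarded_page + guarded_len &&
	    mprotect(guarded_page, guarded_len, PROT_READ | PROT_WRITE) == 0) {
		fault_count++;
		return;		/* the faulting store is restarted */
	}
	_exit(1);		/* unrelated fault: give up */
}

/* Returns the number of faults trapped, or -1 on failure. */
int run_guard_demo(void)
{
	struct sigaction sa, old;
	volatile char *p;
	int faults;

	guarded_len = (size_t)sysconf(_SC_PAGESIZE);
	guarded_page = mmap(NULL, guarded_len, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (guarded_page == MAP_FAILED)
		return -1;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_segv;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSEGV, &sa, &old) != 0)
		return -1;

	fault_count = 0;
	p = guarded_page;
	p[0] = 1;		/* page is writable: no fault */
	if (mprotect(guarded_page, guarded_len, PROT_NONE) != 0)
		return -1;
	p[0] = 2;		/* traps once, then the store completes */

	faults = (p[0] == 2) ? (int)fault_count : -1;
	sigaction(SIGSEGV, &old, NULL);
	munmap(guarded_page, guarded_len);
	return faults;
}
```

Here the protection change and the access happen in the same thread, so
there is no race. The case under discussion is a second thread touching the
page while mprotect(PROT_NONE) is still in flight.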

Intel's architectural manual 3A covers what happens for delayed TLB
invalidations in section 4.10.4.4 (at least in the version I'm looking
at). The following two snippets are the most important:

	Software developers should understand that, between the modification
	of a paging-structure entry and execution of the invalidation
	instruction recommended in Section 4.10.4.2, the processor may
	use translations based on either the old value or the new value
	of the paging-structure entry. The following items describe some
	of the potential consequences of delayed invalidation:

	o If a paging-structure entry is modified to change the P flag
	  from 1 to 0, an access to a linear address whose translation is
	  controlled by this entry may or may not cause a page-fault exception.

	o If a paging-structure entry is modified to change the R/W flag
	  from 0 to 1, write accesses to linear addresses whose translation is
	  controlled by this entry may or may not cause a page-fault exception.

An access after the PROT_NONE change may still succeed until the deferred
TLB flush has happened. In a race with mprotect(PROT_NONE) it'll either
complete the access or receive a SIGSEGV signal due to failed protections,
but this is pretty much expected, if unpredictable.
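
The window between the PTE update and the deferred flush can be pictured
with a toy model. This is only a sketch: toy_pte, toy_tlb and the helpers
are hypothetical and do not correspond to kernel or hardware structures. It
models just the case where the stale translation is still cached; real
hardware may drop the entry at any time, which is why the manual says "may
or may not".

```c
#include <stdbool.h>

/* Toy model: these types and functions are hypothetical, not kernel code. */
struct toy_pte { bool present; bool writable; };
struct toy_tlb { bool valid; struct toy_pte cached; };

/* Returns true if a write would complete, false if it would fault. */
bool toy_write(struct toy_tlb *tlb, const struct toy_pte *pte)
{
	if (!tlb->valid) {
		if (!pte->present)
			return false;	/* nothing to cache: fault */
		tlb->cached = *pte;	/* TLB fill from the page table */
		tlb->valid = true;
	}
	return tlb->cached.writable;	/* may be a stale translation */
}

/* Stand-in for the PTE update done by mprotect(PROT_NONE). */
void toy_protnone(struct toy_pte *pte)
{
	pte->present = false;
	pte->writable = false;
}

/* Stand-in for the deferred TLB flush. */
void toy_flush(struct toy_tlb *tlb)
{
	tlb->valid = false;
}
```

In this model a write issued after toy_protnone() but before toy_flush()
still completes through the cached entry. Only after the flush does the
same write fault.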

I do not think the present bit gets cleared on mprotect(PROT_NONE) due
to the relevant bits being:

#define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY)
#define PAGE_NONE	__pgprot(_PAGE_PROTNONE | _PAGE_ACCESSED)

If the present bit remains then compaction should flush the TLB on the
call to ptep_clear_flush, as the pte_accessible check is based on the
present bit. So even though it is possible for a write to complete during
a call to mprotect(PROT_NONE), the same is not true for compaction.

> They could lose modifications done in-between when
> the pte was set to PROT_NONE, and the actual TLB
> flush, if compaction moves the page around in-between
> those two events.
>
> I don't know if this is a case we need to worry about
> at all, but I think the same fix would apply to that
> code path, so I guess we might as well make it...

I might be going "la la la la we're fine" and deluding myself but we
appear to be covered here and it would be a shame to add expense to a
path unnecessarily.

--
Mel Gorman
SUSE Labs