Re: pipe/page fault oddness.

From: Linus Torvalds
Date: Wed Oct 01 2014 - 12:01:48 EST


On Wed, Oct 1, 2014 at 1:19 AM, Hugh Dickins <hughd@xxxxxxxxxx> wrote:
>
> Irrelevance follows...

Maybe not irrelevant.

> There *appears* to be a risk of hitting the VM_BUG_ON, or with no
> VM_BUG_ON (as in 3.17-rc) pte_mknuma proceeding to add _PAGE_NUMA
> to _PAGE_PROTNONE - making the pte then fail the pte_numa test,
> but pass the pte_special test, hence fail the vm_normal_page test:
> when coming from change_prot_numa serving MPOL_MF_LAZY for mbind.

Ugh, yes. The whole _PAGE_NUMA is still a f*cking mess. I hate it.
Hate it, hate it, hate it.

We need to get rid of it, and just make it the same as pte_protnone().
And then the real protnone is in the vma flags, and if you actually
ever get to a pte that is marked protnone, you know it's a numa page.

Seriously.

I never understood what the objection to that was, but every time I
tell people to do it, they go crazy and think _PAGE_NUMA makes sense.
It doesn't. There's no excuse. Rik, what was your broken excuse again?
Something to do with Powerpc, but it is obviously not true, since
powerpc supports protnone just fine.

Even our own comments are confused, with include/asm-generic/pgtable.h saying:

* _PAGE_NUMA works identical to _PAGE_PROTNONE (it's actually the
* same bit too).

but no, it's not the same bit.

Can we please just get rid of _PAGE_NUMA. There is no excuse for it.

> However, that would still not explain Dave's endless refaulting;

Why not? You start out with a PROTNONE, trigger shrink_page_list() on
a hugepage,.which calls add_to_swap(), which does
split_huge_page_to_list(), which in turn calls __split_huge_page(),
and that turns (_PAGE_PROTNONE) into (_PAGE_PROTNONE|_PAGE_NUMA),
which you will then fault on forever, because the kernel thinks the
page is present, but not a NUMA page.

IOW, it's *exactly* the same f*cking confusion between _PAGE_NUMA and
_PAGE_PROTNONE I've complained about before.

> Some time wasted on that, but I learnt a valuable debugging technique:
> #undef EINVAL
> #define EINVAL __LINE__

Wow. There's a certain beauty in the pure crazyness.

However, be careful: our IS_ERR() handling knows that error numbers
are < 4096. So on a big file, and error pointers, that doesn't work
reliably.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/