Re: [PATCH v4 3/9] mm: pagewalk: Don't split transhuge pmds when a pmd_entry is present

From: Thomas HellstrÃm (VMware)
Date: Wed Oct 09 2019 - 18:31:05 EST


On 10/9/19 10:20 PM, Linus Torvalds wrote:
On Wed, Oct 9, 2019 at 1:06 PM Thomas HellstrÃm (VMware)
<thomas_os@xxxxxxxxxxxx> wrote:
On 10/9/19 9:20 PM, Linus Torvalds wrote:
Don't you get it? There *is* no PTE level if you didn't split.
Hmm, This paragraph makes me think we have very different perceptions about what I'm trying to achieve.
It's not about what you're trying to achieve.

It's about the actual code.

You cannot do that

- split_huge_pmd(walk->vma, pmd, addr);
+ if (!ops->pmd_entry)
+ split_huge_pmd(walk->vma, pmd, addr);
it's insane.

You *have* to call split_huge_pmd() if you're doing to call the
pte_entry() function.

I don't understand why you are arguing. This is not about "feelings"
and "intentions" or about "trying to achieve".

This is about cold hard "you can't do that", and this is now the third
time I tell you _why_ you can't do that: you can't walk the last level
if you don't _have_ a last level. You have to split the pmd to do so.
It's not so much arguing but rather trying to understand your concerns and your perception of what the final code should look like.

End of story.

So is it that you want pte_entry() to be strictly called for *each* virtual address, even if we have a pmd_entry()?
In that case I completely follow your arguments, meaning we skip this patch completely?

My take on the change was that pmd_entry() returning 0 would mean we could actually skip the pte level completely and nothing would otherwise pass down to the next level for which split_huge_pmd() wasn't a NOP, similar to how pud_entry does things. FWIW, see

https://lore.kernel.org/lkml/20191004123732.xpr3vroee5mhg2zt@xxxxxxxxxxxxxxxxx/

and we could in the long run transform the pte walk in many pmd_entry callbacks into pte_entry callbacks.



I wanted the pte level to *only* get called for *pre-existing* pte entries.
Again, I told you what the solution was.

But the fact is, it's not what your other code even wants or does.

Seriously. You have two cases you care about in your callbacks

- an actual hugepmd. This is an ERROR for you and you do a huge
WARN_ON() for it to let people know.
No, it's typically a NOP, since the hugepmd should be read-only. Write-enabled huge pages are split in fault().

- regular pages. This is what your other code actually handles.

So for the hugepomd case, you have two choices:

- handle it by splitting and deal with the regular pages: "return 0;"
Well, this is not what we want to do, and the reason we have the patch in the first place.

- actually error out: "return -EINVAL".

/Thomas