Re: [PATCH v3] arm64: Enable vmalloc-huge with ptdump

From: Dev Jain
Date: Tue Jun 17 2025 - 04:59:10 EST



On 17/06/25 1:42 pm, Ryan Roberts wrote:
On 17/06/2025 04:59, Dev Jain wrote:
On 17/06/25 8:24 am, Dev Jain wrote:
On 16/06/25 11:37 pm, Ryan Roberts wrote:
On 16/06/2025 11:33, Dev Jain wrote:
arm64 disables vmalloc-huge when kernel page table dumping is enabled,
because an intermediate table may be removed, potentially causing the
ptdump code to dereference an invalid address. We want to be able to
analyze block vs page mappings for kernel mappings with ptdump, so to
enable vmalloc-huge with ptdump, synchronize between page table removal in
pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We
use mmap_read_lock and not write lock because we don't need to synchronize
between two different vm_structs; two vmalloc objects running this same
code path will point to different page tables, hence there is no race.

For pud_free_pmd_page(), we isolate the PMD table to avoid taking the lock
512 times again via pmd_free_pte_page().
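
As a rough illustration of why isolating the PMD table helps, the following userspace sketch only counts lock round trips. It is a model under stated assumptions, not the patch itself: every name here (PTRS_PER_TABLE, lock_then_unlock(), naive_round_trips(), isolated_round_trips()) is hypothetical, and it assumes all 512 PMD entries point at PTE tables.

```c
#include <assert.h>

/* Hypothetical model: entries per table, matching the "512" in the text. */
#define PTRS_PER_TABLE 512

static int lock_round_trips;   /* how many times the lock was taken and dropped */
static int tables_freed;       /* how many PTE tables the model "freed" */

/* Stand-in for taking and immediately releasing the lock. */
static void lock_then_unlock(void)
{
    lock_round_trips++;
}

/* Naive path: one round trip for the PMD table, plus one per PTE table. */
int naive_round_trips(void)
{
    lock_round_trips = 0;
    tables_freed = 0;
    lock_then_unlock();                 /* after clearing the PUD entry */
    for (int i = 0; i < PTRS_PER_TABLE; i++) {
        lock_then_unlock();             /* per-PTE-table free path locks again */
        tables_freed++;
    }
    return lock_round_trips;
}

/* Isolated path: once the PUD entry is cleared and one round trip is done,
 * nothing below the PMD table is reachable, so the PTE tables are freed
 * without further locking. */
int isolated_round_trips(void)
{
    lock_round_trips = 0;
    tables_freed = 0;
    lock_then_unlock();                 /* single round trip for the whole subtree */
    for (int i = 0; i < PTRS_PER_TABLE; i++)
        tables_freed++;                 /* no lock needed here in this model */
    return lock_round_trips;
}
```

In this model the naive path performs 513 round trips and the isolated path performs 1, while both free the same 512 PTE tables.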

We implement the locking mechanism using static keys, since the chance
of a race is very small. Observe that the synchronization is needed
to avoid the following race:

CPU1                            CPU2
                        take reference of PMD table
pud_clear()
pte_free_kernel()
                        walk freed PMD table

and similar race between pmd_free_pte_page and ptdump_walk_pgd.

Therefore, there are two cases: if ptdump sees the cleared PUD, then
we are safe. If not, then the patched-in read and write locks help us
avoid the race.
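
The two cases above can be stepped through by hand in a small single-threaded model. This is only a sketch under assumptions: the struct and function names are hypothetical and are not kernel APIs, a boolean stands in for the static key, another boolean stands in for the lock held for write by the ptdump walker, and "would block on the read lock" is modelled as a failed attempt to free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical single-threaded model; none of these names are kernel APIs. */
struct model {
    bool key_enabled;   /* stands in for the static key */
    bool write_locked;  /* stands in for the lock held for write by the walker */
    int *pud_entry;     /* stands in for the PUD entry pointing at a PMD table */
    bool table_freed;   /* set once the PMD table has been freed */
};

/* Walker: enable the key, then take the write lock. */
static void walker_begin(struct model *m)
{
    m->key_enabled = true;
    m->write_locked = true;
}

/* Walker: snapshot the entry; NULL means the entry was already cleared. */
static int *walker_read_entry(const struct model *m)
{
    return m->pud_entry;
}

/* Walker: drop the write lock, then disable the key. */
static void walker_end(struct model *m)
{
    m->write_locked = false;
    m->key_enabled = false;
}

/* Freeing side: clear the entry; the table itself still exists. */
static void freer_clear(struct model *m)
{
    m->pud_entry = NULL;
}

/* Freeing side: returns false where the real code would block on the
 * read lock, true once the table has been freed. */
static bool freer_try_free(struct model *m)
{
    if (m->key_enabled && m->write_locked)
        return false;
    m->table_freed = true;
    return true;
}

int run_interleavings(void)
{
    int table = 42;
    bool ok;

    /* Case 1: the walker reads the entry before it is cleared. */
    struct model a = { false, false, &table, false };
    walker_begin(&a);
    int *seen = walker_read_entry(&a);
    freer_clear(&a);
    ok = freer_try_free(&a);
    assert(seen != NULL && !ok);        /* the free is held back */
    assert(*seen == 42 && !a.table_freed);
    walker_end(&a);
    ok = freer_try_free(&a);
    assert(ok && a.table_freed);        /* the free proceeds after the walk */

    /* Case 2: the entry is cleared and freed before the walker starts. */
    struct model b = { false, false, &table, false };
    freer_clear(&b);
    ok = freer_try_free(&b);
    assert(ok);
    walker_begin(&b);
    assert(walker_read_entry(&b) == NULL);  /* walker sees the cleared entry */
    walker_end(&b);
    return 0;
}
```

Calling run_interleavings() exercises both orderings; in neither one does the walker hold a pointer to a table that has already been freed.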

To implement the mechanism, we need the static key access from mmu.c and
ptdump.c. Note that in case !CONFIG_PTDUMP_DEBUGFS, ptdump.o won't be a
target in the Makefile, therefore we cannot initialize the key there, as
is being done, for example, in the static key implementation of
hugetlb-vmemmap. Therefore, include asm/cpufeature.h, which includes
the jump_label mechanism. Declare the key there and define the key to false
in mmu.c.

No issues were observed with mm-selftests. No issues were observed while
running test_vmalloc.sh in parallel with dumping the kernel pagetable
through sysfs in a loop.

v2->v3:
  - Use static key mechanism

v1->v2:
  - Take lock only when CONFIG_PTDUMP_DEBUGFS is on
  - In case of pud_free_pmd_page(), isolate the PMD table to avoid taking
    the lock 512 times again via pmd_free_pte_page()

Signed-off-by: Dev Jain <dev.jain@xxxxxxx>
---
  arch/arm64/include/asm/cpufeature.h |  1 +
  arch/arm64/mm/mmu.c                 | 51 ++++++++++++++++++++++++++---
  arch/arm64/mm/ptdump.c              |  5 +++
  3 files changed, 53 insertions(+), 4 deletions(-)

[...]

+    pud_clear(pudp);
How can this possibly be correct; you're clearing the pud without any
synchronisation. So you could have this situation:

CPU1 (vmalloc)            CPU2 (ptdump)

                static_branch_enable()
                  mmap_write_lock()
                    pud = pudp_get()
When you do pudp_get(), you won't be dereferencing a NULL pointer.
pud_clear() will nullify the pud entry. So pudp_get() will boil
down to retrieving a NULL entry. Or, pudp_get() will retrieve an
entry pointing to the now isolated PMD table. Correct me if I am
wrong.

pud_free_pmd_page()
   pud_clear()
                    access the table pointed to by pud
                    BANG!
I am also confused thoroughly now : ) This should not go bang as the
table pointed to by pud is still there, and our sequence guarantees that
if the ptdump walk is using the pmd table, then pud_free_pmd_page won't
free the PMD table yet.
You're right... I'm not sure what I was smoking last night. For some reason I
read the pXd_clear() as "free". This approach looks good to me - very clever!
And you even managed to ensure the WRITE_ONCE() in pXd_clear() doesn't get
reordered after taking the lock via the existing dsb in the tlb maintenance
operation - I like it!

Haha! It indeed was very confusing, the important observation separating this
from other cases was that ptdump only cares about reading the tables, not about
what it reads.


I'll send a separate review with some nits, but I'm out today, so that might
have to wait until tomorrow.

Thanks, and sorry again for the noise!

Ah no it was not noise : ) Sure, enjoy.

Ryan