Re: [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump

From: Dev Jain
Date: Thu Jul 31 2025 - 03:13:16 EST

Next message: Yu Kuai: "Re: [PATCH v2 2/5] mq-deadline: switch to use elevator lock"
Previous message: kernel test robot: "[linux-next:master] [perf stat] faebee18d7: perf-sanity-tests.perf_stat_CSV_output_linter.fail"
In reply to: Catalin Marinas: "Re: [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump"
Next in thread: Catalin Marinas: "Re: [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 30/07/25 10:30 pm, Catalin Marinas wrote:

On Wed, Jul 23, 2025 at 09:48:27PM +0530, Dev Jain wrote:

diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index 12f534e8f3ed..e835fd437ae0 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
/*
* SW table walks can't handle removal of intermediate entries.
*/
- return pud_sect_supported() &&
- !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
+ return pud_sect_supported();

Does the "SW table walks..." comment still make sense?

Anshuman pointed that out, I missed dropping that.

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 00ab1d648db6..49932c1dd34e 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c

[...]

@@ -1301,16 +1319,76 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
}
table = pmd_offset(pudp, addr);
+
+ /*
+ * Our objective is to prevent ptdump from reading a PMD table which has
+ * been freed. Assume that ptdump_walk_pgd() (call this thread T1)
+ * executes completely on CPU1 and pud_free_pmd_page() (call this thread
+ * T2) executes completely on CPU2. Let the region sandwiched by the

I think "executes completely on CPU*" is confusing. Just talk about
threads as they can be migrated between CPUs and the memory model is
preserved.

Okay.

+ * mmap_write_lock/unlock in T1 be called CS (the critical section).
+ *
+ * Claim: The CS of T1 will never operate on a freed PMD table.
+ *
+ * Proof:
+ *
+ * Case 1: The static branch is visible to T2.
+ *
+ * Case 1 (a): T1 acquires the lock before T2 can.
+ * T2 will block until T1 drops the lock, so pmd_free() will only be
+ * executed after T1 exits CS.

This assumes that there is some ordering between unlock and pmd_free()
(e.g. some poisoning of the old page). The unlock only gives us release
semantics, not acquire. It just happens that we have an atomic
dec-and-test down the __free_pages() path but I'm not convinced we
should rely on it unless free_pages() has clear semantics on ordering
related to prior memory writes.

I replied in the other mail to Ryan, hopefully that logic is satisfactory.

+ *
+ * Case 1 (b): T2 acquires the lock before T1 can.
+ * The sequence of barriers issued in __flush_tlb_kernel_pgtable()
+ * ensures that an empty PUD (via pud_clear()) is visible to T1 before
+ * T1 can enter CS, therefore it is impossible for the CS to get hold
+ * of the address of the isolated PMD table.

Is this the effect of the barriers in the TLB flushing or the release
semantics of the unlock?

Yeah you are right :) the release semantics of unlock guarantee that
an empty PUD will be observed before unlock, and the unlock
will have to happen before T1 can enter CS - the associativity leads us
to the conclusion.

+ *
+ * Case 2: The static branch is not visible to T2.
+ *
+ * Since static_branch_enable() (via dmb(ish)) and mmap_write_lock()
+ * have acquire semantics, it is guaranteed that the static branch

Does static_branch_disable() have release semantics? I think it does via
some atomic_cmpxchg() but it's worth spelling it out.

Yes it *should* have, but this proof is already getting complicated so I thought,
let's not mention anything which is not required in the reasoning of the proof :)

+ * will be visible to all CPUs before T1 can enter CS. The static

As I mentioned on a previous version, this visibility is not absolute.
You could say that it will be observed by other CPUs before they observe
the write_lock.

Hmm. I probably am bad at explaining all of this in English :)

+ * branch not being visible to T2 therefore guarantees that T1 has
+ * not yet entered CS .... (i)
+ * The sequence of barriers via __flush_tlb_kernel_pgtable() in T2
+ * implies that if the invisibility of the static branch has been
+ * observed by T2 (i.e static_branch_unlikely() is observed as false),
+ * then all CPUs will have observed an empty PUD ... (ii)
+ * Combining (i) and (ii), we conclude that T1 observes an empty PUD
+ * before entering CS => it is impossible for the CS to get hold of
+ * the address of the isolated PMD table. Q.E.D

OK, I nearly got lost. That's not a straightforward memory ordering with
as we have instruction updates with ISB as part of the TLB flushing. I
concluded last time I looked that this branch patching part works
because we also have kick_all_cpus_sync() as part of the static branch
update.

Stepping back to a simpler model, let's say that the static key is a
variable. I wrote this quick test for herd7 and the Linux kernel memory
model (fairly easy to play with):

-------------------------------------8<---------------------------------------
C pte_free_ptdump

(*
* $ cd tools/memory-model
* $ herd7 -conf linux-kernel.cfg path/to/pte_free_ptdump.litmus
*)

{
pmd = 1; (* allocated pmd *)
pte_page = 1; (* valid pte page pointed at by the pmd *)
}

// pmd_free_pte_page()
P0(spinlock_t *init_mm, int *ptdump_lock_key, int *pmd, int *pte_page)
{
WRITE_ONCE(*pmd, 0); // pmd_clear()
smp_mb();
if (READ_ONCE(*ptdump_lock_key)) { // static_branch() approximation
spin_lock(init_mm); // mmap_read_lock()
spin_unlock(init_mm); // mmap_read_unlock()
}
smp_mb();
WRITE_ONCE(*pte_page, 0); // pte_free_kernel()
}

// ptdump_walk_pgd()
P1(spinlock_t *init_mm, int *ptdump_lock_key, int *pmd, int *pte_page)
{
int val;
int page;

WRITE_ONCE(*ptdump_lock_key, 1); // static_branch_enable()
smp_mb();
spin_lock(init_mm); // mmap_write_lock()
val = READ_ONCE(*pmd);
page = READ_ONCE(*pte_page); // freed pte page?
spin_unlock(init_mm); // mmap_write_unlock()
smp_mb();
WRITE_ONCE(*ptdump_lock_key, 0); // static_branch_disable()
}

exists(1:val=1 /\ 1:page=0)
-------------------------------------8<---------------------------------------

I sprinkled some necessary smp_mb() but in most cases we have
release/acquire semantics. It does show that we need a barrier before
the page freeing. We also need to acquire for enabling the static key
and release for disabling it.

Next step is to assess that replacing the static key variable read/write
with code patching preserves the model.

Thanks for the litmus test!
I ran it and I get the same result. But I still cannot figure out where is the hole
in my proof. I think you raise an objection to Case 2:

T1: T2:

WRITE_ONCE(*pmd, 0) : t1 WRITE_ONCE(*ptdump_lock_key, 1): t3
smp_mb() // flush_tlb_kernel_pgtable cmpxchg(acquire write lock)
smp_mb()
READ_ONCE(*ptdump_lock_key) = 0 : t2 val = READ_ONCE(*pmd) : t4
WRITE_ONCE(*pte_page, 0) page = READ_ONCE(*pte_page)
smp_mb()
cmpxchg(release write lock)

Let t_i be the global timestamps of the execution of the labelled instructions
(note that each labelled instruction is atomic so assigning a timestamp makes sense).

The fact that we read ptdump_lock_key as 0 implies that
t2 < t3 ... (i)
Now,
t1 < t2 because of barrier (ii)
t3 < t4 because of barrier (iii)

So we conclude that t1 < t4, so val will be observed as 0.

Alright I am horribly confused now :)

Next message: Yu Kuai: "Re: [PATCH v2 2/5] mq-deadline: switch to use elevator lock"
Previous message: kernel test robot: "[linux-next:master] [perf stat] faebee18d7: perf-sanity-tests.perf_stat_CSV_output_linter.fail"
In reply to: Catalin Marinas: "Re: [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump"
Next in thread: Catalin Marinas: "Re: [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]