Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads

From: Dev Jain
Date: Tue Jul 01 2025 - 12:23:10 EST



On 01/07/25 6:09 pm, Lorenzo Stoakes wrote:
> On Tue, Jul 01, 2025 at 05:45:51PM +0530, siddhartha@xxxxxxxx wrote:
>> 🧩 1. Does the patch cause VMAs to be merged eventually?
>> You're correct: VMA merging only happens at mmap() time (via
>> __mmap_region()). What the patch affects is the behavior of
>> thp_get_unmapped_area_vmflags() before the mmap is placed.
>> [...]
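
To illustrate the placement effect being discussed, here is a toy model (mine,
not the kernel code) contrasting a policy that PMD-aligns every request with
one that aligns only PMD-multiple lengths; odd-sized requests then pack
back-to-back instead of leaving holes:

/* Toy model of the placement policy discussed in this thread; NOT the
 * kernel implementation (real placement is top-down and far more
 * involved). Shows why PMD-aligning odd-sized requests leaves holes
 * between consecutive allocations. */
#include <stdio.h>

#define PMD_SIZE (2UL << 20)	/* 2 MiB */

/* Where the next mapping starts, given the previous one ended at
 * prev_end. */
static unsigned long place(unsigned long prev_end, unsigned long len,
			   int align_everything)
{
	if (align_everything || (len % PMD_SIZE) == 0)
		return (prev_end + PMD_SIZE - 1) & ~(PMD_SIZE - 1);
	return prev_end;
}

int main(void)
{
	unsigned long lens[] = { 1536UL << 10, 1024UL << 10, 1792UL << 10 };
	int mode, i;

	for (mode = 1; mode >= 0; mode--) {
		unsigned long cur = 0;

		printf(mode ? "align everything:\n"
			    : "align PMD multiples only:\n");
		for (i = 0; i < 3; i++) {
			unsigned long at = place(cur, lens[i], mode);

			printf("  %4lu KiB at %8lx (gap %4lu KiB)\n",
			       lens[i] >> 10, at, (at - cur) >> 10);
			cur = at + lens[i];
		}
	}
	return 0;
}

With the "align everything" policy the three odd-sized requests leave 512 KiB
and 1 MiB holes; with the PMD-multiples-only policy they pack with zero gap.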

>> 📐 2. Why aren’t the VMAs mergeable before the patch?
>> Great question. Even if the VMA flags are identical, gaps introduced by
>> forced alignment from get_unmapped_area() break the precondition for
>> merging:
>> [...]
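
To see that precondition concretely, here is a minimal userspace sketch
(mine, not derived from the patch; MAP_FIXED is used only to make placement
deterministic, and error handling is trimmed): two identical anonymous
mappings placed back-to-back show up in /proc/self/maps as a single merged
VMA, while the same pair separated by a one-page gap stays as two.

/* Merge precondition demo: two anonymous mappings with identical flags
 * merge into one VMA only when they are exactly adjacent. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define SZ (1UL << 20)	/* 1 MiB: deliberately not a PMD multiple */

/* Print the /proc/self/maps line for the VMA starting at addr. */
static void dump_vma_at(void *addr)
{
	char want[32], line[256];
	FILE *f = fopen("/proc/self/maps", "r");

	snprintf(want, sizeof(want), "%lx-", (unsigned long)addr);
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, want, strlen(want)))
			fputs(line, stdout);
	fclose(f);
}

int main(void)
{
	long pg = sysconf(_SC_PAGESIZE);
	char *base, *a;

	/* Case 1: back-to-back -> the kernel merges them into one VMA. */
	base = mmap(NULL, 4 * SZ, PROT_NONE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	a = mmap(base, SZ, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	mmap(base + SZ, SZ, PROT_READ | PROT_WRITE,
	     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	printf("adjacent (expect one 2 MiB VMA):\n");
	dump_vma_at(a);
	munmap(base, 4 * SZ);

	/* Case 2: a one-page hole between them -> two VMAs, no merge. */
	base = mmap(NULL, 4 * SZ, PROT_NONE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	a = mmap(base, SZ, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	mmap(base + SZ + pg, SZ, PROT_READ | PROT_WRITE,
	     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	munmap(base + SZ, pg);	/* turn the hole into a real gap */
	printf("gap (expect two 1 MiB VMAs):\n");
	dump_vma_at(a);
	dump_vma_at(base + SZ + pg);
	return 0;
}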

>> 💡 4. Why this patch complements Rik’s rather than contradicts it:
> I'm really perplexed as to why you felt the need (seemingly via LLM) to
> reply with the explanation I've already provided here...

> There are errors in things you say here, too.

> With respect, please don't do this.

> (I'm the co-maintainer of pretty much all the relevant code here and wrote
> the VMA merge logic you're referring to.)

>> 🤖 3. How does this impact AI workloads like Hugging Face Transformers?
>> Tokenization and dynamic batching create non-deterministic memory
>> allocation patterns:

>> Models like BERT and T5 dynamically allocate intermediate buffers per
>> token length, batch size, and attention window.

>> Hugging Face + ONNX Runtime uses multiple small-ish anonymous mmap()s,
>> often 512KB–1.8MB.

>> These allocations come in bursts, but due to forced alignment the kernel
>> was placing them with artificial gaps, defeating THP eligibility entirely.

>> By not force-aligning non-PMD-sized mappings, we avoid injecting gaps.
>> The result is that:

>> a. VMAs remain adjacent → mergeable

>> b. Physical memory is contiguous → eligible for khugepaged collapse

>> c. THP utilization increases → fewer TLB misses → lower latency → higher
>> throughput
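
The adjacency claim in (a), and the gap-injection claim above it, are easy to
spot-check on any given kernel: burst-allocate odd-sized anonymous buffers,
then walk /proc/self/maps and count how many VMAs they became and how many
gap bytes separate them. A rough diagnostic sketch (mine; the sizes are made
up to mimic the 512KB–1.8MB spread mentioned above, and unrelated mappings
that land in the same address window get counted too):

/* Burst-allocate odd-sized anonymous buffers, then measure how many
 * VMAs they became and how many gap bytes separate them. */
#include <stdio.h>
#include <sys/mman.h>

#define NBUF 64

int main(void)
{
	unsigned long lo = -1UL, hi = 0, start, end, prev_end = 0;
	unsigned long vmas = 0, gap = 0, mapped = 0;
	char line[256];
	FILE *f;
	int i;

	for (i = 0; i < NBUF; i++) {
		/* 512 KiB .. 1728 KiB in 64 KiB steps: no PMD multiples. */
		size_t len = (size_t)(512 + 64 * (i % 20)) << 10;
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED) { perror("mmap"); return 1; }
		mapped += len;
		if ((unsigned long)p < lo)
			lo = (unsigned long)p;
		if ((unsigned long)p + len > hi)
			hi = (unsigned long)p + len;
	}

	f = fopen("/proc/self/maps", "r");
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "%lx-%lx", &start, &end) != 2)
			continue;
		if (end <= lo || start >= hi)
			continue;	/* outside our burst window */
		vmas++;
		if (prev_end && start > prev_end)
			gap += start - prev_end;
		prev_end = end;
	}
	fclose(f);

	printf("%d buffers (%lu KiB) -> %lu VMAs, %lu KiB of gaps\n",
	       NBUF, mapped >> 10, vmas, gap >> 10);
	return 0;
}

A low VMA count with near-zero gap bytes is the merge-friendly outcome; many
VMAs separated by sub-PMD holes is the behavior the patch removes.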

> This is very useful information and it's appreciated! Let's not drown this
> out with restatements of stuff already covered.

>> ⚙️ 5. mTHP note
>> Although this patch doesn’t target mTHP directly, I believe a similar
>> logic tweak could apply there too, especially with shmem-backed workloads
>> (common in model servers using shared tensor memory). I’d be happy to
>> help test any changes proposed there and report the results.
> Dev - could we hold off on any effort to do something like this until I've
> had a chance to refactor THP somewhat? This is already a mess and I'd like
> to avoid us piling on more complexity.
>
> We can revisit this at a later stage.

Yes, of course. I ran a small benchmark on a quick, dumb patch I wrote and
didn't see any measurable perf improvement, probably because the highest THP
order getting chosen is always PMD size.
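
One way to sanity-check which orders are actually being used, assuming the
per-size sysfs stats available on 6.10+ kernels (paths as in
Documentation/admin-guide/mm/transhuge.rst; treat the layout as
version-dependent):

/* Print, per THP size, whether it's enabled and how many anonymous
 * faults actually allocated that order. Assumes the mTHP sysfs layout
 * hugepages-<size>kB/{enabled,stats/anon_fault_alloc}. */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

static void cat(const char *path, const char *label)
{
	char buf[128];
	FILE *f = fopen(path, "r");

	if (f && fgets(buf, sizeof(buf), f))
		printf("  %-18s %s", label, buf);
	if (f)
		fclose(f);
}

int main(void)
{
	const char *base = "/sys/kernel/mm/transparent_hugepage";
	char path[512];
	struct dirent *de;
	DIR *d = opendir(base);

	if (!d) { perror(base); return 1; }
	while ((de = readdir(d))) {
		if (strncmp(de->d_name, "hugepages-", 10))
			continue;
		printf("%s:\n", de->d_name);
		snprintf(path, sizeof(path), "%s/%s/enabled",
			 base, de->d_name);
		cat(path, "enabled:");
		snprintf(path, sizeof(path), "%s/%s/stats/anon_fault_alloc",
			 base, de->d_name);
		cat(path, "anon_fault_alloc:");
	}
	closedir(d);
	return 0;
}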

Out of curiosity, where do you plan to do the refactoring?


>> Thanks again for the detailed discussion. Let me know if you’d like a trace
>> or VMA map from a Hugging Face benchmarked run (happy to generate one
>> locally).
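
For concreteness, the kind of VMA map that would be useful here is just a
per-process roll-up of /proc/<pid>/smaps; a rough sketch (field names as in
Documentation/filesystems/proc.rst):

/* Summarize a process's VMAs and THP usage from /proc/<pid>/smaps:
 * VMA count, anonymous memory, and how much of it is backed by
 * PMD-mapped huge pages (AnonHugePages). Diagnostic sketch only. */
#include <stdio.h>

int main(int argc, char **argv)
{
	char path[64], line[512];
	unsigned long vmas = 0, anon_kb = 0, huge_kb = 0, kb;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/smaps",
		 argc > 1 ? argv[1] : "self");
	f = fopen(path, "r");
	if (!f) { perror(path); return 1; }

	while (fgets(line, sizeof(line), f)) {
		unsigned long s, e;

		if (sscanf(line, "%lx-%lx", &s, &e) == 2)
			vmas++;	/* "start-end perms ..." header line */
		else if (sscanf(line, "Anonymous: %lu kB", &kb) == 1)
			anon_kb += kb;
		else if (sscanf(line, "AnonHugePages: %lu kB", &kb) == 1)
			huge_kb += kb;
	}
	fclose(f);

	printf("%lu VMAs, %lu kB anonymous, %lu kB in THPs (%.1f%%)\n",
	       vmas, anon_kb, huge_kb,
	       anon_kb ? 100.0 * huge_kb / anon_kb : 0.0);
	return 0;
}

Run against the inference process's PID before and after the patch; the
interesting deltas are the VMA count and the THP-backed percentage.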

> Thanks! Much appreciated.

> Cheers, Lorenzo