Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads

From: Dev Jain
Date: Mon Jun 30 2025 - 01:26:53 EST



On 30/06/25 6:13 am, siddhartha@xxxxxxxx wrote:
On 2025-06-28 09:19, Dev Jain wrote:
On 27/06/25 9:00 pm, Lorenzo Stoakes wrote:
+cc Vlasta

On Fri, Jun 27, 2025 at 04:09:16PM +0530, siddhartha@xxxxxxxx wrote:
Hi all,

I wanted to share validation data from a Hugging Face-based AI inferencing
workload,
which was significantly impacted by the THP alignment logic introduced in
commit efa7df3e3bb5.

Using transformer models with dynamic input lengths on Intel Xeon (Cooper
Lake),
we observed up to a 3200% throughput improvement after applying the patch
from Oct 2024:

   mm: limit THP alignment of anonymous mappings to PMD-aligned sizes

All congratulations are owed to Vlastimil Babka for doing this, cc'd :)

I gather he enjoys novelty beer mugs as tokens of thanks ;)

I was wondering how the change can get us such a big optimization - the
alignment causes us to gain at most 1 extra PMD-THP mapping. Is there
something else I am missing?

I ask because, while reading the code, I was wondering whether a similar
change could be done for mTHPs.


Metrics:
- Model: BERT-base
- Inference engine: Transformers + ONNX Runtime
- Kernel: 6.6 vs patched 6.6.8
- Batch size: 8-32, input length: 64-512 tokens
- Metric: inference throughput (samples/sec)

Thanks for the fix -- this change had real impact on a production-relevant
workload.

Best Regards,
Siddhartha Sharma
ISV @ Kenip
Solution Link: https://www.intel.com/content/www/us/en/partner/showcase/offering/a5bHo00000045YUIAY/deadlock-clearance.html


Hi Dev Jain,

Thank you for reviewing and for your thoughtful question.

You're absolutely right that, in isolation, gaining one additional PMD-THP mapping wouldn't explain a 3200% speedup. The improvement comes from how the original PMD alignment logic interacted with the dynamic memory allocation patterns of our use case: Hugging Face Transformers inference workloads with dynamic input sizes and many allocations, where it set off a cascade of side effects.

The workloads were running on Intel Developer Cloud, and I no longer have access to that particular environment or the original profiling output. However, I'd like to highlight why this patch had such an outsized effect:

🔹 1. Fragmentation Avoidance
In model shard loading (e.g., large BERT or GPT-2 models split into multiple memory segments), many medium-sized anonymous allocations occur in rapid succession. These workloads tend to allocate many 512 KB – 1.5 MB buffers dynamically (token buffers, intermediate tensors). Forcing each mapping to start on a PMD boundary, even when its length was not PMD-aligned, left gaps between them and defeated natural coalescing into a single THP. For example, a 1.5 MB buffer placed at a 2 MB boundary ends 512 KB short of the next boundary, so the next aligned buffer starts 512 KB later than it otherwise would, leaving a hole that can never hold a huge page and that prevents the two mappings from merging.

🔹 2. TLB Aliasing and Cache Index Pressure

These fragmented mappings caused frequent TLB misses and poor L1/L2 cache reuse.

The result was what looked like "memory thrashing," with slow memory access dominating total inference time. When every mapping is PMD-aligned but not PMD-sized, the gaps between mappings prevent Transparent Huge Pages (THPs) from being applied effectively, breaking THP coalescence and leaving fragmented page tables and higher memory overhead per shard.

🔹 3. Latency & Throughput Penalty from Memory Misalignment
This leads to higher TLB miss rates, especially under multi-threaded load, which dramatically slows down token embedding and attention calculations.

When loading model shards, memory initialization becomes cache-unfriendly, with poor reuse across cores.

This affects not only inference latency but also model cold-start time — which is critical in autoscaling deployments.

🔹 4. Qualitative Observation
Without this patch: shard loading stuttered, warm-up was slow, and we saw CPU cycles dominated by page_fault and TLB miss handlers.

With this patch: shard loading smoothed out, THPs were correctly applied (based on smaps), and throughput shot up by an order of magnitude.

🔹 5. Measured Impact
On Intel Xeon (Cooper Lake), the 6.6.0 kernel, which still PMD-aligned mappings of non-PMD-aligned sizes, showed 11–32× lower inference throughput than the patched kernel.

With the patched kernel (which skips alignment unless the length is PMD-aligned), memory layout was contiguous again and THP was consistently utilized.
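
To make the patched behaviour concrete, here is roughly how I understand the check now working. This is a paraphrase of the description above, from memory, not the actual upstream diff, so please treat it as a sketch:

    if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
        IS_ALIGNED(len, PMD_SIZE)) {
            /*
             * The length spans whole PMDs, so aligning the start
             * cannot leave a tail gap behind the mapping; THP
             * alignment is worth doing.
             */
            addr = thp_get_unmapped_area(filp, addr, len, pgoff, flags);
    }
    /* Otherwise fall back to normal placement, so neighbouring
     * mappings stay packed. */

In other words, the alignment introduced by efa7df3e3bb5 is now only applied when it cannot create the gaps described above.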

This isn’t about one extra THP — it’s about preventing widespread THP fragmentation and the resulting dramatic cache/TLB degradation. For AI workloads with high concurrency and dynamic shapes, this small patch has a massive effect on layout and locality.

So, it's not just “1 more huge page” — it's avoiding massive fragmentation that leads to:

1. TLB miss storms

2. Poor locality

3. Cache index thrashing

4. Degraded latency and throughput

This applies across many adjacent, odd-length allocations typical of AI inference workloads.

The original alignment logic created a pattern of broken contiguity — defeating THP benefits altogether.

In AI workloads using Hugging Face Transformers, model shards and intermediate tensors are dynamically allocated during inference. These allocations often fall just below or above the 2MB threshold that THP relies on. Misalignment or forced alignment to PMD boundaries causes fragmentation and disrupts huge page coalescence, affecting performance.

📊 Memory Allocation Pattern Diagram

Without Patch (PMD Alignment Forced):

|<--2MB-->|<--Gap-->|<--2MB-->|<--Gap-->|<--2MB-->|
| Alloc A |         | Alloc B |         | Alloc C |

Each allocation is PMD-aligned, even if it’s not PMD-sized

Gaps prevent THP coalescence → TLB/cache fragmentation

With Patch (PMD Alignment Conditional):

|<---------6MB Contiguous Region--------->|
|  Alloc A  | Alloc B | Alloc C | Padding |

Contiguous anonymous memory region

Coalesced into one or more THPs

Improved locality and TLB efficiency

While I regret not having the raw perf output at hand, I’d be happy to replicate a similar test locally and share reproducible results if helpful.
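
In the meantime, here is the rough shape of the reproducer I have in mind: a standalone program that makes a series of anonymous mappings just over the 2 MB threshold (the sizes and counts below are illustrative, not the exact Hugging Face allocation pattern) and then reports how much of the process ended up backed by anonymous huge pages. Running it on a pre-patch and a post-patch kernel and comparing the AnonHugePages totals, plus the gaps visible in /proc/self/maps, should show the layout difference directly:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    /*
     * Illustrative numbers only: each buffer is just above the 2 MB
     * threshold, so its length is not PMD-aligned, mirroring the
     * allocation pattern described earlier in this thread.
     */
    #define NR_BUFS  128
    #define BUF_SIZE ((2UL * 1024 * 1024) + (512UL * 1024))  /* 2.5 MB */

    int main(void)
    {
            for (int i = 0; i < NR_BUFS; i++) {
                    char *p = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                    if (p == MAP_FAILED) {
                            perror("mmap");
                            return 1;
                    }
                    memset(p, 0xaa, BUF_SIZE);  /* fault the pages in */
            }

            /* Report THP coverage for the whole process. */
            FILE *f = fopen("/proc/self/smaps_rollup", "r");
            if (!f) {
                    perror("fopen");
                    return 1;
            }
            char line[128];
            while (fgets(line, sizeof(line), f))
                    if (strstr(line, "AnonHugePages"))
                            fputs(line, stdout);
            fclose(f);
            return 0;
    }

Faulting the memory in with memset matters: anonymous THPs are only allocated once the pages are actually touched, so skipping it would make both kernels look identical.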

Best Regards,

Siddhartha Sharma

Thanks for your detailed explanation! I had misunderstood the optimization you were talking about as coming from efa7df3e3bb5 itself; instead it comes from limiting the alignment that commit introduced. Your explanation makes a lot of sense!


For this workload, do you enable mTHPs on your system? My plan is to make a similar patch for the mTHP case, and I'd be grateful if you could get me some results : )