Re: [PATCH v2 0/9] x86/clear_huge_page: multi-page clearing

From: Raghavendra K T
Date: Thu Sep 07 2023 - 22:19:26 EST


On 9/3/2023 1:44 PM, Mateusz Guzik wrote:
On Wed, Aug 30, 2023 at 11:49:49AM -0700, Ankur Arora wrote:
This series adds a multi-page clearing primitive, clear_pages(),
which enables more effective use of x86 string instructions by
advertising the real region-size to be cleared.

Region-size can be used as a hint by uarchs to optimize the
clearing.

Also add allow_resched() which marks a code section as allowing
rescheduling in the irqentry_exit path. This allows clear_pages()
to get by without having to call cond_resched() periodically.
(preempt_model_full() already handles this via
irqentry_exit_cond_resched(), so we handle this similarly for
preempt_model_none() and preempt_model_voluntary().)
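
(Illustrative only, not the series' actual code: the core of the idea is
roughly the sketch below, where the whole extent is handed to a single
string instruction so the CPU sees the real region size. The helper name
and the PAGE_SHIFT value are assumptions.)

/*
 * Minimal sketch: clear npages contiguous pages with one REP STOSB so
 * the CPU is told the real extent instead of getting one PAGE_SIZE
 * clear per call.
 */
#define PAGE_SHIFT	12	/* assumption: 4K base pages */

static inline void clear_pages_sketch(void *dst, unsigned long npages)
{
	unsigned long len = npages << PAGE_SHIFT;

	asm volatile("rep stosb"
		     : "+D" (dst), "+c" (len)
		     : "a" (0UL)
		     : "memory");
}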

Performance
==

With this, demand-fault performance gets a decent increase:

*Milan*       mm/clear_huge_page   x86/clear_huge_page   change
                    (GB/s)               (GB/s)
pg-sz=2MB           14.55                19.29            +32.5%
pg-sz=1GB           19.34                49.60           +156.4%

Milan (and some other AMD Zen uarchs tested) take advantage of the
hint to elide cacheline allocation for pg-sz=1GB. The cut-off for
this optimization seems to be at around region-size > LLC-size so
the pg-sz=2MB load still allocates cachelines.


Have you benchmarked clzero? It is an AMD-specific instruction issuing
non-temporal stores. It is definitely something to try out for 1G pages.

One would think rep stosq has to be at least not worse since the CPU is
explicitly told what to do and is free to optimize it however it sees
fit, but the rep prefix has a long history of underperforming.

I'm not saying it is going to be better, but that this should be tested,
albeit one can easily argue this can be done at a later date.

I would do it myself but my access to AMD CPUs is limited.


Hello Mateusz,

I plugged in CLZERO unconditionally (even for the coherent path, with an
sfence) in my earlier experiments on top of this series.
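
For reference, a CLZERO-based clearing loop is roughly of the shape
below (an illustrative sketch, not the exact code used; it assumes
64-byte cache lines and a cacheline-aligned, cacheline-multiple extent):

/*
 * CLZERO zeroes the cache line containing the byte addressed by rAX
 * using weakly-ordered (non-temporal) stores, so an SFENCE is issued
 * at the end before the zeroed region is handed out. Needs an
 * assembler that knows the clzero mnemonic.
 */
static void clear_region_clzero(void *addr, unsigned long len)
{
	char *p = addr;
	char *end = p + len;

	while (p < end) {
		asm volatile("clzero" : : "a" (p) : "memory");
		p += 64;	/* assumption: 64-byte cache lines */
	}

	asm volatile("sfence" : : : "memory");
}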

Test: use mmap(MAP_HUGETLB) to demand-fault a 64GB region (NUMA node 0),
for both base hugepage sizes, 2MB and 1GB (a rough sketch of the test
follows below).

perf stat -r 10 -d -d numactl -m 0 -N 0 <test>

SUT: AMD Bergamo, 2 nodes/2 sockets, 128 cores per socket.
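
The test is essentially of the shape sketched here (assumed code; the
actual map_hugetlb/map_hugetlb_1G programs may differ): mmap a 64GB
MAP_HUGETLB region and write one byte per huge page so that every page
is demand-faulted, and therefore cleared, by the kernel.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SZ_64G		(64UL << 30)
#ifndef MAP_HUGETLB
#define MAP_HUGETLB	0x40000
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB	(30 << 26)	/* log2(1GB) << MAP_HUGE_SHIFT */
#endif

int main(int argc, char **argv)
{
	/* "1G" selects 1GB pages; otherwise the default hugepage size (2MB here) */
	int huge_1g = (argc > 1 && !strcmp(argv[1], "1G"));
	size_t step = huge_1g ? (1UL << 30) : (2UL << 20);
	char *p = mmap(NULL, SZ_64G, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
		       (huge_1g ? MAP_HUGE_1GB : 0), -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* one write per huge page => one demand fault (and one clear) per page */
	for (size_t off = 0; off < SZ_64G; off += step)
		p[off] = 1;

	return 0;
}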

From that, the time taken (with CLZERO) is:
for 2MB: 1.092125 s
for 1GB: 0.997661 s

So overall, the results for the 64GB-region experiment look like this:
Time taken for the 64GB region, in seconds (lower = better)

page-size   base       patched  (gain%)     patched-clzero (gain%)
2M          5.0779     2.50623  (50.64)     1.092125       (78)
1G          2.50623    1.012439 (59.60)     0.997661       (60)

In summary, I see further improvement even for the 2MB base size (2.5x).

Overall, CLZERO clearing is promising, but on top of the current series
we may need threshold tuning and hint passing, as done in Ankur's earlier series:
Link: https://lore.kernel.org/lkml/20220606202109.1306034-1-ankur.a.arora@xxxxxxxxxx/
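
Purely as a sketch of the kind of thresholding meant here (made-up
names, building on the sketches above rather than on either series):

/*
 * Hypothetical selection: use non-temporal CLZERO clearing only past a
 * tunable cutoff (somewhere around LLC size, per the cover letter's
 * observation), so smaller clears keep the cache-allocation benefit of
 * an ordinary cached rep stos.
 */
static unsigned long clear_nt_threshold;	/* tunable; needs experiments */

static void clear_extent(void *addr, unsigned long len)
{
	if (clear_nt_threshold && len >= clear_nt_threshold)
		clear_region_clzero(addr, len);		/* non-temporal, sketched earlier */
	else
		clear_pages_sketch(addr, len >> PAGE_SHIFT);	/* cached rep stosb */
}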

I need to experiment further with different chunk sizes as well as base
sizes (for both clzero and rep stos).

Thanks and Regards
- Raghu

Run Details:
Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):

996.34 msec task-clock # 0.999 CPUs utilized ( +- 0.02% )
2 context-switches # 2.007 /sec ( +- 21.34% )
0 cpu-migrations # 0.000 /sec
212 page-faults # 212.735 /sec ( +- 0.20% )
3,116,497,471 cycles # 3.127 GHz ( +- 0.02% ) (35.66%)
100,343 stalled-cycles-frontend # 0.00% frontend cycles idle ( +- 16.85% ) (35.75%)
1,369,118 stalled-cycles-backend # 0.04% backend cycles idle ( +- 3.45% ) (35.86%)
4,325,987,025 instructions # 1.39 insn per cycle
# 0.00 stalled cycles per insn ( +- 0.02% ) (35.87%)
1,078,119,163 branches # 1.082 G/sec ( +- 0.01% ) (35.87%)
87,907 branch-misses # 0.01% of all branches ( +- 5.22% ) (35.83%)
12,337,100 L1-dcache-loads # 12.380 M/sec ( +- 5.44% ) (35.74%)
280,300 L1-dcache-load-misses # 2.48% of all L1-dcache accesses ( +- 5.74% ) (35.64%)
1,464,549 L1-icache-loads # 1.470 M/sec ( +- 1.61% ) (35.63%)
30,659 L1-icache-load-misses # 2.12% of all L1-icache accesses ( +- 3.30% ) (35.62%)
17,366 dTLB-loads # 17.426 K/sec ( +- 5.52% ) (35.63%)
11,774 dTLB-load-misses # 81.79% of all dTLB cache accesses ( +- 7.94% ) (35.63%)
0 iTLB-loads # 0.000 /sec (35.63%)
2 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +-342.39% ) (35.64%)

0.997661 +- 0.000150 seconds time elapsed ( +- 0.02% )


Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb' (10 runs):

1,089.97 msec task-clock # 0.998 CPUs utilized ( +- 0.03% )
3 context-switches # 2.750 /sec ( +- 15.11% )
0 cpu-migrations # 0.000 /sec
32,917 page-faults # 30.172 K/sec ( +- 0.00% )
3,408,713,422 cycles # 3.124 GHz ( +- 0.03% ) (35.60%)
982,417 stalled-cycles-frontend # 0.03% frontend cycles idle ( +- 2.77% ) (35.60%)
8,495,409 stalled-cycles-backend # 0.25% backend cycles idle ( +- 6.12% ) (35.59%)
4,970,939,278 instructions # 1.46 insn per cycle
# 0.00 stalled cycles per insn ( +- 0.04% ) (35.64%)
1,196,644,653 branches # 1.097 G/sec ( +- 0.03% ) (35.73%)
196,584 branch-misses # 0.02% of all branches ( +- 2.79% ) (35.78%)
226,254,284 L1-dcache-loads # 207.388 M/sec ( +- 0.23% ) (35.78%)
1,161,607 L1-dcache-load-misses # 0.52% of all L1-dcache accesses ( +- 3.27% ) (35.78%)
21,757,775 L1-icache-loads # 19.943 M/sec ( +- 0.66% ) (35.77%)
165,503 L1-icache-load-misses # 0.78% of all L1-icache accesses ( +- 3.11% ) (35.78%)
1,118,573 dTLB-loads # 1.025 M/sec ( +- 1.38% ) (35.78%)
415,943 dTLB-load-misses # 37.10% of all dTLB cache accesses ( +- 1.12% ) (35.78%)
36 iTLB-loads # 32.998 /sec ( +- 18.47% ) (35.74%)
49,785 iTLB-load-misses # 270570.65% of all iTLB cache accesses ( +- 0.34% ) (35.65%)

1.092125 +- 0.000350 seconds time elapsed ( +- 0.03% )