Re: [v3 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration

From: David Hildenbrand
Date: Mon Oct 30 2023 - 04:01:53 EST


On 30.10.23 08:25, Byungchul Park wrote:
> Implementation of CONFIG_MIGRC, which stands for 'Migration Read Copy'.
> While working with tiered memory, e.g. CXL memory, we always face
> migration overhead at promotion and demotion, and we found that TLB
> shootdown is a large part of it that is worth eliminating if possible.
>
> Fortunately, the TLB flush can be deferred, or even skipped, if both the
> source and destination folios of a migration are kept around until all
> the required TLB flushes have been done. This is only safe if the target
> PTE entries are read-only, or more precisely, do not have write
> permission; otherwise the folio contents could get corrupted.

> To achieve that:
>
> 1. For folios that are mapped only by non-writable TLB entries, prevent
>    the TLB flush at migration time by keeping both the source and
>    destination folios, and handle the flush later, at a better time.
>
> 2. When any non-writable TLB entry becomes writable, e.g. through the
>    fault handler, give up on the CONFIG_MIGRC mechanism and perform the
>    required TLB flush right away.
>
> 3. Temporarily stop migrc from working when the system is under very
>    high memory pressure, e.g. when direct reclaim is needed.
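
If I understand the idea correctly, the core of it is roughly the
following. This is a minimal sketch: the migrc_* helpers and the
folio_maybe_writable() check are made-up names for illustration, not
the actual interfaces from this series.

static bool migrc_try_defer_flush(struct folio *src, struct folio *dst)
{
	/*
	 * Only safe while no PTE mapping the source is writable; a
	 * writable mapping could dirty the source behind the copy.
	 */
	if (folio_maybe_writable(src))		/* hypothetical check */
		return false;

	/* Pin both copies until the batched TLB flush has been done. */
	folio_get(src);
	folio_get(dst);
	migrc_add_pending(src, dst);		/* hypothetical queue */
	return true;				/* flush can be deferred */
}

/*
 * A write fault on one of the read-only entries makes the deferral
 * unsafe: flush immediately and release the kept copies.
 */
static void migrc_flush_now(struct mm_struct *mm)
{
	flush_tlb_mm(mm);			/* flush right away */
	migrc_free_pending(mm);			/* hypothetical release */
}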

> The measurement results:
>
> Architecture - x86_64
> QEMU - KVM enabled, host CPU
> NUMA - 2 nodes (node 0: 16 CPUs, 1 GB; node 1: no CPUs, 8 GB)
> Linux Kernel - v6.6-rc5, NUMA balancing tiering on, demotion enabled
> Benchmark - XSBench -p 50000000 (the -p option makes the runtime longer)
>
> Run 'perf stat' using these events:
>
> 1) itlb.itlb_flush
> 2) tlb_flush.dtlb_thread
> 3) tlb_flush.stlb_any
> 4) dTLB-load-misses
> 5) dTLB-store-misses
> 6) iTLB-load-misses
>
> Run 'cat /proc/vmstat' and pick:
>
> 1) numa_pages_migrated
> 2) pgmigrate_success
> 3) nr_tlb_remote_flush
> 4) nr_tlb_remote_flush_received
> 5) nr_tlb_local_flush_all
> 6) nr_tlb_local_flush_one

> BEFORE - mainline v6.6-rc5
> ------------------------------------------
> $ perf stat -a \
>     -e itlb.itlb_flush \
>     -e tlb_flush.dtlb_thread \
>     -e tlb_flush.stlb_any \
>     -e dTLB-load-misses \
>     -e dTLB-store-misses \
>     -e iTLB-load-misses \
>     ./XSBench -p 50000000
>
> Performance counter stats for 'system wide':
>
>     20953405        itlb.itlb_flush
>     114886593       tlb_flush.dtlb_thread
>     88267015        tlb_flush.stlb_any
>     115304095543    dTLB-load-misses
>     163904743       dTLB-store-misses
>     608486259       iTLB-load-misses
>
>     556.787113849 seconds time elapsed
>
> $ cat /proc/vmstat
>
> ...
> numa_pages_migrated 3378748
> pgmigrate_success 7720310
> nr_tlb_remote_flush 751464
> nr_tlb_remote_flush_received 10742115
> nr_tlb_local_flush_all 21899
> nr_tlb_local_flush_one 740157
> ...
>
> AFTER - mainline v6.6-rc5 + CONFIG_MIGRC
> ------------------------------------------
> $ perf stat -a \
>     -e itlb.itlb_flush \
>     -e tlb_flush.dtlb_thread \
>     -e tlb_flush.stlb_any \
>     -e dTLB-load-misses \
>     -e dTLB-store-misses \
>     -e iTLB-load-misses \
>     ./XSBench -p 50000000
>
> Performance counter stats for 'system wide':
>
>     4353555         itlb.itlb_flush
>     72482780        tlb_flush.dtlb_thread
>     68226458        tlb_flush.stlb_any
>     114331610808    dTLB-load-misses
>     116084771       dTLB-store-misses
>     377180518       iTLB-load-misses
>
>     552.667718220 seconds time elapsed
>
> $ cat /proc/vmstat

So, an improvement of 0.74% in runtime ((556.787 - 552.668) / 556.787)? How stable are these results? Serious question: is it worth the churn?

Or did I get the numbers wrong?

>  #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 5c02720c53a5..1ca2ac91aa14 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -135,6 +135,9 @@ enum pageflags {
>  #ifdef CONFIG_ARCH_USES_PG_ARCH_X
>  	PG_arch_2,
>  	PG_arch_3,
> +#endif
> +#ifdef CONFIG_MIGRC
> +	PG_migrc,	/* Page has its copy under migrc's control */
>  #endif
>  	__NR_PAGEFLAGS,
> @@ -589,6 +592,10 @@ TESTCLEARFLAG(Young, young, PF_ANY)
>  PAGEFLAG(Idle, idle, PF_ANY)
>  #endif
> +#ifdef CONFIG_MIGRC
> +PAGEFLAG(Migrc, migrc, PF_ANY)
> +#endif

I assume you know this: new pageflags are frowned upon.
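
For reference, PAGEFLAG(Migrc, migrc, PF_ANY) expands into the usual
accessor family, roughly (signatures from memory, so take this as a
sketch rather than the exact generated code):

	/* generated folio accessors */
	bool folio_test_migrc(struct folio *folio);
	void folio_set_migrc(struct folio *folio);
	void folio_clear_migrc(struct folio *folio);

	/* legacy struct page wrappers */
	int  PageMigrc(struct page *page);
	void SetPageMigrc(struct page *page);
	void ClearPageMigrc(struct page *page);

So every new flag grows a whole accessor family, on top of consuming
one of the few remaining page flag bits.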

--
Cheers,

David / dhildenb