[PATCH V3 0/4] sched/numa: Enhance vma scanning

From: Raghavendra K T
Date: Mon Feb 27 2023 - 23:51:33 EST


This patchset implements one of the enhancements to NUMA vma scanning
suggested by Mel. It is a continuation of [3].

In the existing mechanism, the scan period is derived from per-thread
stats. Process Adaptive autoNUMA [1] proposed gathering NUMA fault stats
at the per-process level to capture application behaviour better.

During the course of that discussion, Mel proposed several ideas to
enhance current NUMA balancing. One of the suggestions was the following:

Track which threads access a VMA. The suggestion was to use an unsigned
long pid_mask and use the lower bits to tag approximately which
threads access a VMA. Skip VMAs that did not trap a fault. This would
be approximate because of PID collisions but would reduce scanning of
areas the thread is not interested in. The suggestion intends not
to penalize threads that have no interest in the VMA, thus reducing
scanning overhead.

V3 changes are mostly based on PeterZ comments (details below in
changes)

Summary of patchset:
Current patchset implements:
1. Delay the vma scanning logic for newly created VMAs so that the
additional overhead of scanning is not incurred for short-lived tasks
(implementation by Mel)

2. Store the information of tasks accessing a VMA in 2 windows. The
history is cleared at a (4*sysctl_numa_balancing_scan_delay) interval.
This interval was derived from experimentation (suggested by PeterZ) to
balance frequent clearing against obsolete access data

3. hash_32() is used to encode the index of the task accessing the VMA

4. The VMA's access information is used to skip scanning for tasks
which have not accessed the VMA

Things to ponder over:
==========================================
- Improvement to the logic for clearing accessing PIDs (discussed in
detail in patch3 itself; done in this patchset by implementing a
2-window history)

- The current scan period is not changed in the patchset, so we still see
frequent scan attempts. Relaxing the scan period dynamically could
improve results further.

[1] sched/numa: Process Adaptive autoNUMA
Link: https://lore.kernel.org/lkml/20220128052851.17162-1-bharata@xxxxxxx/T/

[2] RFC V1 Link:
https://lore.kernel.org/all/cover.1673610485.git.raghavendra.kt@xxxxxxx/

[3] V2 Link:
https://lore.kernel.org/lkml/cover.1675159422.git.raghavendra.kt@xxxxxxx/

Changes since V2:
patch1:
- Rename structure, convert macro to function
- Add explanation of the heuristics
- Add more details from results (PeterZ)
Patch2:
- Usage of test and set bit (PeterZ)
- Move storing access PID info to numa_migrate_prep()
- Add a note on fairness among tasks allowed to scan
(PeterZ)
Patch3:
- Maintain two windows of access PID information
(PeterZ supported the implementation and gave the idea to extend
to N windows if needed)
Patch4:
- Apply hash_32 function to track VMA accessing PIDs (PeterZ)

Changes since RFC V1:
- Include Mel's vma scan delay patch
- Change the accessing pid store logic (Thanks Mel)
- Fencing structure / code to NUMA_BALANCING (David, Mel)
- Adding clearing access PID logic (Mel)
- Descriptive change log (Mike Rapoport)

Results:
Summary: Huge autonuma cost reduction seen in mmtests. Kernbench and
dbench improvements are around 5%, with a huge (80%+) system time
improvement in the mmtests autonuma run.

kernbench
=============
6.1.0-base 6.1.0-patched
Amean user-256 22437.65 ( 0.00%) 22622.16 * -0.82%*
Amean syst-256 9290.30 ( 0.00%) 8763.85 * 5.67%*
Amean elsp-256 159.36 ( 0.00%) 157.44 * 1.20%*

Duration User 67322.16 67876.18
Duration System 27884.89 26306.28
Duration Elapsed 498.95 494.42

Ops NUMA alloc hit 1738904367.00 1738882062.00
Ops NUMA alloc local 1738904104.00 1738881490.00
Ops NUMA base-page range updates 440526.00 272095.00
Ops NUMA PTE updates 440526.00 272095.00
Ops NUMA hint faults 109109.00 55630.00
Ops NUMA hint local faults % 5474.00 196.00
Ops NUMA hint local percent 5.02 0.35
Ops NUMA pages migrated 103400.00 55434.00
Ops AutoNUMA cost 550.59 281.11

autonumabench
===============
6.1.0-base 6.1.0-patched
Amean syst-NUMA01 252.55 ( 0.00%) 27.71 * 89.03%*
Amean syst-NUMA01_THREADLOCAL 0.20 ( 0.00%) 0.23 * -12.77%*
Amean syst-NUMA02 0.91 ( 0.00%) 0.76 * 16.22%*
Amean syst-NUMA02_SMT 0.67 ( 0.00%) 0.67 * -1.07%*
Amean elsp-NUMA01 269.93 ( 0.00%) 309.44 * -14.64%*
Amean elsp-NUMA01_THREADLOCAL 1.05 ( 0.00%) 1.07 * -1.36%*
Amean elsp-NUMA02 3.26 ( 0.00%) 3.29 * -0.79%*
Amean elsp-NUMA02_SMT 3.73 ( 0.00%) 3.52 * 5.64%*

Duration User 318683.69 330084.06
Duration System 1780.77 206.14
Duration Elapsed 1954.30 2233.06


Ops NUMA alloc hit 62237331.00 49179090.00
Ops NUMA alloc local 62235222.00 49177092.00
Ops NUMA base-page range updates 85303091.00 29242.00
Ops NUMA PTE updates 85303091.00 29242.00
Ops NUMA hint faults 87457481.00 8302.00
Ops NUMA hint local faults % 66665145.00 6064.00
Ops NUMA hint local percent 76.23 73.04
Ops NUMA pages migrated 9348511.00 2232.00
Ops AutoNUMA cost 438062.15 41.76

dbench
========
dbench -t 90 <nproc>

Throughput
#clients base patched %improvement
1 842.655 MB/sec 922.305 MB/sec 9.45
16 5062.82 MB/sec 5079.85 MB/sec 0.34
64 9408.81 MB/sec 9980.89 MB/sec 6.08
256 7076.59 MB/sec 7590.76 MB/sec 7.26

Mel Gorman (1):
sched/numa: Apply the scan delay to every new vma

Raghavendra K T (3):
sched/numa: Enhance vma scanning logic
sched/numa: implement access PID reset logic
sched/numa: Use hash_32 to mix up PIDs accessing VMA

include/linux/mm.h | 30 +++++++++++++++++++++
include/linux/mm_types.h | 9 +++++++
kernel/fork.c | 2 ++
kernel/sched/fair.c | 57 ++++++++++++++++++++++++++++++++++++++++
mm/memory.c | 3 +++
5 files changed, 101 insertions(+)

--
2.34.1