[RFC PATCH v3 0/4] Node Weights and Weighted Interleave

From: Gregory Price
Date: Mon Oct 30 2023 - 20:38:31 EST


This patchset implements weighted interleave and adds a new sysfs
entry: /sys/devices/system/node/nodeN/accessM/il_weight.

The il_weight of a node is used by mempolicy to implement weighted
interleave when `numactl --interleave=...` is invoked. By default
il_weight for a node is always 1, which preserves the default round
robin interleave behavior.

Interleave weights may be set from 0-100, and denote the number of
pages that should be allocated from the node when interleaving
occurs.

For example, if a node's interleave weight is set to 5, 5 pages
will be allocated from that node before the next node is scheduled
for allocations.
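
The page ordering described above can be modelled with a short userspace
sketch. This is an illustration only, not the code in mm/mempolicy.c; the
names weighted_interleave and first_pages are made up for this example, and
skipping a node whose weight is 0 is an assumption on my part.

```python
from itertools import islice

def weighted_interleave(nodes, weights):
    """Yield node ids forever, taking weights[node] pages from each node in turn.

    Nodes missing from `weights` fall back to a weight of 1, which gives
    plain round-robin. A weight of 0 skips the node (assumption).
    """
    if sum(weights.get(node, 1) for node in nodes) <= 0:
        raise ValueError("at least one node needs a non-zero weight")
    while True:
        for node in nodes:
            for _ in range(weights.get(node, 1)):
                yield node

def first_pages(nodes, weights, count):
    """Return the nodes chosen for the first `count` page allocations."""
    return list(islice(weighted_interleave(nodes, weights), count))
```

With node 0 weighted 5 and node 1 left at the default, first_pages([0, 1],
{0: 5}, 12) places five pages on node 0, one on node 1, and then repeats.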

Additionally, "node accessors" (synonymous with cpu nodes) are used
to allow for accessor-relative weighting. The "accessor" for a task
is defined as the node the task is presently running on.

# Set node weight for node0 accessed by tasks on node0 to 5
echo 5 > /sys/devices/system/node/node0/access0/il_weight

# Set node weight for node0 accessed by tasks on node1 to 3
echo 3 > /sys/devices/system/node/node0/access1/il_weight
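
For scripting, the same writes can be wrapped in a small helper. This is a
sketch that assumes the sysfs layout proposed in this series; il_weight_path
and set_il_weight are hypothetical names, and range checking is left to the
kernel.

```python
from pathlib import Path

# Location of the per-node sysfs directories, as used in the echo commands above.
SYSFS_NODE_ROOT = Path("/sys/devices/system/node")

def il_weight_path(node, accessor, root=SYSFS_NODE_ROOT):
    """Build the il_weight path for `node` as seen from tasks on `accessor`."""
    return Path(root) / f"node{node}" / f"access{accessor}" / "il_weight"

def set_il_weight(node, accessor, weight, root=SYSFS_NODE_ROOT):
    """Write `weight` to the il_weight file (normally requires root)."""
    il_weight_path(node, accessor, root).write_text(f"{int(weight)}\n")
```

set_il_weight(0, 1, 3) is then equivalent to the second echo command above.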

In this way it becomes possible to set an interleaving strategy
that fits the available bandwidth for the devices available on
the system. An example system:

Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex
Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex

In this setup, the effective weights for nodes 0-3 for a task
running on Node 0 may be [60, 20, 10, 10].

This spreads memory out across devices which all have different
latency and bandwidth attributes in a way that can maximize the
available resources.
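
To see what a set of weights means in practice, the fraction of interleaved
pages that lands on each node is simply weight / sum(weights). A minimal
sketch (page_share is a hypothetical helper, not part of the series):

```python
def page_share(weights):
    """Return the fraction of interleaved pages each node receives."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be non-zero")
    return [weight / total for weight in weights]
```

For the example weights [60, 20, 10, 10], a task on Node 0 would place 60%
of its interleaved pages on Node 0, 20% on Node 1, and 10% on each CXL node.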

~Gregory

(sorry for the repeat send, automation failure)

================================================================

Version Notes:

v3: move weights into node rather than memtiers
some additional fixes to node.c to support this

v1/v2: add weighted-interleave support to mempolicy

= v3 notes

This update effectively removes the connection between mempolicy
and memory-tiers by simply placing the interleave weights directly
in the node accessor information structure.

Node was recommended by Huang, Ying
Accessor was recommended by Ravi Shankar


== Move weights into node

Originally this work was done by placing weights in the memory tier.
In this patch set we changed the weights to live in the numa node
accessor structure, which allows for a more natural weighting scheme
and also supports source-node relative weighting.

Interleave weight is located in:
/sys/devices/system/node/nodeN/accessM/il_weight

and is set with a value between 1 and 100:

# Set node weight for node0 accessed by node0 to 5
echo 5 > /sys/devices/system/node/node0/access0/il_weight

By default, il_weight is always set to 1, which mimics the default
interleave behavior (simple round-robin).


== Other Node fixes

2 other updates to node.c were required to support this:

1) The access list must be initialized prior to the node struct
pointer being registered in the node array

2) The accessors in the list must be registered regardless of
whether HMAT/HMEM information is reported. Presently this
results in 0-value information being present in the various
access subgroups


== Weighted interleave

mm/mempolicy: modify interleave mempolicy to use node weights

The node subsystem implements interleave weighting for the purpose
of bandwidth optimization. Each node may have different weights in
relation to each compute node ("access node").

The mempolicy MPOL_INTERLEAVE utilizes the node weights to implement
weighted interleave. By default, since all nodes default to a weight
of 1, the original interleave behavior is retained.

Examples

Weight settings:
echo 4 > node0/access0/il_weight
echo 3 > node1/access0/il_weight
echo 2 > node1/access1/il_weight
echo 1 > node0/access1/il_weight

Results:

Task A:
cpunode: 0
nodemask: [0,1]
weights: [4,3]
allocation result: [0,0,0,0,1,1,1 repeat]

Task B:
cpunode: 1
nodemask: [0,1]
weights: [1,2]
allocation result: [0,1,1 repeat]
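
The two results above can be reproduced with a small model of the
accessor-relative lookup. This is a userspace sketch, not the mempolicy
code; IL_WEIGHT and interleave_cycle are hypothetical names, and the table
just mirrors the four echo commands in this example.

```python
# Hypothetical in-memory copy of the il_weight settings above,
# indexed as IL_WEIGHT[target_node][accessor_node].
IL_WEIGHT = {
    0: {0: 4, 1: 1},
    1: {0: 3, 1: 2},
}

def interleave_cycle(cpunode, nodemask, table=IL_WEIGHT):
    """Return one full interleave cycle for a task running on `cpunode`.

    Each node in the nodemask contributes as many pages as its weight
    relative to the task's accessor node; unset entries default to 1.
    """
    cycle = []
    for node in sorted(nodemask):
        weight = table.get(node, {}).get(cpunode, 1)
        cycle.extend([node] * weight)
    return cycle
```

interleave_cycle(0, [0, 1]) gives the Task A pattern and
interleave_cycle(1, [0, 1]) gives the Task B pattern; both then repeat.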

= original RFCs

Memory-tier based weights
By: Ravi Shankar
https://lore.kernel.org/all/20230927095002.10245-1-ravis.opensrc@xxxxxxxxxx/

Mempolicy multi-node weighting w/ set_mempolicy2:
By: Gregory Price
https://lore.kernel.org/all/20231003002156.740595-1-gregory.price@xxxxxxxxxxxx/

N:M weighting in mempolicy
By: Hasan Al Maruf
https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@xxxxxxxxxxx/T/

Ying Huang's presentation in lpc22, 16th slide in
https://lpc.events/event/16/contributions/1209/attachments/1042/1995/\
Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf

Gregory Price (4):
base/node.c: initialize the accessor list before registering
node: add accessors to sysfs when nodes are created
node: add interleave weights to node accessor
mm/mempolicy: modify interleave mempolicy to use node weights

drivers/base/node.c | 120 ++++++++++++++++++++++++++++++++-
include/linux/mempolicy.h | 4 ++
include/linux/node.h | 17 +++++
mm/mempolicy.c | 138 +++++++++++++++++++++++++++++---------
4 files changed, 246 insertions(+), 33 deletions(-)

--
2.39.1