Re: RFC: Memory Tiering Kernel Interfaces

From: Baolin Wang
Date: Mon May 02 2022 - 22:06:38 EST

On 5/2/2022 1:58 AM, Davidlohr Bueso wrote:
Nice summary, thanks. I don't know which of the interested parties will be
at lsfmm, but fyi we have a couple of sessions on memory tiering Tuesday
at 14:00 and 15:00.

On Fri, 29 Apr 2022, Wei Xu wrote:

The current kernel has the basic memory tiering support: Inactive
pages on a higher tier NUMA node can be migrated (demoted) to a lower
tier NUMA node to make room for new allocations on the higher tier
NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
migrated (promoted) to a higher tier NUMA node to improve the
performance.

Regardless of the promotion algorithm, at some point I see the NUMA hinting
fault mechanism being in the way of performance. It would be nice if hardware
began giving us page "heatmaps" instead of having to rely on faulting- or
sampling-based ways to identify hot memory.

A tiering relationship between NUMA nodes in the form of demotion path
is created during the kernel initialization and updated when a NUMA
node is hot-added or hot-removed.  The current implementation puts all
nodes with CPU into the top tier, and then builds the tiering hierarchy
tier-by-tier by establishing the per-node demotion targets based on
the distances between nodes.
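
Roughly, that distance-based target selection could be sketched as below
(an illustrative simplification using the kernel's nodemask helpers, not
the actual mm/migrate.c code, and omitting the tier-by-tier pass that
avoids cycles):

static int pick_demotion_target(int node)
{
	int nid, best = NUMA_NO_NODE;
	int best_dist = INT_MAX;

	for_each_node_state(nid, N_MEMORY) {
		/* In the current scheme, nodes with CPUs are top tier. */
		if (nid == node || node_state(nid, N_CPU))
			continue;
		/* Prefer the closest candidate by node_distance(). */
		if (node_distance(node, nid) < best_dist) {
			best_dist = node_distance(node, nid);
			best = nid;
		}
	}
	return best;
}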

The current memory tiering interface needs to be improved to address
several important use cases:

* The current tiering initialization code always initializes
 each memory-only NUMA node into a lower tier.  But a memory-only
 NUMA node may have a high-performance memory device (e.g. a DRAM
 device attached via CXL.mem or a DRAM-backed memory-only node on
 a virtual machine) and should be put into the top tier.

At least CXL memory (volatile or not) will still be slower than regular
DRAM, so I don't think we'd want this to be top tier. But in general, yes,
I agree that defining the top tier by whether or not the node has a CPU is
a bit limiting, as you've detailed here.

Tiering Hierarchy Initialization
================================

By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).

A device driver can remove its memory nodes from the top tier, e.g.
a dax driver can remove PMEM nodes from the top tier.

The kernel builds the memory tiering hierarchy and per-node demotion
order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
best distance nodes in the next lower tier are assigned to
node_demotion[N].preferred and all the nodes in the next lower tier
are assigned to node_demotion[N].allowed.

node_demotion[N].preferred can be empty if no preferred demotion node
is available for node N.
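
For concreteness, the per-node demotion state described above could look
roughly like this (field names follow the RFC text; the actual in-kernel
data structure may differ):

struct demotion_nodes {
	nodemask_t preferred;	/* best-distance nodes in the next lower tier */
	nodemask_t allowed;	/* all nodes in the next lower tier */
};

static struct demotion_nodes node_demotion[MAX_NUMNODES];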

In cases where there is more than one possible demotion node (with equal
cost), I'm wondering if we want to do something better than choosing
randomly, like we do now - perhaps round robin? Of course anything
like this will require actual performance data, something I have seen
very little of.

I've tried using round robin [1] to select a target demotion node when
there are multiple demotion nodes; however, I did not see any obvious
performance gain with MySQL testing. Maybe we should try other test suites?

[1] https://lore.kernel.org/all/c02bcbc04faa7a2c852534e9cd58a91c44494657.1636016609.git.baolin.wang@xxxxxxxxxxxxxxxxx/
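
For illustration, a round-robin pick over the allowed mask (reusing the
preferred/allowed layout sketched earlier) could look like the snippet
below; the per-source cursor and the function name are made up for this
example and are not taken from the patch in [1]:

static int next_demotion_node_rr(int node)
{
	struct demotion_nodes *nd = &node_demotion[node];
	/* Hypothetical per-source cursor to rotate among equal-cost targets. */
	static int last_target[MAX_NUMNODES];
	int target;

	/* next_node_in() wraps around the nodemask. */
	target = next_node_in(last_target[node], nd->allowed);
	if (target >= MAX_NUMNODES)
		return NUMA_NO_NODE;

	last_target[node] = target;
	return target;
}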