Re: [PATCH v7 11/12] mm/demotion: Add documentation for memory tiering

From: Aneesh Kumar K.V
Date: Mon Jun 27 2022 - 00:41:41 EST


Bagas Sanjaya <bagasdotme@xxxxxxxxx> writes:

> On Wed, Jun 22, 2022 at 01:55:12PM +0530, Aneesh Kumar K.V wrote:
>> From: Jagdish Gediya <jvgediya@xxxxxxxxxxxxx>
>>
>
> Hi Aneesh and Jagdish,
>
> The documentation can be improved, see below.
>
>> All N_MEMORY nodes are divided into 3 memoty tiers with tier ID value
>> MEMORY_TIER_HBM_GPU, MEMORY_TIER_DRAM and MEMORY_TIER_PMEM. By default,
>> all nodes are assigned to default memory tier.
>>
>> Demotion path for all N_MEMORY nodes is prepared based on the tier ID value
>> of memory tiers.
>>
>> This patch adds documention for memory tiering introduction, its sysfs
>> interfaces and how demotion is performed based on memory tiers.
>>
>
> I think the patch message should just be:
> "Add documentation for memory tiering. It also covers its sysfs
> interfaces and how demotion is performed based on memory tiers."
>
>> +===========
>> +Memory tiers
>> +============
>> +
>> +This document describes explicit memory tiering support along with
>> +demotion based on memory tiers.
>> +
>
> This causes htmldocs error, for which I have applied the fixup at [1].
>
>> +Memory nodes are divided into 3 types of memory tiers with tier ID
>> +value as shown based on their hardware characteristics.
>> +
>> +
>> +MEMORY_TIER_HBM_GPU
>> +MEMORY_TIER_DRAM
>> +MEMORY_TIER_PMEM
>> +
>
> Use bullet list.
>
>> +Sysfs interfaces
>> +================
>> +
>> +Nodes belonging to specific tier can be read from,
>> +/sys/devices/system/memtier/memtierN/nodelist (Read-Only)
>> +
>> +Where N is 0 - 2.
>
> The "where" sentence can be compounded into the previous sentence above.
>
>> +
>> +Example 1:
>> +For a system where Node 0 is CPU + DRAM nodes, Node 1 is HBM node,
>> +node 2 is a PMEM node an ideal tier layout will be
>> +
>> +$ cat /sys/devices/system/memtier/memtier0/nodelist
>> +1
>> +$ cat /sys/devices/system/memtier/memtier1/nodelist
>> +0
>> +$ cat /sys/devices/system/memtier/memtier2/nodelist
>> +2
>> +
>
> The code snippets should have been inside literal code blocks.
>
>> +Example 2:
>> +For a system where Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM
>> +nodes.
>> +
>> +$ cat /sys/devices/system/memtier/memtier0/nodelist
>> +cat: /sys/devices/system/memtier/memtier0/nodelist: No such file or
>> +directory
>> +$ cat /sys/devices/system/memtier/memtier1/nodelist
>> +0-1
>> +$ cat /sys/devices/system/memtier/memtier2/nodelist
>> +2-3
>> +
>
> Use literal code block.
>
>> +Default memory tier can be read from,
>> +/sys/devices/system/memtier/default_tier (Read-Only)
>> +
>> +e.g.
>> +$ cat /sys/devices/system/memtier/default_tier
>> +memtier200
>> +
>> +Max memory tier ID supported can be read from,
>> +/sys/devices/system/memtier/max_tier (Read-Only)
>> +
>> +e.g.
>> +$ cat /sys/devices/system/memtier/max_tier
>> +400
>> +
>> +Individual node's memory tier can be read of set using,
>> +/sys/devices/system/node/nodeN/memtier (Read-Write)
>> +
>> +where N = node id
>> +
>> +When this interface is written, Node is moved from the old memory tier
>> +to new memory tier and demotion targets for all N_MEMORY nodes are
>> +built again.
>> +
>> +For example 1 mentioned above,
>> +$ cat /sys/devices/system/node/node0/memtier
>> +1
>> +$ cat /sys/devices/system/node/node1/memtier
>> +0
>> +$ cat /sys/devices/system/node/node2/memtier
>> +2
>> +
>
> The same suggestions above apply here, too.
>
>> +Enable/Disable demotion
>> +-----------------------
>> +
>> +By default demotion is disabled, it can be enabled/disabled using
>> +below sysfs interface,
>> +
>> +$ echo 0/1 or false/true > /sys/kernel/mm/numa/demotion_enabled
>> +
>
> Use literal code block.
>
>> +preferred and allowed demotion nodes
>> +------------------------------------
>> +
>> +Preferred nodes for a specific N_MEMORY node are the best nodes
>> +from the next possible lower memory tier. Allowed nodes for any
>> +node are all the nodes available in all possible lower memory
>> +tiers.
>> +
>> +Example:
>> +
>> +For a system where Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM
>> +nodes,
>> +
>> +node distances:
>> +node 0 1 2 3
>> + 0 10 20 30 40
>> + 1 20 10 40 30
>> + 2 30 40 10 40
>> + 3 40 30 40 10
>> +
>
> Use reST table.
>
>> +memory_tiers[0] = <empty>
>> +memory_tiers[1] = 0-1
>> +memory_tiers[2] = 2-3
>> +
>> +node_demotion[0].preferred = 2
>> +node_demotion[0].allowed = 2, 3
>> +node_demotion[1].preferred = 3
>> +node_demotion[1].allowed = 3, 2
>> +node_demotion[2].preferred = <empty>
>> +node_demotion[2].allowed = <empty>
>> +node_demotion[3].preferred = <empty>
>> +node_demotion[3].allowed = <empty>
>> +
>
> What are these above? Node properties? BTW, use literal code block.
>
> If you don't understand these suggestions above, here is the diff:

I got with the below diff.
patch: **** malformed patch at line 180: @@ -148,35 +153,40 @@ from the next possible lower memory tier. Allowed nodes for any

But I did modify the documentation based on your feedback and it is much
better than what I had. Thanks for the review. I will send v8 with the
changes folded. I did add the below to commit message. Hope that is ok.

[update doc format by Bagas Sanjaya <bagasdotme@xxxxxxxxx>]

>
> ---- >8 ----
>
> diff --git a/Documentation/admin-guide/mm/memory-tiering.rst b/Documentation/admin-guide/mm/memory-tiering.rst
> index 0a75e0dab1fd8e..10ec5aab6ddd53 100644
> --- a/Documentation/admin-guide/mm/memory-tiering.rst
> +++ b/Documentation/admin-guide/mm/memory-tiering.rst
> @@ -14,13 +14,13 @@ Introduction
>
> Many systems have multiple types of memory devices e.g. GPU, DRAM and
> PMEM. The memory subsystem of these systems can be called a memory
> -tiering system because the performance of the different types of
> +tiering system because the performance of each type of
> memory is different. Memory tiers are defined based on the hardware
> capabilities of memory nodes. Each memory tier is assigned a tier ID
> value that determines the memory tier position in demotion order.
>
> The memory tier assignment of each node is independent of each
> -other. Moving a node from one tier to another tier doesn't affect
> +other. Moving a node from one tier to another doesn't affect
> the tier assignment of any other node.
>
> Memory tiers are used to build the demotion targets for nodes. A node
> @@ -32,10 +32,9 @@ Memory tier rank
> Memory nodes are divided into 3 types of memory tiers with tier ID
> value as shown based on their hardware characteristics.
>
> -
> -MEMORY_TIER_HBM_GPU
> -MEMORY_TIER_DRAM
> -MEMORY_TIER_PMEM
> + * MEMORY_TIER_HBM_GPU
> + * MEMORY_TIER_DRAM
> + * MEMORY_TIER_PMEM
>
> Memory tiers initialization and (re)assignments
> ===============================================
> @@ -49,68 +48,73 @@ hotplug, the memory tier with default tier ID is assigned to the memory node.
> Sysfs interfaces
> ================
>
> -Nodes belonging to specific tier can be read from,
> -/sys/devices/system/memtier/memtierN/nodelist (Read-Only)
> +Nodes belonging to specific tier can be read from
> +/sys/devices/system/memtier/memtierN/nodelist, where N is 0 - 2 (read-only)
>
> -Where N is 0 - 2.
> +Examples:
>
> -Example 1:
> -For a system where Node 0 is CPU + DRAM nodes, Node 1 is HBM node,
> -node 2 is a PMEM node an ideal tier layout will be
> +1. On a system where Node 0 is CPU + DRAM nodes, Node 1 is HBM node,
> + node 2 is a PMEM node an ideal tier layout will be:
>
> -$ cat /sys/devices/system/memtier/memtier0/nodelist
> -1
> -$ cat /sys/devices/system/memtier/memtier1/nodelist
> -0
> -$ cat /sys/devices/system/memtier/memtier2/nodelist
> -2
> + .. code-block::
>
> -Example 2:
> -For a system where Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM
> -nodes.
> + $ cat /sys/devices/system/memtier/memtier0/nodelist
> + 1
> + $ cat /sys/devices/system/memtier/memtier1/nodelist
> + 0
> + $ cat /sys/devices/system/memtier/memtier2/nodelist
> + 2
>
> -$ cat /sys/devices/system/memtier/memtier0/nodelist
> -cat: /sys/devices/system/memtier/memtier0/nodelist: No such file or
> -directory
> -$ cat /sys/devices/system/memtier/memtier1/nodelist
> -0-1
> -$ cat /sys/devices/system/memtier/memtier2/nodelist
> -2-3
> +2. On a system where Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM
> + nodes:
>
> -Default memory tier can be read from,
> -/sys/devices/system/memtier/default_tier (Read-Only)
> + .. code-block::
>
> -e.g.
> -$ cat /sys/devices/system/memtier/default_tier
> -memtier200
> + $ cat /sys/devices/system/memtier/memtier0/nodelist
> + cat: /sys/devices/system/memtier/memtier0/nodelist: No such file or
> + directory
> + $ cat /sys/devices/system/memtier/memtier1/nodelist
> + 0-1
> + $ cat /sys/devices/system/memtier/memtier2/nodelist
> + 2-3
>
> -Max memory tier ID supported can be read from,
> -/sys/devices/system/memtier/max_tier (Read-Only)
> +Default memory tier can be read from
> +/sys/devices/system/memtier/default_tier (read-only), e.g.:
>
> -e.g.
> -$ cat /sys/devices/system/memtier/max_tier
> -400
> +.. code-block::
>
> -Individual node's memory tier can be read of set using,
> -/sys/devices/system/node/nodeN/memtier (Read-Write)
> + $ cat /sys/devices/system/memtier/default_tier
> + memtier200
>
> -where N = node id
> +Max memory tier ID supported can be read from
> +/sys/devices/system/memtier/max_tier (read-only), e.g.:
>
> -When this interface is written, Node is moved from the old memory tier
> +.. code-block::
> +
> + $ cat /sys/devices/system/memtier/max_tier
> + 400
> +
> +Individual node's memory tier can be read or set using
> +/sys/devices/system/node/nodeN/memtier (read-write), where N = node id.
> +
> +When this interface is written, node is moved from the old memory tier
> to new memory tier and demotion targets for all N_MEMORY nodes are
> built again.
>
> -For example 1 mentioned above,
> -$ cat /sys/devices/system/node/node0/memtier
> -1
> -$ cat /sys/devices/system/node/node1/memtier
> -0
> -$ cat /sys/devices/system/node/node2/memtier
> -2
> +For example 1 mentioned above:
> +
> +.. code-block::
> +
> + $ cat /sys/devices/system/node/node0/memtier
> + 1
> + $ cat /sys/devices/system/node/node1/memtier
> + 0
> + $ cat /sys/devices/system/node/node2/memtier
> + 2
>
> Additional memory tiers can be created by writing a tier ID value to this file.
> -This results in a new memory tier creation and moving the specific NUMA node to
> -that memory tier.
> +This results into creating a new tier and moving the specific NUMA node to
> +that tier.
>
> Demotion
> ========
> @@ -128,19 +132,20 @@ be used.
>
> Instead of a page being discarded during reclaim, it can be moved to
> persistent memory. Allowing page migration during reclaim enables
> -these systems to migrate pages from fast(higher) tiers to slow(lower)
> -tiers when the fast(higher) tier is under pressure.
> +these systems to migrate pages from fast (higher) tiers to slow (lower)
> +tiers when the fast (higher) tier is under pressure.
>
>
> Enable/Disable demotion
> -----------------------
>
> -By default demotion is disabled, it can be enabled/disabled using
> -below sysfs interface,
> +By default demotion is disabled. It can be toggled by:
>
> -$ echo 0/1 or false/true > /sys/kernel/mm/numa/demotion_enabled
> +.. code-block::
>
> -preferred and allowed demotion nodes
> + $ echo 0/1 or false/true > /sys/kernel/mm/numa/demotion_enabled
> +
> +Preferred and allowed demotion nodes
> ------------------------------------
>
> Preferred nodes for a specific N_MEMORY node are the best nodes
> @@ -148,35 +153,40 @@ from the next possible lower memory tier. Allowed nodes for any
> node are all the nodes available in all possible lower memory
> tiers.
>
> -Example:
> +For example, on a system where Node 0 & 1 are CPU + DRAM nodes,
> +node 2 & 3 are PMEM nodes:
>
> -For a system where Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM
> -nodes,
> + * node distances
>
> -node distances:
> -node 0 1 2 3
> - 0 10 20 30 40
> - 1 20 10 40 30
> - 2 30 40 10 40
> - 3 40 30 40 10
> + ==== == == == ==
> + node 0 1 2 3
> + ==== == == == ==
> + 0 10 20 30 40
> + 1 20 10 40 30
> + 2 30 40 10 40
> + 3 40 30 40 10
> + ==== == == == ==
>
> -memory_tiers[0] = <empty>
> -memory_tiers[1] = 0-1
> -memory_tiers[2] = 2-3
> + * node properties
>
> -node_demotion[0].preferred = 2
> -node_demotion[0].allowed = 2, 3
> -node_demotion[1].preferred = 3
> -node_demotion[1].allowed = 3, 2
> -node_demotion[2].preferred = <empty>
> -node_demotion[2].allowed = <empty>
> -node_demotion[3].preferred = <empty>
> -node_demotion[3].allowed = <empty>
> + .. code-block::
> +
> + memory_tiers[0] = <empty>
> + memory_tiers[1] = 0-1
> + memory_tiers[2] = 2-3
> +
> + node_demotion[0].preferred = 2
> + node_demotion[0].allowed = 2, 3
> + node_demotion[1].preferred = 3
> + node_demotion[1].allowed = 3, 2
> + node_demotion[2].preferred = <empty>
> + node_demotion[2].allowed = <empty>
> + node_demotion[3].preferred = <empty>
> + node_demotion[3].allowed = <empty>
>
> Memory allocation for demotion
> ------------------------------
>
> -If a page needs to be demoted from any node, the kernel 1st tries
> -to allocate a new page from the node's preferred node and fallbacks to
> -node's allowed targets in allocation fallback order.
> -
> +If a page needs to be demoted from any node, the kernel first tries
> +to allocate a new page from the node's preferred target node and fallbacks
> +to node's allowed targets in allocation fallback order.
>
>
> Thanks.
>
> [1]: https://lore.kernel.org/linux-doc/YrZ5cTFOSuWxlF2t@xxxxxxxxx/
>
> --
> An old man doll... just what I always wanted! - Clara