Re: [RFC -V5 1/6] NUMA balancing: optimize page placement for memory tiering system

From: Huang, Ying
Date: Fri Feb 05 2021 - 03:39:16 EST


Hillf Danton <hdanton@xxxxxxxx> writes:

> On Thu, 4 Feb 2021 18:10:51 +0800 Huang Ying wrote:
>> With the advent of various new memory types, some machines will have
>> multiple types of memory, e.g. DRAM and PMEM (persistent memory). The
>> memory subsystem of such a machine can be called a memory tiering
>> system, because the performance of the different types of memory is
>> usually different.
>>
>> In such a system, because the memory access pattern changes over
>> time, some pages in the slow memory may become hot globally. So in
>> this patch, the NUMA balancing mechanism is enhanced to optimize page
>> placement among the different memory types dynamically according to
>> the hotness of the pages.
>>
>> In a typical memory tiering system, there are CPUs, fast memory, and
>> slow memory in each physical NUMA node. The CPUs and the fast memory
>> will be put in one logical node (called the fast memory node), while
>> the slow memory will be put in another (fake) logical node (called
>> the slow memory node). That is, the fast memory is regarded as local
>> while the slow memory is regarded as remote. So it's possible for
>> recently accessed pages in the slow memory node to be promoted to the
>> fast memory node via the existing NUMA balancing mechanism.
>>
>> The original NUMA balancing mechanism stops migrating pages if the
>> free memory of the target node would fall below the high watermark.
>> This is a reasonable policy if there's only one memory type. But it
>> makes the original NUMA balancing mechanism almost ineffective for
>> optimizing page placement among different memory types. Details are
>> as follows.
>>
>> In the common case, the working-set size of the workload is larger
>> than the size of the fast memory nodes. Otherwise, it would be
>> unnecessary to use the slow memory at all. So in the common case,
>> there are almost never enough free pages in the fast memory nodes,
>> so that the globally hot pages in the slow memory node cannot be
>
> Under assumptions like
>
> 1/ the workload's working-set size is 1.5x larger than one DRAM node,
> 2/ PMEM is 10x (or 5x) larger than DRAM,
>
> what difference would it make if the spinning hard disk swap were
> replaced with PMEM? With PMEM as swap, page demotion becomes swap-out,
> and we pay nothing for page promotion.

Per my understanding, the difference between PMEM as swap and
accessing PMEM directly + promotion is as follows.

PMEM as swap:

- PMEM will not be accessed directly; any DRAM miss will trigger a
swap-in. That is, one cache line access is inflated into a 4KB page
access, a 64x amplification (4096 / 64 = 64; see the sketch after this
list). And direct page reclaim may be triggered, so the access latency
is almost unbounded.

- The good part is that if a PMEM page is very hot, we will put the
page in DRAM at the first access.
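
To make the arithmetic concrete, here is a minimal user-space sketch
(not kernel code) of the read amplification above; the 64-byte cache
line and 4KB page are the usual x86 values:

    #include <stdio.h>

    int main(void)
    {
            const unsigned int cache_line = 64;   /* bytes actually needed */
            const unsigned int page_size = 4096;  /* bytes moved by a swap-in */

            /* Every DRAM miss swaps in a whole page to satisfy one
             * cache line access, a 64x read amplification. */
            printf("amplification: %u / %u = %ux\n",
                   page_size, cache_line, page_size / cache_line);
            return 0;
    }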

promotion + accessing PMEM directly:

- PMEM may be accessed directly. The latency of PMEM is longer than
that of DRAM, but much smaller than that of swapping in. And we avoid
triggering direct reclaim for page promotion.

- The bad part is that a very hot PMEM page may be accessed directly
for a while before being promoted to DRAM, because it takes some time
to identify whether a page is hot or not (see the sketch below).
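
Here is a simplified user-space sketch of that promotion flow. The
helper names (handle_numa_hint_fault, page_is_hot) and the two-access
hotness threshold are hypothetical stand-ins, not the kernel's actual
mm/ interfaces:

    #include <stdbool.h>
    #include <stdio.h>

    struct page_info {
            bool on_slow_node;            /* currently resident in PMEM */
            unsigned int recent_accesses; /* sampled via NUMA hint faults */
    };

    /* Hypothetical hotness test: promote only after repeated accesses,
     * so one touch of a cold page does not drag it into DRAM. */
    static bool page_is_hot(const struct page_info *p)
    {
            return p->recent_accesses >= 2;
    }

    static void handle_numa_hint_fault(struct page_info *p)
    {
            if (!p->on_slow_node)
                    return;       /* already in DRAM, nothing to do */

            p->recent_accesses++;

            if (page_is_hot(p)) {
                    /* Promote; no direct reclaim is forced here. */
                    p->on_slow_node = false;
                    printf("promoted page to DRAM\n");
            } else {
                    /* Keep serving it from PMEM: slower than DRAM, but
                     * far cheaper than a 4KB swap-in per miss. */
                    printf("left page in PMEM (not hot yet)\n");
            }
    }

    int main(void)
    {
            struct page_info p = { .on_slow_node = true };

            handle_numa_hint_fault(&p);   /* 1st access: stays in PMEM */
            handle_numa_hint_fault(&p);   /* 2nd access: promoted */
            return 0;
    }

The point of the threshold is the trade-off described above: a very
hot page eats a few slow PMEM accesses before promotion, but a cold
page never reaches the threshold and never pollutes DRAM.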

In other words, swap can guarantee that the very hot pages are always
accessed in DRAM, while the promotion + direct PMEM access solution
can avoid moving very cold pages to DRAM, so that page thrashing is
avoided.

If the pages we put in PMEM are almost never accessed, then PMEM as
swap may be a suitable solution too. But if they are accessed,
promotion + direct PMEM access generally works better.

Best Regards,
Huang, Ying

[snip]