Re: [RFC -V2 3/8] autonuma, memory tiering: Use kswapd to demote cold pages to PMEM

From: Huang\, Ying
Date: Wed Feb 19 2020 - 01:05:19 EST


Mel Gorman <mgorman@xxxxxxx> writes:

> On Tue, Feb 18, 2020 at 04:26:29PM +0800, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@xxxxxxxxx>
>>
>> In a memory tiering system, if the memory size of the workloads is
>> smaller than that of the faster memory (e.g. DRAM) nodes, all pages of
>> the workloads should be put in the faster memory nodes. But this
>> makes it unnecessary to use slower memory (e.g. PMEM) at all.
>>
>> So in common cases, the memory size of the workload should be larger
>> than that of the faster memory nodes. And to optimize the
>> performance, the hot pages should be promoted to the faster memory
>> nodes while the cold pages should be demoted to the slower memory
>> nodes. To achieve that, we have two choices,
>>
>> a. Promote the hot pages from the slower memory node to the faster
>> memory node. This will create some memory pressure in the faster
>> memory node, thus trigger the memory reclaiming, where the cold
>> pages will be demoted to the slower memory node.
>>
>> b. Demote the cold pages from faster memory node to the slower memory
>> node. This will create some free memory space in the faster memory
>> node, and the hot pages in the slower memory node could be promoted
>> to the faster memory node.
>>
>> The choice "a" will create the memory pressure in the faster memory
>> node. If the memory pressure of the workload is high too, the memory
>> pressure may become so high that the memory allocation latency of the
>> workload is influenced, e.g. the direct reclaiming may be triggered.
>>
>> The choice "b" works much better at this aspect. If the memory
>> pressure of the workload is high, it will consume the free memory and
>> the hot pages promotion will stop earlier if its allocation watermark
>> is higher than that of the normal memory allocation.
>>
>> In this patch, choice "b" is implemented. If memory tiering NUMA
>> balancing mode is enabled, the node isn't the slowest node, and the
>> free memory size of the node is below the high watermark, the kswapd
>> of the node will be waken up to free some memory until the free memory
>> size is above the high watermark + autonuma promotion rate limit. If
>> the free memory size is below the high watermark, autonuma promotion
>> will stop working. This avoids to create too much memory pressure to
>> the system.
>>
>> Signed-off-by: "Huang, Ying" <ying.huang@xxxxxxxxx>
>
> Unfortunately I stopped reading at this point. It depends on another series
> entirely and they really need to be presented together instead of relying
> on searching mail archives to find other patches to try assemble the full
> picture :(. Ideally each stage would have supporting data showing roughly
> how it behaves at each major stage. I know this will be a pain but the
> original NUMA balancing had the same problem and ultimately started with
> one large series that got the basics right followed by other series that
> improved it in stages. That process is *still* ongoing today.

Sorry for inconvenience, we will post a new patchset including both
series and add supporting data at each major stage when possible.

Best Regards,
Huang, Ying