Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

From: Dave Hansen
Date: Wed Jul 01 2020 - 12:48:30 EST


On 6/30/20 5:47 PM, David Rientjes wrote:
> On Mon, 29 Jun 2020, Dave Hansen wrote:
>> From: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
>>
>> If a memory node has a preferred migration path to demote cold pages,
>> attempt to move those inactive pages to that migration node before
>> reclaiming. This will better utilize available memory, provide a faster
>> tier than swapping or discarding, and allow such pages to be reused
>> immediately without IO to retrieve the data.
>>
>> When handling anonymous pages, this will be considered before swap if
>> enabled. Should the demotion fail for any reason, the page reclaim
>> will proceed as if the demotion feature was not enabled.
>>
>
> Thanks for sharing these patches and kick-starting the conversation, Dave.
>
> Could this cause us to break a user's mbind() or allow a user to
> circumvent their cpuset.mems?

In its current form, yes.

My current rationale for this is that while it's not as deferential as
it can be to the user/kernel ABI contract, it's good *overall* behavior.
The auto-migration only kicks in when the data is about to go away. So
while the user's data might be slower than they like, it is *WAY* faster
than they deserve because it should be off on the disk.

> Because we don't have a mapping of the page back to its allocation
> context (or the process context in which it was allocated), it seems like
> both are possible.
>
> So let's assume that migration nodes cannot be other DRAM nodes.
> Otherwise, memory pressure could be intentionally or unintentionally
> induced to migrate these pages to another node. Do we have such a
> restriction on migration nodes?

There's nothing explicit. On a normal, balanced system where there's a
1:1:1 relationship between CPU sockets, DRAM nodes and PMEM nodes, it's
implicit since the migration path is one deep and goes from DRAM->PMEM.

If there were some oddball system with a memory-only DRAM node, that
node might very well end up being a migration target.
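
To make the "nothing explicit" part concrete, here is a rough userspace
sketch, not code from this series. The struct, the per-node target table
and all function names are made up for illustration. It models a
one-deep migration path and a check for the case David is worried about,
where a DRAM node ends up as a demotion target:

```c
#include <stddef.h>

#define NO_DEMOTION_TARGET (-1)	/* hypothetical sentinel: end of the path */

enum node_kind { NODE_DRAM_CPU, NODE_DRAM_MEMORY_ONLY, NODE_PMEM };

/* Hypothetical topology: demotion_target[n] is where node n's cold pages go. */
struct topology {
	size_t nr_nodes;
	const enum node_kind *kind;
	const int *demotion_target;
};

/* Return the node that 'node' demotes to, or NO_DEMOTION_TARGET. */
int next_demotion_target(const struct topology *t, int node)
{
	if (node < 0 || (size_t)node >= t->nr_nodes)
		return NO_DEMOTION_TARGET;
	return t->demotion_target[node];
}

/* Count how many hops a page can be demoted starting from 'node'. */
int demotion_depth(const struct topology *t, int node)
{
	int depth = 0;

	/* Bound the walk by nr_nodes so a cyclic table cannot loop forever. */
	while (depth < (int)t->nr_nodes &&
	       (node = next_demotion_target(t, node)) != NO_DEMOTION_TARGET)
		depth++;
	return depth;
}

/* Return 1 if any node demotes to a non-PMEM (DRAM) node, else 0. */
int has_dram_demotion_target(const struct topology *t)
{
	for (size_t n = 0; n < t->nr_nodes; n++) {
		int target = next_demotion_target(t, (int)n);

		if (target != NO_DEMOTION_TARGET && t->kind[target] != NODE_PMEM)
			return 1;
	}
	return 0;
}
```

With two CPU/DRAM nodes each pointing at its own PMEM node, the depth
from either DRAM node is 1 and the check returns 0. With a memory-only
DRAM node as the target, the check returns 1, which is the case where an
explicit restriction would have to reject the path.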

>> Some places we would like to see this used:
>>
>> 1. Persistent memory being used as a slower, cheaper DRAM replacement
>> 2. Remote memory-only "expansion" NUMA nodes
>> 3. Resolving memory imbalances where one NUMA node is seeing more
>> allocation activity than another. This helps keep more recent
>> allocations closer to the CPUs on the node doing the allocating.
>
> (3) is the concerning one given the above if we are to use
> migrate_demote_mapping() for DRAM node balancing.

Yeah, agreed. That's the sketchiest of the three. :)

>> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
>> +{
>> +	/*
>> +	 * 'mask' targets allocation only to the desired node in the
>> +	 * migration path, and fails fast if the allocation can not be
>> +	 * immediately satisfied. Reclaim is already active and heroic
>> +	 * allocation efforts are unwanted.
>> +	 */
>> +	gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
>> +			__GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
>> +			__GFP_MOVABLE;
>
> GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we
> actually want to kick kswapd on the pmem node?

In my mental model, cold data flows from:

DRAM -> PMEM -> swap

Kicking kswapd here ensures that while we're doing DRAM->PMEM migrations
for kinda cold data, kswapd can be working on doing the PMEM->swap part
on really cold data.
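
If we ever decided we did not want the wakeup, the change would amount
to masking one bit out of the allocation flags. A minimal userspace
sketch of that choice follows. The DEMO_* bit values and both function
names are invented for illustration and are not the kernel's
definitions; the only thing modelled is David's point that GFP_NOWAIT
carries the kswapd-wakeup bit:

```c
typedef unsigned int demo_gfp_t;

/* Illustrative bit values only; these are NOT the kernel's definitions. */
#define DEMO_KSWAPD_RECLAIM	0x01u
#define DEMO_NOWARN		0x02u
#define DEMO_NORETRY		0x04u
#define DEMO_NOMEMALLOC		0x08u
#define DEMO_THISNODE		0x10u
#define DEMO_HIGHMEM		0x20u
#define DEMO_MOVABLE		0x40u

/* Models GFP_NOWAIT implying a kswapd wakeup on the target node. */
#define DEMO_NOWAIT		DEMO_KSWAPD_RECLAIM

/* Build the demotion allocation mask, with or without the kswapd wakeup. */
demo_gfp_t demote_mask(int wake_kswapd)
{
	demo_gfp_t mask = DEMO_NOWAIT | DEMO_NOWARN | DEMO_NORETRY |
			  DEMO_NOMEMALLOC | DEMO_THISNODE | DEMO_HIGHMEM |
			  DEMO_MOVABLE;

	/* Dropping the bit keeps the fail-fast behavior but skips the wakeup. */
	if (!wake_kswapd)
		mask &= ~DEMO_KSWAPD_RECLAIM;
	return mask;
}

/* Return 1 if an allocation with 'mask' would wake kswapd, else 0. */
int mask_wakes_kswapd(demo_gfp_t mask)
{
	return (mask & DEMO_KSWAPD_RECLAIM) != 0;
}
```

Under the DRAM -> PMEM -> swap model above, the variant that keeps the
bit set is the one I want: the wakeup is what lets the PMEM->swap stage
run concurrently with the DRAM->PMEM stage.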

...
>> @@ -1229,6 +1230,30 @@ static unsigned long shrink_page_list(st
>> 			; /* try to reclaim the page below */
>> 		}
>>
>> +		rc = migrate_demote_mapping(page);
>> +		/*
>> +		 * -ENOMEM on a THP may indicate either migration is
>> +		 * unsupported or there was not enough contiguous
>> +		 * space. Split the THP into base pages and retry the
>> +		 * head immediately. The tail pages will be considered
>> +		 * individually within the current loop's page list.
>> +		 */
>> +		if (rc == -ENOMEM && PageTransHuge(page) &&
>> +		    !split_huge_page_to_list(page, page_list))
>> +			rc = migrate_demote_mapping(page);
>> +
>> +		if (rc == MIGRATEPAGE_SUCCESS) {
>> +			unlock_page(page);
>> +			if (likely(put_page_testzero(page)))
>> +				goto free_it;
>> +			/*
>> +			 * Speculative reference will free this page,
>> +			 * so leave it off the LRU.
>> +			 */
>> +			nr_reclaimed++;
>
> nr_reclaimed += nr_pages instead?

Oh, good catch. I also need to go double-check that 'nr_pages' isn't
wrong elsewhere because of the split.
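
For the record, the difference between the two accounting styles can be
shown with a small userspace sketch. The struct and function names are
made up, and 512 base pages per THP is only an example size. It also
shows why the split matters: once a THP has been split, the head is a
single base page, so a stale 'nr_pages' would overcount:

```c
#include <stddef.h>

/* Hypothetical stand-in for a page on the reclaim list. */
struct demo_page {
	unsigned long nr_pages;	/* base pages covered: 1, or e.g. 512 for a THP */
	int demoted;		/* 1 if the demotion migration succeeded */
};

/* Model a successful THP split: the head now covers one base page. */
void demo_split(struct demo_page *page)
{
	page->nr_pages = 1;
}

/*
 * Sum reclaimed pages over the list. With per_base_page == 0 this
 * mimics 'nr_reclaimed++'; otherwise it mimics 'nr_reclaimed += nr_pages'.
 */
unsigned long count_reclaimed(const struct demo_page *pages, size_t n,
			      int per_base_page)
{
	unsigned long nr_reclaimed = 0;

	for (size_t i = 0; i < n; i++) {
		if (!pages[i].demoted)
			continue;
		nr_reclaimed += per_base_page ? pages[i].nr_pages : 1;
	}
	return nr_reclaimed;
}
```

For a list holding one demoted base page, one demoted 512-page THP and
one page that failed to demote, the '++' style reports 2 while the
'+= nr_pages' style reports 513. If the THP had been split first, both
styles would agree on 2 for those entries, provided the count is read
after the split rather than before it.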