Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration

From: Fengguang Wu
Date: Fri Dec 28 2018 - 04:42:24 EST


On Fri, Dec 28, 2018 at 09:41:05AM +0100, Michal Hocko wrote:
On Fri 28-12-18 13:08:06, Wu Fengguang wrote:
[...]
Optimization: do hot/cold page tracking and migration
=====================================================

Since PMEM is slower than DRAM, we need to make sure hot pages go to
DRAM and cold pages stay in PMEM, to get the best out of PMEM and DRAM.

- DRAM=>PMEM cold page migration

It can be done in the kernel page reclaim path, near the anonymous page
swap out point. Instead of swapping out, we now have the option to
migrate cold pages to PMEM NUMA nodes.

OK, this makes sense to me except I am not sure this is something that
should be pmem specific. Is there any reason why we shouldn't migrate
pages on memory pressure to other nodes in general? In other words,
rather than paging out we would migrate over to the next node that is
not under memory pressure. Swapout would be the next level when the
memory is (almost) fully utilized. That wouldn't be pmem specific.

In the future there could be multiple memory levels with different
performance/size/cost metrics. There is ongoing HMAT work to describe
that. When it is ready, we can switch to the HMAT based general
infrastructure. Then the code will no longer be PMEM specific, but will
do general promotion/demotion migrations between high/low memory levels.
Swapout could then be done from the lowest memory level.
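
To make the level idea concrete, here is a minimal user-space sketch in
C. It is not code from this patchset: the level enum, the watermark
fields and demotion_target() are all hypothetical names, and the policy
(try the next lower level that is not under pressure, else swap out) is
only one possible reading of the discussion above.

```c
#include <stddef.h>

/* Hypothetical memory levels, ordered from fastest to slowest. */
enum mem_level { LEVEL_DRAM = 0, LEVEL_PMEM = 1, NR_LEVELS = 2 };

/* Hypothetical marker: no lower level can take the page, so swap it out. */
#define TARGET_SWAP (-1)

/* Assumed per-level state; a real kernel would use zone watermarks. */
struct level_state {
	unsigned long free_pages;
	unsigned long low_watermark;
};

static int level_under_pressure(const struct level_state *s)
{
	return s->free_pages < s->low_watermark;
}

/*
 * Pick a destination for a cold page currently in level 'from':
 * the nearest lower level that is not under memory pressure,
 * or TARGET_SWAP when every lower level is under pressure
 * (or when 'from' is already the lowest level).
 */
static int demotion_target(const struct level_state levels[],
			   int nr_levels, int from)
{
	for (int lvl = from + 1; lvl < nr_levels; lvl++)
		if (!level_under_pressure(&levels[lvl]))
			return lvl;
	return TARGET_SWAP;
}
```

With only DRAM and PMEM this reduces to "demote to PMEM, swap out from
PMEM", but the loop does not change if more levels are added later.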

Migration between peer nodes is the obvious simple way and a good
choice for the initial implementation. But yeah, it's possible to
migrate to other nodes. For example, it can be combined with NUMA
balancing: if we know the page is mostly accessed by the other socket,
then it'd be best to migrate hot/cold pages directly to that socket.

User space may also do it, but it cannot act on demand when there
is memory pressure in DRAM nodes.

- PMEM=>DRAM hot page migration

While LRU can be good enough for identifying cold pages, frequency
based accounting can be more suitable for identifying hot pages.
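
As a rough illustration of what frequency based accounting means here,
the following C sketch counts in how many scan rounds each page was seen
accessed and then picks the pages at or above a threshold. It is a toy
model, not the daemon's code: NR_PAGES, struct hotness and both helpers
are hypothetical, and how the per-round access samples are gathered is
deliberately left out.

```c
#include <stddef.h>

#define NR_PAGES 8	/* tiny hypothetical PMEM region */

/* Hypothetical per-page access counters kept across scan rounds. */
struct hotness {
	unsigned int count[NR_PAGES];
};

/* One scan round: accessed[i] is nonzero if page i was seen referenced. */
static void account_round(struct hotness *h,
			  const unsigned char accessed[NR_PAGES])
{
	for (size_t i = 0; i < NR_PAGES; i++)
		if (accessed[i])
			h->count[i]++;
}

/*
 * Store the indices of pages seen accessed in at least 'threshold'
 * rounds into out[] and return how many were found. These would be
 * the candidates for PMEM=>DRAM migration.
 */
static size_t pick_hot_pages(const struct hotness *h,
			     unsigned int threshold,
			     size_t out[NR_PAGES])
{
	size_t n = 0;

	for (size_t i = 0; i < NR_PAGES; i++)
		if (h->count[i] >= threshold)
			out[n++] = i;
	return n;
}
```

Unlike an LRU position, the counter separates a page touched in every
round from one touched once recently, which is the property wanted for
promotion decisions.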

Our design choice is to create a flexible user space daemon to drive
the accounting and migration, with the necessary kernel support
provided by this patchset.

We do have numa balancing, why can't we rely on it? This along with the
above would allow us to have pmem numa nodes (cpuless nodes in fact)
without any special casing and as a natural part of the MM. It would
only be a matter of configuration to set the appropriate distance to
allow a reasonable allocation fallback strategy.

Good question. We actually tried reusing the NUMA balancing mechanism to
do page-fault triggered migration. move_pages() only calls
change_prot_numa(). It turns out the 2 migration types have different
purposes (one for hotness, another for home node) and hence different
implementation details. We ended up modifying a few pieces of NUMA
balancing logic -- removing rate limiting, changing the target node
logic, etc.

Those look like unnecessary complexity for this posting. This v2 patchset
mainly fulfills our first milestone goal: a minimal viable solution
that's relatively clean to backport. Even when preparing for new
upstreamable versions, it may be good to keep it simple for the
initial upstream inclusion.

I haven't looked at the implementation yet but if you are proposing a
special cased zone lists then this is something CDM (Coherent Device
Memory) was trying to do two years ago and there was quite some
skepticism in the approach.

It looks like we are pretty different from CDM. :)
We are creating new NUMA nodes rather than a new ZONE as CDM did.
The zonelists modification is just to make PMEM nodes more separated.

Thanks,
Fengguang