Re: [PATCH -V8 02/10] mm/numa: automatically generate node migration order

From: Huang, Ying
Date: Mon Jun 21 2021 - 21:14:36 EST


Zi Yan <ziy@xxxxxxxxxx> writes:

> On 19 Jun 2021, at 4:18, Huang, Ying wrote:
>
>> Zi Yan <ziy@xxxxxxxxxx> writes:
>>
>>> On 18 Jun 2021, at 2:15, Huang Ying wrote:

[snip]

>>>> +/*
>>>> + * When memory fills up on a node, memory contents can be
>>>> + * automatically migrated to another node instead of
>>>> + * discarded at reclaim.
>>>> + *
>>>> + * Establish a "migration path" which will start at nodes
>>>> + * with CPUs and will follow the priorities used to build the
>>>> + * page allocator zonelists.
>>>> + *
>>>> + * The difference here is that cycles must be avoided. If
>>>> + * node0 migrates to node1, then neither node1, nor anything
>>>> + * node1 migrates to can migrate to node0.
>>>> + *
>>>> + * This function can run simultaneously with readers of
>>>> + * node_demotion[]. However, it can not run simultaneously
>>>> + * with itself. Exclusion is provided by memory hotplug events
>>>> + * being single-threaded.
>>>> + */
>>>> +static void __set_migration_target_nodes(void)
>>>> +{
>>>> +	nodemask_t next_pass = NODE_MASK_NONE;
>>>> +	nodemask_t this_pass = NODE_MASK_NONE;
>>>> +	nodemask_t used_targets = NODE_MASK_NONE;
>>>> +	int node;
>>>> +
>>>> +	/*
>>>> +	 * Avoid any oddities like cycles that could occur
>>>> +	 * from changes in the topology. This will leave
>>>> +	 * a momentary gap when migration is disabled.
>>>> +	 */
>>>> +	disable_all_migrate_targets();
>>>> +
>>>> +	/*
>>>> +	 * Ensure that the "disable" is visible across the system.
>>>> +	 * Readers will see either a combination of before+disable
>>>> +	 * state or disable+after. They will never see before and
>>>> +	 * after state together.
>>>> +	 *
>>>> +	 * The before+after state together might have cycles and
>>>> +	 * could cause readers to do things like loop until this
>>>> +	 * function finishes. This ensures they can only see a
>>>> +	 * single "bad" read and would, for instance, only loop
>>>> +	 * once.
>>>> +	 */
>>>> +	smp_wmb();
>>>> +
>>>> +	/*
>>>> +	 * Allocations go close to CPUs, first. Assume that
>>>> +	 * the migration path starts at the nodes with CPUs.
>>>> +	 */
>>>> +	next_pass = node_states[N_CPU];
>>>
>>> Is there a plan to allow users to change where the migration
>>> path starts? Or, one step further, to provide an interface
>>> that lets users specify the demotion path. Something like
>>> /sys/devices/system/node/node*/node_demotion.
>>
>> I don't think that's necessary at least for now. Do you know any real
>> world use case for this?
>
> In our P9+Volta system, GPU memory is exposed as a NUMA node.
> For GPU workloads with data sizes greater than the GPU memory size,
> it would be very helpful to allow pages in GPU memory to be
> migrated/demoted to CPU memory. With your current assumption,
> GPU memory -> CPU memory demotion does not seem possible, right?
> The same applies to any system with device memory exposed as a
> NUMA node, running workloads on the device and using CPU memory
> as a lower memory tier than the device memory.

Thanks a lot for sharing your use case! A user-specified demotion path
does appear to be one way to satisfy your requirement, and I think it
would be possible to enable that on top of this patchset. But we have
no specific plan to work on it, at least for now.
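For reference, the cycle avoidance in the quoted
__set_migration_target_nodes() can be sketched in plain userspace C.
This is a simplified illustration, not the kernel code: the nodemask_t
masks become small arrays, the zonelist-priority target search is
replaced by "lowest-numbered unused node", and the names
set_migration_target_nodes() and has_cpu[] are made up for the sketch.

```c
#include <assert.h>

#define MAX_NODES    4
#define NUMA_NO_NODE (-1)

/* Demotion target for each node; NUMA_NO_NODE for terminal nodes. */
static int node_demotion[MAX_NODES];

/*
 * Build the demotion order the way the quoted function does:
 * start a pass at the nodes with CPUs, give each source a target
 * that has never been a source or a target before, then repeat
 * with this pass's targets as the next pass's sources.
 */
static void set_migration_target_nodes(const int has_cpu[MAX_NODES])
{
	int used[MAX_NODES] = { 0 };
	int this_pass[MAX_NODES], next_pass[MAX_NODES];
	int n_this = 0, n_next, i;

	for (i = 0; i < MAX_NODES; i++)
		node_demotion[i] = NUMA_NO_NODE;

	/* The first pass starts at the nodes with CPUs. */
	for (i = 0; i < MAX_NODES; i++)
		if (has_cpu[i])
			this_pass[n_this++] = i;

	while (n_this) {
		/*
		 * Anything that was ever a source can never become a
		 * target later; this is what rules out cycles.
		 */
		for (i = 0; i < n_this; i++)
			used[this_pass[i]] = 1;

		n_next = 0;
		for (i = 0; i < n_this; i++) {
			int t;

			/*
			 * Stand-in for the kernel's zonelist-priority
			 * search: take the lowest-numbered unused node.
			 */
			for (t = 0; t < MAX_NODES && used[t]; t++)
				;
			if (t == MAX_NODES)
				continue;	/* terminal node */
			node_demotion[this_pass[i]] = t;
			used[t] = 1;
			next_pass[n_next++] = t;
		}
		for (i = 0; i < n_next; i++)
			this_pass[i] = next_pass[i];
		n_this = n_next;
	}
}
```

With a CPU only on node 0, this produces the chain 0 -> 1 -> 2 -> 3
with node 3 terminal; no node ever demotes back toward a node that
already appeared earlier in a pass, so no cycle can form.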

Best Regards,
Huang, Ying