Re: [RFC PATCH V1 1/1] sched/numa: Enhance vma scanning logic

From: Kalra, Ashish
Date: Mon Feb 20 2023 - 19:38:22 EST


Hello Mingwei, Sean,

Looking forward to your thoughts and feedback on the MMU invalidation notifier issues with SEV guests described below.

Thanks,
Ashish

On 1/17/2023 10:43 PM, Bharata B Rao wrote:
On 1/17/2023 8:29 PM, Mel Gorman wrote:
Note that the cc list is excessive for the topic.

(I wasn't sure about pruning the CC list mid-thread, hence I'm continuing with it.)

<snip>


This is a build-tested-only prototype to illustrate how a VMA could track
NUMA balancing state. It starts by applying the scan delay to every VMA
instead of every task, to avoid scanning new or very short-lived VMAs. I
went back to my old notes on how I hoped to reduce excessive scanning in
NUMA balancing; this happened to be second on my list and was straightforward
to prototype in a few minutes.

While we are on the topic of making NUMA balancer scanning more relevant, the following
additional points may be worth noting:

Recently, there have been reports of NUMA-balancing-induced scanning and the
resulting MMU notifier invalidations causing problems in different scenarios.

1. Currently, NUMA balancing does not check at scan time whether a page (or a VMA)
is non-migratable because the page (or the address range) is pinned. It goes ahead
with MMU invalidation notifications and changes the PTE protection to PAGE_NONE,
only to realize later that the pinned pages can't be migrated, after which it
reinstalls the original PTE.

This was found to cause issues for SEV guests, whose pages are all pinned.
It was discussed here: https://lore.kernel.org/all/20220927000729.498292-1-Ashish.Kalra@xxxxxxx/

We could probably use page_maybe_dma_pinned() to determine whether a page is long-term
pinned and skip the MMU invalidation and protection change for such a page.
However, we would then have to do per-page invalidations (as opposed to the single
PMD-range invalidation that is done currently), which is probably not desirable.

Also, MMU invalidations are expected to be issued from a sleepable context (except,
AFAICS, for the OOM notification, which uses the non-blocking version). This makes it
difficult to check the pinned state of the page prior to MMU invalidation. Some of
this is discussed here: https://lore.kernel.org/linux-arm-kernel/YuEMkKY2RU%2F2KiZW@monolith.localdoman/

The current patchset, which attempts to restrict scanning to relevant VMAs, may
partially help the above case, but are there any ideas on addressing this issue
comprehensively? Ideally, we would clearly identify such non-migratable (long-term
pinned) pages and exclude them entirely from scanning and protection changes.

2. Applications that run on GPUs may want to avoid NUMA balancing activity
completely, and they would benefit from per-process enabling/disabling of NUMA balancing.
The patchset that helps here (which was motivated by a different use case for
per-process control) is: https://lore.kernel.org/all/49ed07b1-e167-7f94-9986-8e86fb60bb09@xxxxxxxxxx/

Improvements that make scanning more relevant can help this case to an extent,
but per-process NUMA balancing control would still be a useful knob to have.

Regards,
Bharata.