Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing

From: Bharata B Rao
Date: Thu Feb 09 2023 - 00:57:59 EST


On 2/8/2023 11:33 PM, Peter Zijlstra wrote:
> On Wed, Feb 08, 2023 at 01:05:28PM +0530, Bharata B Rao wrote:
>
>> - Perf uses IBS and we are using the same IBS for access profiling here.
>> There needs to be a proper way to make the use mutually exclusive.
>
> No, IFF this lives it needs to use in-kernel perf.

In fact I started out with in-kernel perf by using the
perf_event_create_kernel_counter() API. However there are issues
with using in-kernel perf:

- We want to reprogram the counter potentially during every
context switch. The IBS hardware sample counter needs to be
reprogrammed based on the incoming thread's view of sample period.
Additionally sampling needs to be disabled for kernel threads.
So I wanted to use perf_event_enable/disable() and
perf_event_period(). However they take mutexes and hence it is
not possible to use them from the sched switch atomic context.

- In-kernel perf gives a per-cpu counter, but we want it to count
based on the task that is currently running. I,e., the period should
be modified on per-task basis. I don't see how an in-kernel perf
event counter can be associated with per-task like this.

Hence I didn't see an easy option other than making the use
of IBS in perf and NUMA balancing mutually exclusive.

>
>> - Is tying this up with NUMA balancing a reasonable approach or
>> should we look at a completely new approach?
>
> Is it giving sufficient win to be worth it, afaict it doesn't come even
> close to justifying it.
>
>> - Hardware provided access information could be very useful for driving
>> hot page promotion in tiered memory systems. Need to check if this
>> requires different tuning/heuristics apart from what NUMA balancing
>> already does.
>
> I think Huang Ying looked at that from the Intel POV and I think the
> conclusion was that it doesn't really work out. What you need is
> frequency information, but the PMU doesn't really give you that. You
> need to process a *ton* of PMU data in-kernel.

What I am doing here is to feed the access data into NUMA balancing which
already has the logic to aggregate that at task and numa group level and
decide if that access is actionable in terms of migrating the page. In this
context, I am not sure about the frequency information that you and Dave
are mentioning. AFAIU, existing NUMA balancing takes care of taking
action, IBS becomes an alternative source of access information to NUMA
hint faults.

Thanks for your inputs.

Regards,
Bharata.