Re: [PATCH] sched/numa: advanced per-cgroup numa statistic

From: Michael Wang
Date: Fri Nov 01 2019 - 07:52:23 EST




On 2019/11/1 5:13, Mel Gorman wrote:
[snip]
>> For example, in our case we can have hundreds of cgroups, each containing
>> hundreds of tasks, and these worker threads may live and die at any moment.
>> To gather the data we have to list the tasks and then read their proc files
>> one by one, which enters the kernel repeatedly and may even need to take
>> locks. This introduces a big latency impact and still gives inaccurate
>> output, since a task may already have died before we read its data.
>>
>> Data for tasks that died during a sample window can then no longer be
>> acquired before the next window.
>>
>> We need the kernel's help to preserve this data, since the tool cannot
>> catch it before it is lost, and we also have to avoid frequent proc
>> reading, which is expensive and adds big latency to every sample window.
>>
>
> There is somewhat of a disconnect here. You say that the information must
> be accurate and historical yet are relying on NUMA hinting faults to build
> the picture which may not be accurate at all given that faults are not
> guaranteed to happen. For short-lived tasks, it is also potentially skewed
> information if short-lived tasks dominated remote accesses for whatever
> reason even though it does not matter -- the tasks were short-lived and
> their performance is probably irrelevant. Short-lived tasks may not even
> show up if they do not run longer than sysctl_numa_balancing_scan_delay
> so the data gathered already has holes in it.
>
> While it's a bit more of a stretch, even this could still be done from
> userspace if numa_hint_fault was probed and the event handled (eBPF,
> systemtap etc) to build the picture, or if a tracepoint were added. That
> would give a much higher degree of flexibility on what information is
> tracked and on how it is aggregated.
>
> So, overall I think this can be done outside the kernel but recognise
> that it may not be suitable in all cases. If you feel it must be done
> inside the kernel, split out the patch that adds information on failed
> page migrations as it stands apart. Put it behind its own kconfig entry
> that is disabled by default -- do not tie it directly to NUMA balancing
> because of the data structure changes. When enabled, it should still be
> disabled by default at runtime and only activated via kernel command line
> parameter so that the only people who pay the cost are those that take
> deliberate action to enable it.

Agree, we could expose the per-task faults info there, which gives the
possibility to build a practical userland tool, while keeping the kernel
numa stats disabled by default. Folks who have no such tool but want easy
monitoring can just turn on the switch :-)
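
For the userland side, something like the sketch below could count hinting
faults per process by attaching a kprobe to task_numa_fault() (assuming that
is a suitable hook; a dedicated tracepoint would be cleaner). The map layout,
the per-cgroup aggregation on top of the per-tgid counts and the libbpf
loader side are all left open, this is only a rough sketch:

/*
 * numa_fault_count.bpf.c - rough sketch only, file and map names are
 * made up. Counts NUMA hinting faults per tgid by kprobing
 * task_numa_fault(); built with clang -target bpf against libbpf.
 */
#include <linux/bpf.h>
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);
	__type(key, __u32);	/* tgid */
	__type(value, __u64);	/* hinting fault count */
} numa_faults SEC(".maps");

SEC("kprobe/task_numa_fault")
int count_numa_fault(void *ctx)
{
	__u32 tgid = bpf_get_current_pid_tgid() >> 32;
	__u64 *cnt, one = 1;

	cnt = bpf_map_lookup_elem(&numa_faults, &tgid);
	if (cnt)
		__sync_fetch_and_add(cnt, 1);
	else
		bpf_map_update_elem(&numa_faults, &tgid, &one, BPF_ANY);

	return 0;
}

char _license[] SEC("license") = "GPL";

The tool would still need to map tgids to cgroups and snapshot the counts
before tasks exit, which is exactly the part that gets tricky without
kernel help.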

Will have these in the next version:

* separate patch for exposing the per-task faults info
* new CONFIG for the numa stat (disabled by default)
* dynamic runtime switch for the numa stat (disabled by default, see the
  sketch below)
* doc to explain the numa stat and give hints on how to use it
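
For the switch itself, following Mel's suggestion of a dedicated Kconfig
entry plus a command line parameter, something like the below is what I
have in mind (CONFIG_CGROUP_NUMA_STAT, sched_numa_stat and the "numa_stat"
parameter are placeholder names, not final):

/*
 * Illustrative only, names are placeholders. The key is off by default,
 * so nobody pays the accounting cost unless they ask for it.
 */
#ifdef CONFIG_CGROUP_NUMA_STAT

#include <linux/jump_label.h>
#include <linux/init.h>

DEFINE_STATIC_KEY_FALSE(sched_numa_stat);

/* Enable with "numa_stat" on the kernel command line. */
static int __init sched_numa_stat_setup(char *str)
{
	static_branch_enable(&sched_numa_stat);
	return 1;
}
__setup("numa_stat", sched_numa_stat_setup);

#endif /* CONFIG_CGROUP_NUMA_STAT */

The accounting in the fault/tick path would then be guarded by
static_branch_unlikely(&sched_numa_stat), so the disabled case costs only a
patched-out branch.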

Best Regards,
Michael Wang