Re: [PATCH] sched/numa: advanced per-cgroup numa statistic

From: Mel Gorman
Date: Fri Nov 01 2019 - 05:13:58 EST


On Fri, Nov 01, 2019 at 09:49:20AM +0800, ?????? wrote:
>
>
> On 2019/10/31 ??????9:17, Mel Gorman wrote:
> > On Thu, Oct 31, 2019 at 11:31:24AM +0800, ?????? wrote:
> [snip]
> >> For example, tasks bind to the cpus of node_0 could have their memory on
> >> node_1 when node_0 almost run out of memory, numa balancing may not be
> >> able to help in this case, while by reading locality we could know how
> >> critical the problem is, and may take action to rebind cpus to node_1 or
> >> reclaim the memory of node_0.
> >>
> >
> > You can already do this by walking each cgroup, identifying what tasks are
> > in it and look at /proc/PID/numa_maps and /proc/PID/status to see what
> > CPU bindings if any exist. This would use the actual memory placements
> > and not those sampled by NUMA balancing, would not require NUMA balancing
> > and would work on older kernels. It would be costly to access so I would
> > not suggest doing it at high frequency but it makes sense for the tool
> > that cares to pay the cost instead of spreading tidy bits of cost to
> > every task whether there is an interested tool or not.
>
> I see the biggest concern now is the necessity to let kernel providing these
> data, IMHO there are actually good reasons here:
> * there are too many tasks to gathering data from, reading proc cost a lot
> * tasks living and dying, data lost between each sample window
>
> For example in our cases, we could have hundreds of cgroups, each contain
> hundreds of tasks, these worker thread could live and die at any moment,
> for gathering we need to cat the list of tasks and then go reading these proc
> files one by one, which fall into kernel rapidly and may even need to holding
> some locks, this introduced big latency impact, and give no accurate output
> since some task may already died before reading it's data.
>
> Then before next sample window, info of tasks died during the window can't be
> acquired anymore.
>
> We need kernel's help on reserving data since tool can't catch them in time
> before they are lost, also we have to avoid rapidly proc reading, which really
> cost a lot and further more, introduce big latency on each sample window.
>

There is somewhat of a disconnect here. You say that the information must
be accurate and historical yet are relying on NUMA hinting faults to build
the picture which may not be accurate at all given that faults are not
guaranteed to happen. For short-lived tasks, it is also potentially skewed
information if short-lived tasks dominated remote accesses for whatever
reason even though it does not matter -- the tasks were short-lived and
their performance is probably irrelevant. Short-lived tasks may not even
show up if they do not run longer than sysctl_numa_balancing_scan_delay
so the data gathered already has holes in it.

While it's a bit more of a stretch, even this could still be done from
userspace if numa_hint_fault was probed and the event handled (eBPF,
systemtap etc) to build the picture or add a tracepoint. That would give
a much higher degree of flexibility on what information is tracked and
allow flexibility on

So, overall I think this can be done outside the kernel but recognise
that it may not be suitable in all cases. If you feel it must be done
inside the kernel, split out the patch that adds information on failed
page migrations as it stands apart. Put it behind its own kconfig entry
that is disabled by default -- do not tie it directly to NUMA balancing
because of the data structure changes. When enabled, it should still be
disabled by default at runtime and only activated via kernel command line
parameter so that the only people who pay the cost are those that take
deliberate action to enable it.

--
Mel Gorman
SUSE Labs