Re: [PATCH 04/11] mm/vmstat: Convert NUMA statistics to basic NUMA counters

From: Mel Gorman
Date: Wed Apr 14 2021 - 11:18:57 EST


On Wed, Apr 14, 2021 at 02:56:45PM +0200, Vlastimil Babka wrote:
> On 4/7/21 10:24 PM, Mel Gorman wrote:
> > NUMA statistics are maintained on the zone level for hits, misses, foreign
> > etc but nothing relies on them being perfectly accurate for functional
> > correctness. The counters are used by userspace to get a general overview
> > of a workload's NUMA behaviour but the page allocator incurs a high cost to
> > maintain perfect accuracy similar to what is required for a vmstat like
> > NR_FREE_PAGES. There even is a sysctl vm.numa_stat to allow userspace to
> > turn off the collection of NUMA statistics like NUMA_HIT.
> >
> > This patch converts NUMA_HIT and friends to be NUMA events with similar
> > accuracy to VM events. There is a possibility that slight errors will be
> > introduced but the overall trend as seen by userspace will be similar.
> > Note that while these counters could be maintained at the node level,
> > doing so would have a user-visible impact.
>
> I guess this kind of inaccuracy is fine. I just don't like much
> fold_vm_zone_numa_events() which seems to calculate sums of percpu counters and
> then assign the result to zone counters for immediate consumption, which differs
> from other kinds of folds in vmstat that reset the percpu counters to 0 as they
> are treated as diffs to the global counters.
>

The counters that are diffs fit inside an s8 and they are kept limited
because their "true" value is sometimes critical -- e.g. NR_FREE_PAGES
for watermark checking. So the level of drift has to be controlled, and
the drift must not be allowed to persist indefinitely, so the diffs are
folded into the global counters periodically.
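
As a rough userspace illustration of the diff-style counters described
above (this is not the kernel code; NR_CPUS, STAT_THRESHOLD and the
function names are made up for the sketch, and the real threshold is
calculated rather than fixed):

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS        4   /* hypothetical CPU count for the sketch */
#define STAT_THRESHOLD 32  /* hypothetical fixed threshold */

static long   global_count;       /* the "true" counter, e.g. free pages */
static int8_t cpu_diff[NR_CPUS];  /* per-CPU delta, small enough for an s8 */

/* Apply a small delta (assumed to be +1 or -1) on behalf of one CPU. */
static void mod_counter(int cpu, int delta)
{
	int x;

	assert(cpu >= 0 && cpu < NR_CPUS);
	x = cpu_diff[cpu] + delta;

	/* Once the local diff exceeds the threshold, push it to the global. */
	if (x > STAT_THRESHOLD || x < -STAT_THRESHOLD) {
		global_count += x;
		x = 0;
	}
	cpu_diff[cpu] = (int8_t)x;
}

/* Periodic fold: drain every per-CPU diff so drift cannot linger. */
static void fold_diffs(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		global_count += cpu_diff[cpu];
		cpu_diff[cpu] = 0;
	}
}

/* Worst-case difference between global_count and the true value. */
static long max_drift(void)
{
	return (long)NR_CPUS * STAT_THRESHOLD;
}
```

The point of the sketch is that the error in global_count is bounded by
max_drift() at any time and returns to zero after each fold_diffs().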

The inaccurate counters are only exported to userspace. There is no need
to update them every few seconds so fold_vm_zone_numa_events() is only
called when a user cares, but you raise a valid point below.

> So it seems that this intermediate assignment to zone counters (using
> atomic_long_set() even) is unnecessary and this could mimic sum_vm_events() that
> just does the summation on a local array?
>

The atomic is unnecessary for sure but using a local array is
problematic because of your next point.

> And probably a bit more serious is that vm_events have vm_events_fold_cpu() to
> deal with a cpu going away, but after your patch the stats counted on a cpu just
> disappear from the sums as it goes offline as there's no such thing for the numa
> counters.
>

That is a problem I missed. Even if zonestats were preserved on
hot-remove, fold_vm_zone_numa_events() would not be reading the offline
CPU, so the counters would jump all over the place across hotplug events.

So some periodic folding is necessary. I would still prefer not to do it
by time but it could be done only on overflow or when a file like
/proc/vmstat is read. I'll think about it a bit more and see what I come
up with.
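
To make the idea concrete, here is a minimal userspace sketch of folding
on read and on CPU offline. It is only an illustration under assumed
names (numa_hit_pcpu, fold_cpu(), read_numa_hit() and cpu_goes_offline()
are all hypothetical), not what the next version of the patch will
necessarily look like, and it ignores locking and overflow entirely:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4  /* hypothetical CPU count for the sketch */

static unsigned long numa_hit_total;          /* folded, long-lived sum */
static unsigned long numa_hit_pcpu[NR_CPUS];  /* cheap per-CPU events */
static bool cpu_online[NR_CPUS] = { true, true, true, true };

/* Fast path: bump the local counter, no shared state touched. */
static void count_numa_hit(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	numa_hit_pcpu[cpu]++;
}

/* Move one CPU's events into the total and reset the local counter. */
static void fold_cpu(int cpu)
{
	numa_hit_total += numa_hit_pcpu[cpu];
	numa_hit_pcpu[cpu] = 0;
}

/* Fold only when someone reads the value, e.g. a /proc/vmstat read. */
static unsigned long read_numa_hit(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_online[cpu])
			fold_cpu(cpu);
	return numa_hit_total;
}

/* Hotplug path: fold before the CPU disappears so nothing is lost. */
static void cpu_goes_offline(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	fold_cpu(cpu);
	cpu_online[cpu] = false;
}
```

Because the per-CPU counters are reset when folded, a CPU going away
after cpu_goes_offline() no longer changes what a later read reports.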

Thanks!

--
Mel Gorman
SUSE Labs