Re: [PATCH 2/2] mm/vmstat: Protect per cpu variables with preempt disable on RT

From: Mel Gorman
Date: Wed Aug 04 2021 - 10:23:32 EST


On Wed, Aug 04, 2021 at 03:42:25PM +0200, Vlastimil Babka wrote:
> > Each counter
> > would need to be audited and two question are asked
> >
> > o If this counter is inaccurate, does anything break?
> > o If this counter is inaccurate, does it both increment and decrement
> > allowing the possibility it goes negative?
> >
> > The decision on that is looking at the counter and seeing if any
> > functional decision is made based on its value. So two examples;
> >
> > NR_VMSCAN_IMMEDIATE is a node-based counter that only every
> > increments and is reported to userspace. No kernel code makes
> > any decision based on its value. Therefore it's likely safe to
> > move to numa_stat_item instead.
> >
> > Action: move it
> >
> > WORKINGSET_ACTIVATE_FILE is a node-based counter that is used to
> > determine if a mem cgroup is potentially congested by looking at
> > the ratio of cgroup to node refault rates as well as deciding if
> > LRU file pages should be deactivate. If that value drifts, the
> > ratios are miscalculated and could lead to functional oddities
> > and therefore must be accurate.
> >
> > Action: leave it alone
> >
> > I guess it could be further split into state that must be accurate from
> > IRQ and non-IRQ contexts but that probably would be very fragile and
> > offer limited value.
> >
> >> Brilliant stuff which prevents you to do any validation on this. Over
> >> the years there have been several issues where callers had to be fixed
> >> by analysing bug reports instead of having a proper instrumentation in
> >> that code which would have told the developer that he got it wrong.
> >>
> >
> > I'm not sure it could be validated at build-time but I'm just back from
> > holiday and may be lacking imagination.
>
> The idea was not build-time, but runtime (hidden behind lockdep, VM_DEBUG or
> whatnot), i.e.:
>
> <sched_expert> what that code needs is switch(item) { case foo1: case foo2:
> lockdep_assert_irqs_disabled(); break; case bar1: case bar2:
> lockdep_assert_preempt_disabled(); lockdep_assert_no_in_irq(); break; } or
> something along those lines
>

Ok, that would potentially work. It may not even need to split the stats
into different enums. Simply document which stats need protection from
IRQ or preemption and use PROVE_LOCKING to check if preemption or IRQs
are disabled depending on the kernel config. I don't think it gets rid
of preempt_disable_rt unless the API was completely reworked with entry
points that describe the locking requirements. That would be tricky
because the requirements differ between kernel configurations.

This would be independent of moving stats that do not need to be 100%
accurate to the inaccurate-but-fast API.

--
Mel Gorman
SUSE Labs