Re: [PATCH v13 2/6] mm/vmstat: Use vmstat_dirty to track CPU-specific vmstat discrepancies

From: Marcelo Tosatti
Date: Tue Jan 10 2023 - 15:11:56 EST


On Tue, Jan 10, 2023 at 02:39:08PM +0100, Christoph Lameter wrote:
> On Tue, 10 Jan 2023, Frederic Weisbecker wrote:
>
> > Note I'm absolutely clueless with vmstat. But I was wondering about it as well
> > while reviewing Marcelo's series, so git blame pointed me to:
> >
> > 7c83912062c801738d7d19acaf8f7fec25ea663c ("vmstat: User per cpu atomics to avoid
> > interrupt disable / enable")
> >
> > And this seem to mention that this can race with IRQs as well, hence the local
> > cmpxchg operation.
>
> The race with irq could be an issue but I thought we avoided that and were
> content with disabling preemption.
>
> But this issue illustrates the central problem of the patchset: It makes
> the lightweight counters not so lightweight anymore.

https://lkml.iu.edu/hypermail/linux/kernel/0903.2/00569.html

With added

static void do_test_preempt(void)
{
	unsigned long flags;
	unsigned int i;
	cycles_t time1, time2, time;
	u32 rem;

	local_irq_save(flags);
	preempt_disable();
	time1 = get_cycles();
	for (i = 0; i < NR_LOOPS; i++) {
		preempt_disable();
		preempt_enable();
	}
	time2 = get_cycles();
	local_irq_restore(flags);
	preempt_enable();
	time = time2 - time1;

	printk(KERN_ALERT "test results: time for disabling/enabling preemption\n");
	printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS);
	printk(KERN_ALERT "total time: %llu\n", time);
	time = div_u64_rem(time, NR_LOOPS, &rem);
	printk(KERN_ALERT "-> enabling/disabling preemption takes %llu cycles\n",
	       time);
	printk(KERN_ALERT "test end\n");
}


model name : 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz

[ 423.676079] test init
[ 423.676249] test results: time for baseline
[ 423.676405] number of loops: 200000
[ 423.676676] total time: 104274
[ 423.676910] -> baseline takes 0 cycles
[ 423.677051] test end
[ 423.678150] test results: time for locked cmpxchg
[ 423.678353] number of loops: 200000
[ 423.678498] total time: 2473839
[ 423.678630] -> locked cmpxchg takes 12 cycles
[ 423.678810] test end
[ 423.679204] test results: time for non locked cmpxchg
[ 423.679394] number of loops: 200000
[ 423.679527] total time: 740298
[ 423.679644] -> non locked cmpxchg takes 3 cycles
[ 423.679817] test end
[ 423.680755] test results: time for locked add return
[ 423.680951] number of loops: 200000
[ 423.681089] total time: 2118185
[ 423.681229] -> locked add return takes 10 cycles
[ 423.681411] test end
[ 423.681846] test results: time for enabling interrupts (STI)
[ 423.682063] number of loops: 200000
[ 423.682209] total time: 861591
[ 423.682335] -> enabling interrupts (STI) takes 4 cycles
[ 423.682532] test end
[ 423.683606] test results: time for disabling interrupts (CLI)
[ 423.683852] number of loops: 200000
[ 423.684006] total time: 2440756
[ 423.684141] -> disabling interrupts (CLI) takes 12 cycles
[ 423.684588] test end
[ 423.686626] test results: time for disabling/enabling interrupts (STI/CLI)
[ 423.686879] number of loops: 200000
[ 423.687015] total time: 4802297
[ 423.687139] -> enabling/disabling interrupts (STI/CLI) takes 24 cycles
[ 423.687389] test end
[ 423.688025] test results: time for disabling/enabling preemption
[ 423.688258] number of loops: 200000
[ 423.688396] total time: 1341001
[ 423.688526] -> enabling/disabling preemption takes 6 cycles
[ 423.689276] test end

> The basic primitives add a lot of weight.

I can't see any alternative, given the necessity to prevent counter
updates from being interrupted by the work that syncs the per-CPU
vmstats to the global vmstats.

> And the per cpu atomic update operations require the modification of
> multiple values. The operation cannot be "atomic" in that sense anymore
> and we need some other form of synchronization that can span multiple
> instructions.

So use this_cpu_cmpxchg() to avoid the overhead. Since we can no longer
count on preemption being disabled, we still have some minor issues:
the fetching of the counter thresholds is racy, so a threshold from
another CPU may be applied if we happen to be rescheduled on another
CPU. However, the following vmstat operation will then bring the
counter back under the threshold limit.

OTOH, those small issues are gone with this patchset.