Re: [PATCH v7 2/3] tick/sched: Ensure quiet_vmstat() is called when the idle tick was stopped too

From: Frederic Weisbecker
Date: Fri Sep 09 2022 - 08:16:18 EST


On Wed, Aug 17, 2022 at 04:13:48PM -0300, Marcelo Tosatti wrote:
> From: Aaron Tomlin <atomlin@xxxxxxxxxx>
>
> In the context of the idle task and an adaptive-tick mode/or a nohz_full
> CPU, quiet_vmstat() can be called: before stopping the idle tick,
> entering an idle state and on exit. In particular, for the latter case,
> when the idle task is required to reschedule, the idle tick can remain
> stopped

Since quiet_vmstat() is only called when ts->tick_stopped = false, this
can only happen if the idle loop did not enter into dynticks idle mode
but the exiting idle task eventually stops the tick
(tick_nohz_idle_update_tick()).

This can happen for example if we enter the idle loop with a timer callback
pending in one jiffies, then once that timer fires, which wakes up a task,
we exit the idle loop and then tick_nohz_idle_update_tick() doesn't see any
timer callback pending left and the tick can be stopped.

Or am I missing something?

> and the timer expiration time endless i.e., KTIME_MAX. Now,
> indeed before a nohz_full CPU enters an idle state, CPU-specific vmstat
> counters should be processed to ensure the respective values have been
> reset and folded into the zone specific 'vm_stat[]'. That being said, it
> can only occur when: the idle tick was previously stopped, and
> reprogramming of the timer is not required.
>
> A customer provided some evidence which indicates that the idle tick was
> stopped; albeit, CPU-specific vmstat counters still remained populated.
> Thus one can only assume quiet_vmstat() was not invoked on return to the
> idle loop.
>
> If I understand correctly, I suspect this divergence might erroneously
> prevent a reclaim attempt by kswapd. If the number of zone specific free
> pages are below their per-cpu drift value then
> zone_page_state_snapshot() is used to compute a more accurate view of
> the aforementioned statistic. Thus any task blocked on the NUMA node
> specific pfmemalloc_wait queue will be unable to make significant
> progress via direct reclaim unless it is killed after being woken up by
> kswapd (see throttle_direct_reclaim()).
>
> Consider the following theoretical scenario:
>
> 1. CPU Y migrated running task A to CPU X that was
> in an idle state i.e. waiting for an IRQ - not
> polling; marked the current task on CPU X to
> need/or require a reschedule i.e., set
> TIF_NEED_RESCHED and invoked a reschedule IPI to
> CPU X (see sched_move_task())

CPU Y is nohz_full right?

>
> 2. CPU X acknowledged the reschedule IPI from CPU Y;
> generic idle loop code noticed the
> TIF_NEED_RESCHED flag against the idle task and
> attempts to exit of the loop and calls the main
> scheduler function i.e. __schedule().
>
> Since the idle tick was previously stopped no
> scheduling-clock tick would occur.
> So, no deferred timers would be handled
>
> 3. Post transition to kernel execution Task A
> running on CPU Y, indirectly released a few pages
> (e.g. see __free_one_page()); CPU Y's
> 'vm_stat_diff[NR_FREE_PAGES]' was updated and zone
> specific 'vm_stat[]' update was deferred as per the
> CPU-specific stat threshold
>
> 4. Task A does invoke exit(2) and the kernel does
> remove the task from the run-queue; the idle task
> was selected to execute next since there are no
> other runnable tasks assigned to the given CPU
> (see pick_next_task() and pick_next_task_idle())

This happens on CPU X, right?

>
> 5. On return to the idle loop since the idle tick
> was already stopped and can remain so (see [1]
> below) e.g. no pending soft IRQs, no attempt is
> made to zero and fold CPU Y's vmstat counters
> since reprogramming of the scheduling-clock tick
> is not required/or needed (see [2])

And now back to CPU Y, confused...

[...]
> Index: linux-2.6/kernel/time/tick-sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/time/tick-sched.c
> +++ linux-2.6/kernel/time/tick-sched.c
> @@ -26,6 +26,7 @@
> #include <linux/posix-timers.h>
> #include <linux/context_tracking.h>
> #include <linux/mm.h>
> +#include <linux/rcupdate.h>
>
> #include <asm/irq_regs.h>
>
> @@ -519,6 +520,20 @@ void __tick_nohz_task_switch(void)
> }
> }
>
> +void __tick_nohz_user_enter_prepare(void)
> +{
> + struct tick_sched *ts;
> +
> + if (tick_nohz_full_cpu(smp_processor_id())) {
> + ts = this_cpu_ptr(&tick_cpu_sched);
> +
> + if (ts->tick_stopped)
> + quiet_vmstat();

Wasn't it supposed to be part of the quiescing in task isolation
mode?

Because currently vmstat is a deferrable timer but that deferrability
may not apply to nohz_full anymore (outside idle). And quiet_vmstat()
doesn't cancel the timer so you'll still get the disturbance.

See this patch: https://lore.kernel.org/lkml/20220725104356.GA2950296@lothringen/

> + rcu_nocb_flush_deferred_wakeup();
> + }
> +}
>
> +EXPORT_SYMBOL_GPL(__tick_nohz_user_enter_prepare);
> +
> /* Get the boot-time nohz CPU list from the kernel parameters. */
> void __init tick_nohz_full_setup(cpumask_var_t cpumask)
> {
> @@ -890,6 +905,9 @@ static void tick_nohz_stop_tick(struct t
> ts->do_timer_last = 0;
> }
>
> + /* Attempt to fold when the idle tick is stopped or not */
> + quiet_vmstat();
> +
> /* Skip reprogram of event if its not changed */
> if (ts->tick_stopped && (expires == ts->next_tick)) {
> /* Sanity check: make sure clockevent is actually programmed */

But that chunk looks good.

Thanks.

> @@ -911,7 +929,6 @@ static void tick_nohz_stop_tick(struct t
> */
> if (!ts->tick_stopped) {
> calc_load_nohz_start();
> - quiet_vmstat();
>
> ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
> ts->tick_stopped = 1;
>
>