Re: [PATCH] tracing/osnoise: Force quiescent states while tracing

From: Nicolas Saenz Julienne
Date: Tue Mar 01 2022 - 05:00:19 EST


On Mon, 2022-02-28 at 14:11 -0800, Paul E. McKenney wrote:
> On Mon, Feb 28, 2022 at 03:14:23PM +0100, Nicolas Saenz Julienne wrote:
> > At the moment running osnoise on an isolated CPU and a PREEMPT_RCU
> > kernel might have the side effect of extending grace periods too much.
> > This will eventually entice RCU to schedule a task on the isolated CPU
> > to end the overly extended grace period, adding unwarranted noise to the
> > CPU being traced in the process.
> >
> > So, check if we're the only ones running on this isolated CPU and that
> > we're on a PREEMPT_RCU setup. If so, let's force quiescent states in
> > between measurements.
> >
> > Non-PREEMPT_RCU setups don't need to worry about this as osnoise main
> > loop's cond_resched() will go though a quiescent state for them.
> >
> > Note that this same exact problem is what extended quiescent states were
> > created for. But adapting them to this specific use-case isn't trivial
> > as it'll imply reworking entry/exit and dynticks/context tracking code.
> >
> > Signed-off-by: Nicolas Saenz Julienne <nsaenzju@xxxxxxxxxx>
> > ---
> > kernel/trace/trace_osnoise.c | 19 +++++++++++++++++++
> > 1 file changed, 19 insertions(+)
> >
> > diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> > index 870a08da5b48..4928358f6e88 100644
> > --- a/kernel/trace/trace_osnoise.c
> > +++ b/kernel/trace/trace_osnoise.c
> > @@ -21,7 +21,9 @@
> > #include <linux/uaccess.h>
> > #include <linux/cpumask.h>
> > #include <linux/delay.h>
> > +#include <linux/tick.h>
> > #include <linux/sched/clock.h>
> > +#include <linux/sched/isolation.h>
> > #include <uapi/linux/sched/types.h>
> > #include <linux/sched.h>
> > #include "trace.h"
> > @@ -1295,6 +1297,7 @@ static int run_osnoise(void)
> > struct osnoise_sample s;
> > unsigned int threshold;
> > u64 runtime, stop_in;
> > + unsigned long flags;
> > u64 sum_noise = 0;
> > int hw_count = 0;
> > int ret = -1;
> > @@ -1386,6 +1389,22 @@ static int run_osnoise(void)
> > osnoise_stop_tracing();
> > }
> >
> > + /*
> > + * Check if we're the only ones running on this nohz_full CPU
> > + * and that we're on a PREEMPT_RCU setup. If so, let's fake a
> > + * QS since there is no way for RCU to know we're not making
> > + * use of it.
> > + *
> > + * Otherwise it'll be done through cond_resched().
> > + */
> > + if (IS_ENABLED(CONFIG_PREEMPT_RCU) &&
> > + !housekeeping_cpu(raw_smp_processor_id(), HK_FLAG_MISC) &&
> > + tick_nohz_tick_stopped()) {
> > + local_irq_save(flags);
> > + rcu_momentary_dyntick_idle();
> > + local_irq_restore(flags);
>
> What is supposed to happen in this case is that RCU figures out that
> there is a nohz_full CPU running for an extended period of time in the
> kernel and takes matters into its own hands. This goes as follows on
> a HZ=1000 kernel with default RCU settings:
>
> o At about 20 milliseconds into the grace period, RCU makes
> cond_resched() report quiescent states, among other things.
> As you say, this does not help for CONFIG_PREEMPT=n kernels.
>
> o At about 30 milliseconds into the grace period, RCU forces an
> explicit context switch on the wayward CPU. This should get
> the CPU's attention even in CONFIG_PREEMPT=y kernels.
>
> So what is happening for you instead?

Well, that's exactly what I'm seeing, but it doesn't play well with osnoise.

Here's a simplified view of what the tracer does:

time1 = get_time();
while(1) {
time2 = get_time();
if (time2 - time1 > threshold)
trace_noise();
cond_resched();
time1 = time2;
}

This is pinned to a specific CPU, and in the most extreme cases is expected to
take 100% of CPU time. Eventually, some SMI, NMI/interrupt, or process
execution will trigger the threshold, and osnoise will provide some nice traces
explaining what happened.

RCU forcing a context switch on the wayward CPU is introducing unwarranted
noise as it's triggered by the fact we're measuring and wouldn't happen
otherwise.

If this were user-space, we'd be in an EQS, which would make this problem go
away. An option would be mimicking this behaviour (assuming irq entry/exit code
did the right thing):

rcu_eqs_enter(); <--
time1 = get_time();
while(1) {
time2 = get_time();
if (time2 - time1 > threshold)
trace_noise();
rcu_eqs_exit(); <--
cond_resched();
rcu_eqs_enter(); <--
time1 = time2;
}

But given the tight loop this isn't much different than what I'm proposing at
the moment, isn't it? rcu_momentary_dyntick_idle() just emulates a really fast
EQS entry/exit.

Thanks!

--
Nicolás Sáenz