Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs

From: Paul E. McKenney
Date: Thu Aug 08 2019 - 17:30:22 EST


On Thu, Aug 08, 2019 at 01:35:41PM -0700, Paul E. McKenney wrote:
> On Wed, Aug 07, 2019 at 02:41:31PM -0700, Paul E. McKenney wrote:
> > On Tue, Aug 06, 2019 at 11:08:24AM -0700, Paul E. McKenney wrote:
> > > On Mon, Aug 05, 2019 at 10:48:00AM -0700, Paul E. McKenney wrote:
> > > > On Mon, Aug 05, 2019 at 05:50:24PM +0200, Peter Zijlstra wrote:
> > > > > On Mon, Aug 05, 2019 at 07:54:48AM -0700, Paul E. McKenney wrote:
> > > > >
> > > > > > > Right; so clearly we're not understanding what's happening. That seems
> > > > > > > like a requirement for actually doing a patch.
> > > > > >
> > > > > > Almost but not quite. It is a requirement for a patch *that* *is*
> > > > > > *supposed* *to* *be* *a* *fix*. If you are trying to prohibit me from
> > > > > > writing experimental patches, please feel free to take a long walk on
> > > > > > a short pier.
> > > > > >
> > > > > > Understood???
> > > > >
> > > > > Ah, my bad, I thought you were actually proposing this as an actual
> > > > > patch. I now see that is my bad, I'd overlooked the RFC part.
> > > >
> > > > No problem!
> > > >
> > > > And of course adding tracing decreases the frequency and duration of
> > > > the multi_cpu_stop(). Re-running with shorter-duration triggering. ;-)
> > >
> > > And I did eventually get a good trace. If I am interpreting this trace
> > > correctly, the torture_-135 task didn't get around to attempting to wake
> > > up all of the CPUs. I will try again, but this time with the sched_switch
> > > trace event enabled.
> > >
> > > As a side note, enabling ftrace from the command line seems to interact
> > > badly with turning tracing off and on in the kernel, so I eventually
> > > resorted to trace_printk() in the functions of interest. The trace
> > > output is below, followed by the current diagnostic patch. Please note
> > > that I am -not- using the desperation hammer-the-scheduler patches.
> > >
> > > More as I learn more!
> >
> > And of course I forgot to dump out the online CPUs, so I really had no
> > idea whether or not all the CPUs were accounted for. I added tracing
> > to dump out the online CPUs at the beginning of __stop_cpus() and then
> > reworked it a few times to get the problem to happen in reasonable time.
> > Please see below for the resulting annotated trace.
> >
> > I was primed to expect a lost IPI, perhaps due to yet another qemu bug,
> > but all the migration threads are running within about 2 milliseconds.
> > It is then almost two minutes(!) until the next trace message.
> >
> > Looks like time to (very carefully!) instrument multi_cpu_stop().
> >
> > Of course, if you have any ideas, please do not keep them a secret!
>
> Functionally, multi_cpu_stop() is working fine, according to the trace
> below (search for a line beginning with TAB). But somehow CPU 2 took
> almost three -minutes- to do one iteration of the loop. The prime suspect
> in that loop is cpu_relax() due to the hypervisor having an opportunity
> to do something at that point. The commentary below (again, search for
> a line beginning with TAB) gives my analysis.
>
> Of course, if I am correct, it should be possible to catch cpu_relax()
> in the act. That is the next step, give or take the Heisenbuggy nature
> of this beast.
>
> Another thing for me to try is to run longer with !NO_HZ_FULL, just in
> case the earlier runs just got lucky.
>
> Thoughts?

And it really can happen:

[ 1881.467922] migratio-33 4...1 1879530317us : stop_machine_yield: cpu_relax() took 756140 ms

The previous timestamp was 1123391100us, so this one cpu_relax()
accounts for almost exactly the full delay.
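Doing the arithmetic, in case I am misreading the trace: 1879530317us -
1123391100us = 756139217us, or about 756139 ms, which lines up with the
reported 756140 ms.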

But another instance stalled for many minutes without any ten-second
cpu_relax() being reported. So it is not just cpu_relax() causing
trouble. I could rationalize that as vCPU preemption being at fault...
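
For reference, here is a trimmed sketch of the state-machine loop in
multi_cpu_stop() that the diagnostic patch below instruments (abridged
from kernel/stop_machine.c, comments mine). A vCPU preempted anywhere
in this loop leaves every other CPU in the stop_machine operation
spinning, possibly with interrupts disabled:

	/* Simple state machine. */
	do {
		/* Chill out and ensure we re-read multi_stop_state. */
		stop_machine_yield(cpumask);	/* cpu_relax() by default. */
		if (msdata->state != curstate) {
			curstate = msdata->state;
			/* ... disable irqs / run fn on the active CPUs ... */
			ack_state(msdata);
		} else if (curstate > MULTI_STOP_PREPARE) {
			/* Keep the NMI watchdog and RCU happy while spinning. */
			touch_nmi_watchdog();
			rcu_momentary_dyntick_idle();
		}
	} while (curstate != MULTI_STOP_EXIT);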

And my diagnostic patch is below, just in case I am doing something
stupid with that.

Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index ce00b442ced0..1a50ed258ef0 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3569,6 +3569,7 @@ void __init rcu_init(void)
 	rcu_par_gp_wq = alloc_workqueue("rcu_par_gp", WQ_MEM_RECLAIM, 0);
 	WARN_ON(!rcu_par_gp_wq);
 	srcu_init();
+	tracing_off();
 }
 
 #include "tree_stall.h"
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0b22e55cebe8..a5a879a49051 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -479,6 +479,7 @@ void wake_up_q(struct wake_q_head *head)
 {
 	struct wake_q_node *node = head->first;
 
+	trace_printk("entered\n");
 	while (node != WAKE_Q_TAIL) {
 		struct task_struct *task;
 
@@ -509,6 +510,7 @@ void resched_curr(struct rq *rq)
 	struct task_struct *curr = rq->curr;
 	int cpu;
 
+	trace_printk("entered\n");
 	lockdep_assert_held(&rq->lock);
 
 	if (test_tsk_need_resched(curr))
@@ -1197,6 +1199,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 
 void activate_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	trace_printk("entered\n");
 	if (task_contributes_to_load(p))
 		rq->nr_uninterruptible--;
 
@@ -1298,6 +1301,7 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
 {
 	const struct sched_class *class;
 
+	trace_printk("entered\n");
 	if (p->sched_class == rq->curr->sched_class) {
 		rq->curr->sched_class->check_preempt_curr(rq, p, flags);
 	} else {
@@ -2097,6 +2101,7 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
 static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
 			   struct rq_flags *rf)
 {
+	trace_printk("entered\n");
 	check_preempt_curr(rq, p, wake_flags);
 	p->state = TASK_RUNNING;
 	trace_sched_wakeup(p);
@@ -2132,6 +2137,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 {
 	int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;
 
+	trace_printk("entered\n");
 	lockdep_assert_held(&rq->lock);
 
 #ifdef CONFIG_SMP
@@ -2178,9 +2184,11 @@ void sched_ttwu_pending(void)
 	struct task_struct *p, *t;
 	struct rq_flags rf;
 
+	trace_printk("entered\n");
 	if (!llist)
 		return;
 
+	trace_printk("non-NULL llist\n");
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
 
@@ -2192,6 +2200,7 @@ void sched_ttwu_pending(void)
 
 void scheduler_ipi(void)
 {
+	trace_printk("entered\n");
 	/*
 	 * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
 	 * TIF_NEED_RESCHED remotely (for the first time) will also send
@@ -2232,6 +2241,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
 {
 	struct rq *rq = cpu_rq(cpu);
 
+	trace_printk("%s entered, CPU %d\n", __func__, cpu);
 	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
 
 	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
@@ -2277,6 +2287,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
 	struct rq *rq = cpu_rq(cpu);
 	struct rq_flags rf;
 
+	trace_printk("%s entered\n", __func__);
#if defined(CONFIG_SMP)
 	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
 		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
@@ -2399,6 +2410,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	unsigned long flags;
 	int cpu, success = 0;
 
+	trace_printk("entered\n");
 	preempt_disable();
 	if (p == current) {
 		/*
@@ -2545,6 +2557,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
  */
 int wake_up_process(struct task_struct *p)
 {
+	trace_printk("entered\n");
 	return try_to_wake_up(p, TASK_NORMAL, 0);
 }
 EXPORT_SYMBOL(wake_up_process);
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 5c2b2f90fae1..a07f77b9c1f2 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -21,6 +21,7 @@
 #include <linux/atomic.h>
 #include <linux/nmi.h>
 #include <linux/sched/wake_q.h>
+#include <linux/sched/clock.h>
 
 /*
  * Structure to determine completion condition and record errors. May
@@ -80,6 +81,7 @@ static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
 	unsigned long flags;
 	bool enabled;
 
+	trace_printk("entered for CPU %u\n", cpu);
 	preempt_disable();
 	raw_spin_lock_irqsave(&stopper->lock, flags);
 	enabled = stopper->enabled;
@@ -167,7 +169,7 @@ static void set_state(struct multi_stop_data *msdata,
 	/* Reset ack counter. */
 	atomic_set(&msdata->thread_ack, msdata->num_threads);
 	smp_wmb();
-	msdata->state = newstate;
+	WRITE_ONCE(msdata->state, newstate);
 }
 
 /* Last one to ack a state moves to the next state. */
@@ -179,7 +181,15 @@ static void ack_state(struct multi_stop_data *msdata)
 
 void __weak stop_machine_yield(const struct cpumask *cpumask)
 {
+	u64 starttime = local_clock();
+	u64 endtime;
+	const u64 delta = 100ULL * 1000ULL * 1000ULL * 1000ULL;
+
 	cpu_relax();
+	endtime = local_clock();
+	if (time_after64(endtime, starttime + delta))
+		trace_printk("cpu_relax() took %llu ms\n",
+			     (endtime - starttime) / (1000ULL * 1000ULL));
 }
 
 /* This is the cpu_stop function which stops the CPU. */
@@ -210,8 +220,9 @@ static int multi_cpu_stop(void *data)
 	do {
 		/* Chill out and ensure we re-read multi_stop_state. */
 		stop_machine_yield(cpumask);
-		if (msdata->state != curstate) {
-			curstate = msdata->state;
+		if (READ_ONCE(msdata->state) != curstate) {
+			curstate = READ_ONCE(msdata->state);
+			trace_printk("curstate = %d, ack = %d\n", curstate, atomic_read(&msdata->thread_ack));
 			switch (curstate) {
 			case MULTI_STOP_DISABLE_IRQ:
 				local_irq_disable();
@@ -382,6 +393,7 @@ static bool queue_stop_cpus_work(const struct cpumask *cpumask,
 	 * preempted by a stopper which might wait for other stoppers
 	 * to enter @fn which can lead to deadlock.
 	 */
+	trace_printk("entered\n");
 	preempt_disable();
 	stop_cpus_in_progress = true;
 	for_each_cpu(cpu, cpumask) {
@@ -402,11 +414,18 @@ static int __stop_cpus(const struct cpumask *cpumask,
 		       cpu_stop_fn_t fn, void *arg)
 {
 	struct cpu_stop_done done;
+	unsigned long j = jiffies;
 
+	tracing_on();
+	trace_printk("entered\n");
+	trace_printk("CPUs %*pbl online\n", cpumask_pr_args(cpu_online_mask));
 	cpu_stop_init_done(&done, cpumask_weight(cpumask));
 	if (!queue_stop_cpus_work(cpumask, fn, arg, &done))
 		return -ENOENT;
 	wait_for_completion(&done.completion);
+	tracing_off();
+	if (time_after(jiffies, j + HZ * 20))
+		ftrace_dump(DUMP_ALL);
 	return done.ret;
 }
 
@@ -442,6 +461,7 @@ int stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg)
 {
 	int ret;
 
+	trace_printk("entered\n");
 	/* static works are used, process one request at a time */
 	mutex_lock(&stop_cpus_mutex);
 	ret = __stop_cpus(cpumask, fn, arg);
@@ -599,6 +619,7 @@ int stop_machine_cpuslocked(cpu_stop_fn_t fn, void *data,
 		.active_cpus = cpus,
 	};
 
+	trace_printk("entered\n");
 	lockdep_assert_cpus_held();
 
 	if (!stop_machine_initialized) {