Re: [PATCH v2 6/6] stop_machine: kill stop_cpus_lock and lg_double_lock/unlock()

From: Oleg Nesterov
Date: Mon Aug 03 2015 - 11:00:43 EST


On 07/30, Peter Zijlstra wrote:
>
> On Tue, Jul 21, 2015 at 09:22:47PM +0200, Oleg Nesterov wrote:
>
> > + err = -EDEADLK;
> > + if (stop_work_pending(stopper1) != stop_work_pending(stopper2))
> > + goto unlock;
>
> You could DoS/false positive this by running stop_one_cpu() in a loop,
> and thereby 'always' having work pending on one but not the other.

as we already discussed this is not a problem.

> > + if (unlikely(err == -EDEADLK)) {
> > + cond_resched();
> > + goto retry;
>
> And this just gives me -rt nightmares.

Why?

> As it is, -rt does horrible things to stop_machine, and I would very
> much like to make it such that we don't need to do that.
>
> Now, obviously, stop_cpus() is _BAD_ for -rt, and we try real hard to
> make sure that doesn't happen,

Yes. stop_cpus() is already bad so I am not sure I understand why this
change make the things really worse.

stop_two_cpus() needs to spin/retry if it races with the main loop in
queue_stop_cpus_work(),

preempt_disable();
for_each_cpu(cpu, cpumask) {
work = &per_cpu(cpu_stopper.stop_work, cpu);
work->fn = fn;
work->arg = arg;
work->done = done;
cpu_stop_queue_work(cpu, work);
}
preempt_enable();

and iirc preempt_disable() means "disable preemption" even in -rt, but
I am not sure. So "goto retry" should be really, really unlikely.

Besides, whatever we do stop_two_cpus(X, Y) will wait anyway if ->stop_work
was queued on X or Y anyway. And with your patch in the next email it will
spin too (yes, yes, -rt differs).


Another case when stop_two_cpus(X, Y) needs to retry is when ->stop_work
was already dequeued on CPU X but not on CPU Y (and this is why it needs
cond_resched() for CONFIG_PREEMPT=n, it can run on CPU Y). This does not
look really bad too, the migration/Y thread is already activated and it
has the highest priority.


So I still think that at least correctness wise this patch is fine. Am I
missed something else?


> Paul's RCU branch already kills try_stop_cpus() dead, so that wart is
> also gone. But we're still stuck with stop_machine_from_inactive_cpu()
> which does a spin-wait for exclusive state. So I suppose we'll have to
> keep stop_cpus_mutex :/

Yes, we still need stop_cpus_mutex. Even if we remove try_stop_cpus() and
stop_machine_from_inactive_cpu(). But this is another issue.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/