Re: [RFC][PATCH] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu online

From: Alexey Klimov
Date: Tue Jan 19 2021 - 11:56:41 EST


On Fri, Jan 15, 2021 at 6:54 AM Daniel Jordan
<daniel.m.jordan@xxxxxxxxxx> wrote:
>
> Daniel Jordan <daniel.m.jordan@xxxxxxxxxx> writes:
> > Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:
> >>> The nature of this bug is also described here (with different consequences):
> >>> https://lore.kernel.org/lkml/20200211141554.24181-1-qais.yousef@xxxxxxx/
> >>
> >> Yeah, pesky deadlocks.. someone was going to try again.
> >
> > I dug up the synchronous patch
> >
> > https://lore.kernel.org/lkml/1579878449-10164-1-git-send-email-prsood@xxxxxxxxxxxxxx/
> >
> > but surprisingly wasn't able to reproduce the lockdep splat from
> >
> > https://lore.kernel.org/lkml/F0388D99-84D7-453B-9B6B-EEFF0E7BE4CC@xxxxxx/
> >
> > even though I could hit it a few weeks ago.
>
> oh okay, you need to mount a legacy cpuset hierarchy.
>
> So as the above splat shows, making cpuset_hotplug_workfn() synchronous
> means cpu_hotplug_lock (and "cpuhp_state-down") can be acquired before
> cgroup_mutex.
>
> But there are at least four cgroup paths that take the locks in the
> opposite order. They're all the same, they take cgroup_mutex and then
> cpu_hotplug_lock later on to modify one or more static keys.
>
> cpu_hotplug_lock should probably be ahead of cgroup_mutex because the
> latter is taken in a hotplug callback, and we should keep the static
> branches in cgroup, so the only way out I can think of is moving
> cpu_hotplug_lock to just before cgroup_mutex is taken and switching to
> _cpuslocked flavors of the static key calls.
>
> lockdep quiets down with that change everywhere, but it puts another big
> lock around a lot of cgroup paths. Seems less heavyhanded to go with
> this RFC. What do you all think?

Daniel, thank you for taking a look. I don't mind reviewing+testing
another approach that you described.

> Absent further discussion, Alexey, do you plan to post another version?

I plan to update this patch and re-send in the next couple of days. It
looks like it might be a series of two patches. Sorry for delays.

Best regards,
Alexey