Bad hotplug/scheduler interaction?

From: Rick Lindsley
Date: Thu Sep 13 2007 - 03:52:42 EST


I'm concerned that we don't have adequate protection for the scheduler
during cpu hotplug events, but I'm willing to believe I simply don't
understand the mechanism well enough. We had a crash in (comparatively
ancient) 2.6.16.* but I think the relevant code is basically unchanged
since then.

First we introduced some cpu-intensive workloads. Then we added two cpus.
System quickly crashed. The crash was in find_busiest_group(), when
the kernel tried to access "this", which was NULL. If we don't find a
localgroup, we won't set this, and when we try to calculate *imbalance,
we'll dereference a NULL "this" and crash.

As I looked over the code, though, I couldn't tell if the fault was with
find_busiest_group() for not covering this case, or if the problem was
that the method the hotplug code is using to reconstruct the sched_domains
really doesn't protect find_busiest_group (and find_idlest_group) at all.
Can anybody explain how synchronize_sched() is really syncing? It looks
like a half-implemented RCU setup. I fear we really don't have any way
to protect the two functions above from hotplug's desire to twiddle
with the sched_domains.

Do we?

Rick
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/