Re: current linux-2.6.git: cpusets completely broken

From: Linus Torvalds
Date: Sun Jul 13 2008 - 13:11:47 EST




On Sun, 13 Jul 2008, Dmitry Adamushko wrote:

> And let me explain one last time why I opposed your 'cpu_active_map' approach.

And let me explain why you are totally off base.

> I do agree that there are likely ways to optimize the hotplug
> machinery [ .. deleted rambling .. ]

This has *NOTHING* to do with optimizing any hotplug machinery.

> The current way to synchronize with the load-balancer is to attach
> NULL domains [ .. deleted more ramblings .. ]

This has *NOTHING* to do even with cpusets and scheduler domains!

Until you can understand that, all your arguments are total and utter
CRAP.

So Dmitry - please follow along, and think this through.

This is a *fundamental* scheduler issue. It has nothing what-so-ever to do
with optimization, and it has nothing to do with cpusets. It's about the
fact that we migrate threads from one CPU to another - and we do that
whether cpusets are even enabled or not!

And anything that uses "cpu_online_map" to decide if the migration target
is alive is simply _buggy_.

See? Not "un-optimized". Not "cpusets". Just pure scheduling and hotplug
issues with taking a CPU down.

As long as you continue to only look at wake_idle() and scheduler domains,
you are missing all the *other* cases of migration. Like the one we do at
execve() time, or in balance_task.

The thing is, we should fix the top level code to never even _consider_ an
invalid CPU as a target, and that in turn should mean that all the other
code should be able to just totally ignore CPU hotplug events.

In other words, it very fundamentally SHOULD NOT MATTER that somebody
happened to call "try_to_wake_up()" during the cpu unplug sequence. We
should fix the fundamental scheduler routines to simply make it impossible
for that to ever balance something back to a CPU that is going down.
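
A minimal userspace sketch of that rule, not the actual kernel patch. It
assumes a separate "active" bitmask next to the "online" one, and every
name in it (cpumask_model_t, struct cpu_state, pick_migration_target) is
hypothetical. The point is only that the target-selection routine itself
filters on "active", so no caller has to know a CPU is being unplugged:

```c
#include <stdint.h>

/* Hypothetical userspace model: one bit per CPU, up to 64 CPUs. */
typedef uint64_t cpumask_model_t;

struct cpu_state {
	cpumask_model_t online;	/* CPU can still run code */
	cpumask_model_t active;	/* CPU may be chosen as a migration target */
};

/*
 * Return the lowest-numbered CPU that the task is allowed to run on and
 * that is still a valid migration target, or -1 if there is none.
 * Note that "online" is deliberately not consulted here: a CPU that is
 * going down can be online and still must not receive new tasks.
 */
static int pick_migration_target(const struct cpu_state *st,
				 cpumask_model_t allowed)
{
	cpumask_model_t candidates = allowed & st->active;

	for (int cpu = 0; cpu < 64; cpu++)
		if (candidates & ((cpumask_model_t)1 << cpu))
			return cpu;
	return -1;
}
```

With CPU 1 online but no longer active, a task allowed only on CPU 1 gets
no target at all, and a task allowed on CPUs 1 and 2 gets CPU 2.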

And we shouldn't _care_ about what crazy things the cpusets code does.

See?

THAT is the reason for my patch. I think the cpusets callbacks are totally
insane, but I don't care. What I care about is that the scheduler got
confused just because those insane callbacks happened to make timing be
just subtle enough that (and I quote):

"try_to_wake_up() is called for one of these tasks from another CPU ->
the load-balancer (wake_idle()) picks up a "dead" CPU and places the
task on it. Then e.g. BUG_ON(rq->nr_running) detects this a bit later
-> oops."

IOW, we should never have had code that was that fragile in the first
place! It's totally INSANE to depend on complex and fragile code, when
we'd be much better off with simple code that always says: "I will not
migrate a task to a CPU that is going down".

Depending on complex (and conditional) scheduler domains data structures
is a *bug*. It's fragile, and it's a horrible design mistake.
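
To make the quoted failure concrete, here is a hedged userspace model of
a CPU going down. The three-step ordering (mark inactive, drain, take
offline) is an assumption made for illustration, and all names
(struct hotplug_model, model_enqueue and so on) are hypothetical rather
than kernel interfaces. The enqueue path checks a single bit and nothing
else, so a late wakeup cannot put a task back on the dying CPU:

```c
#include <stdint.h>

#define MODEL_NR_CPUS 64

struct hotplug_model {
	uint64_t online;			/* CPU can still run code */
	uint64_t active;			/* CPU may receive new tasks */
	unsigned int nr_running[MODEL_NR_CPUS];	/* per-CPU runqueue length */
};

static int model_valid(int cpu)
{
	return cpu >= 0 && cpu < MODEL_NR_CPUS;
}

static uint64_t model_bit(int cpu)
{
	return (uint64_t)1 << cpu;
}

/* Wakeup/migration path: refuse any CPU that is not marked active. */
static int model_enqueue(struct hotplug_model *m, int cpu)
{
	if (!model_valid(cpu) || !(m->active & model_bit(cpu)))
		return -1;
	m->nr_running[cpu]++;
	return 0;
}

/* Step 1: stop being a migration target; the CPU stays online. */
static void model_deactivate(struct hotplug_model *m, int cpu)
{
	if (model_valid(cpu))
		m->active &= ~model_bit(cpu);
}

/* Step 2: move every queued task to a fallback CPU that is still active. */
static int model_drain(struct hotplug_model *m, int cpu, int fallback)
{
	if (!model_valid(cpu) || !model_valid(fallback) || cpu == fallback)
		return -1;
	if (!(m->active & model_bit(fallback)))
		return -1;
	m->nr_running[fallback] += m->nr_running[cpu];
	m->nr_running[cpu] = 0;
	return 0;
}

/*
 * Step 3: take the CPU offline. Fails if the runqueue is not empty,
 * which stands in for the BUG_ON(rq->nr_running) in the quoted report.
 */
static int model_offline(struct hotplug_model *m, int cpu)
{
	if (!model_valid(cpu) || m->nr_running[cpu] != 0)
		return -1;
	m->online &= ~model_bit(cpu);
	return 0;
}
```

In this model a wakeup that arrives between step 1 and step 3 is simply
rejected by model_enqueue(), so the runqueue is guaranteed to be empty
when model_offline() runs, without consulting any domain data structure.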

Linus
