Re: [3.14] core onlining/hotplug regression

From: Daniel J Blueman
Date: Fri Jul 25 2014 - 05:37:24 EST


On 07/25/2014 05:05 PM, Thomas Gleixner wrote:
> On Fri, 25 Jul 2014, Daniel J Blueman wrote:
>> On a larger x86 system with 1728 cores, 3.15(.6) asserts on
>> smpboot_thread_fn's td->cpu != smp_processor_id() consistently after ~1500
>> cores are online.
>>
>> Reverting the only directly related changes I could find [1,2] doesn't help.
>> Debugging indicates there is a race where the created thread is quickly
>> migrated to core 0 when this occurs, since smp_processor_id returns 0 in these
>> cases. Thomas introduced a thread parked state to fix related issues a year
>> back. Linux 3.14(.13) boots just nice.
>
> Weird. Commits [1,2] are definitely not the culprits.
>
>> Full boot output is at:
>> https://resources.numascale.com/linux-315-thread-mig.txt
>
> Not really helpful, as we don't see what causes it. We just see the
> wreckage.
>
>> Any theories so far? I'll start bisecting when I have full access to the
>> system again in a week and I'll do some more debugging with intermittent
>> access before then.
>
> One thing you could try is enabling tracing.
>
> "ftrace=function ftrace_dump_on_oops"
>
> It'll take a looooong time to spill out the traces, but that should
> give us the root cause precisely.

Good trick. I'll get this early next week and we'll see what's up.
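
Since a traced boot on this many cores will be slow, it may be worth
confirming that both options actually reached the kernel command line
before waiting for the oops. A minimal sketch, assuming Python is
available on the box; the helper names and the example command line are
made up for illustration and come from neither the kernel nor this thread:

```python
# Sketch only: report which of the suggested tracing options are missing
# from a kernel command line. Helper names are hypothetical.
REQUIRED = ("ftrace=function", "ftrace_dump_on_oops")


def has_param(tokens, wanted):
    """True if 'wanted' appears as-is or in a 'wanted=value' form."""
    return any(t == wanted or t.startswith(wanted + "=") for t in tokens)


def missing_trace_params(cmdline):
    """Return the suggested tracing options absent from cmdline."""
    tokens = cmdline.split()
    return [p for p in REQUIRED if not has_param(tokens, p)]


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read().strip()


if __name__ == "__main__":
    missing = missing_trace_params(read_cmdline())
    print("missing:", " ".join(missing) if missing else "none")
```

Running it on the booted system prints the options that still need to be
added to the boot loader entry, or "none" if both are present.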

Thanks,
Daniel
--
Daniel J Blueman
Principal Software Engineer, Numascale