Re: Early boot regression from f0551af0213 ("x86/topology: Ignore non-present APIC IDs in a present package")

From: Mario Limonciello
Date: Thu May 02 2024 - 06:33:43 EST


On 4/25/2024 16:42, Thomas Gleixner wrote:
Lyude!

On Thu, Apr 25 2024 at 11:56, Lyude Paul wrote:
On Thu, 2024-04-25 at 04:11 +0200, Thomas Gleixner wrote:

Can you please boot a kernel with the commit in question reverted and
add 'possible_cpus=8' to the kernel command line?

In theory this should fail too.

Yep - tried booting a kernel with f0551af0213 reverted and
possible_cpus=8, it definitely looks like that crashes things as well
in the same way.

Good. That means it's a problem which existed before but went unnoticed.

Also - it scrolled off the screen before I had a chance to write it
down, but I'm -fairly- sure I saw some sort of complaint about "16 [or
some double digit number] processors exceeds max number of 8". Which
is quite interesting, as this is definitely just a quad core ryzen
processor with hyperthreading - so there should only be 8 threads.

Right, that's what we saw with the debug patch. The ACPI/MADT table
is clearly bonkers. The effect of it is that it pretends that the system
has 16 possible CPUs:

[ 0.089381] CPU topo: Allowing 8 present CPUs plus 8 hotplug CPUs

Which in turn changes the sizing of the per CPU data and affects some
other details which depend on the number of possible CPUs.

I suspect that at least this aspect is caused by commit fed8d8773b8ea68ad99d9eee8c8343bef9da2c2c.

If you try reverting that, I expect the "hotplug CPUs" to disappear.
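
To compare boots with and without the revert, a rough sketch like the one below could pull the counts out of a saved dmesg. The helper names are made up for illustration, and the regex assumes the message keeps the exact wording of the "CPU topo:" line quoted above; treat it as a starting point, not a tested tool.

```python
import re

# Assumed format, taken from the boot message quoted in this thread:
# "[ 0.089381] CPU topo: Allowing 8 present CPUs plus 8 hotplug CPUs"
TOPO_RE = re.compile(
    r"CPU topo: Allowing (\d+) present CPUs plus (\d+) hotplug CPUs"
)

def parse_cpu_topo(log_text):
    """Return (present, hotplug) from the first matching line, or None."""
    match = TOPO_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def possible_cpus(log_text):
    """Return present + hotplug, i.e. the possible CPU count, or None."""
    counts = parse_cpu_topo(log_text)
    if counts is None:
        return None
    return sum(counts)
```

If the revert behaves as expected, the hotplug count for this machine should drop to 0 and the possible CPU count should match the 8 present threads.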


But that should not matter at all because the system scaling should be
sufficient with 8 CPUs, but it does not for some completely non-obvious
reasons.

Can you please try to increase possible_cpus=N on the command line one
by one and check when it actually starts to "work" again.

One other thing to try is to boot with 'possible_cpus=8' and
'intremap=off' and see whether that makes a difference.

I really have no idea where to look and not having the early boot
messages in case of the fail is not helpful as I can't add meaningful
debug to it.

I just checked: the motherboard has a serial port, so it would be
extremely helpful to hook up a serial cable to this thing and enable
serial console on the kernel command line. That way we might eventually
see information which is emitted before it fails to validate the timer
interrupt.

Thanks,

tglx