Re: [PATCH] x86/tsc: Use topology_max_packages() to get package number

From: Feng Tang
Date: Mon Mar 18 2024 - 22:25:25 EST


On Mon, Mar 18, 2024 at 09:35:57AM -0700, Dave Hansen wrote:
> On 3/15/24 04:26, Feng Tang wrote:
> > Thomas' recent patchset of refactoring x86 topology code introduces
> > topology_max_package(), which works well in most of the above cases.
> > The only exceptions are 'nr_cpus=' and 'possible_cpus=' setup, which
> > sets up the 'nr_cpu_ids' and rejects the rest of the CPUs, and may
> > cause topology_max_package() less than the real package number, but
> > it's fine as it is rarely used debug option, and logical package
> > number really matters in this check. So use the more accurate
> > topology_max_package() to replace nr_online_nodes().
>
> This is a lot longer than what you have in this changelog, but I hope it
> makes sense. This is retreading some of what Thomas's set does, but I
> think it's worth re-explaining:
>
> The boot CPU iterates through all enumerated APIC ids and either accepts
> or rejects the APIC id. For accepted ids, it figures out which bits of
> the id map to the package number. It tracks which package numbers have
> been seen in a bitmap. topology_max_package() just returns the number
> of bits set in that bitmap.
>
> 'nr_cpus=' and 'possible_cpus=' can cause more APIC ids to be rejected
> and can artificially lower the number of bits in the package bitmap and
> thus topology_max_package().
>
> This means that, for example, a system with 8 physical packages might
> reject all the CPUs on 6 of those packages and be left with only 2
> packages and 2 bits set in the package bitmap. It needs the TSC
> watchdog, but would disable it anyway. This isn't ideal, but this only
> happens for debug-oriented options.
>
> This is fixable by tracking the package numbers for rejected CPUs. But
> it's not worth the trouble for debugging.

Will steal it for the v2 :). Many thanks!

>
> --
>
> The only thing that comes to mind is if we need something simple like:
>
> if (topo_info.nr_rejected_cpus)
> pr_info("TSC might be buggered\n");
>
> ... somewhere.

I think logically this should be together with the package checking
code in tsc.c, and we need to export the info from static data
'topo_info'. The other thought is adding in topology code:

--- a/arch/x86/kernel/cpu/topology.c
+++ b/arch/x86/kernel/cpu/topology.c
@@ -460,8 +460,11 @@ void __init topology_init_possible_cpus(void)
pr_info("Num. threads per package: %3u\n", __num_threads_per_package);

pr_info("Allowing %u present CPUs plus %u hotplug CPUs\n", assigned, disabled);
- if (topo_info.nr_rejected_cpus)
+ if (topo_info.nr_rejected_cpus) {
pr_info("Rejected CPUs %u\n", topo_info.nr_rejected_cpus);
+ if (__max_logical_packages <= 4)
+ pr_warning("TSC might be bugged due to these rejected CPUs\n");
+ }

Thanks,
Feng