Re: [PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path
From: Sudeep Holla
Date: Tue Jul 19 2022 - 06:29:31 EST
On Mon, Jul 18, 2022 at 05:57:33PM +0000, Conor.Dooley@xxxxxxxxxxxxx wrote:
> On 18/07/2022 18:41, Guenter Roeck wrote:
> > On Wed, Jul 13, 2022 at 02:33:44PM +0100, Sudeep Holla wrote:
> >> init_cpu_topology() is called only once at the boot and all the cache
> >> attributes are detected early for all the possible CPUs. However when
> >> the CPUs are hotplugged out, the cacheinfo gets removed. While the
> >> attributes are added back when the CPUs are hotplugged back in as part
> >> of CPU hotplug state machine, it ends up called quite late after the
> >> update_siblings_masks() are called in the secondary_start_kernel()
> >> resulting in wrong llc_sibling_masks.
> >>
> >> Move the call to detect_cache_attributes() inside update_siblings_masks()
> >> to ensure the cacheinfo is updated before the LLC sibling masks are
> >> updated. This will fix the incorrect LLC sibling masks generated when
> >> the CPUs are hotplugged out and hotplugged back in again.
> >>
> >> Reported-by: Ionela Voinescu <ionela.voinescu@xxxxxxx>
> >> Signed-off-by: Sudeep Holla <sudeep.holla@xxxxxxx>
> >> ---
> >> drivers/base/arch_topology.c | 16 ++++++----------
> >> 1 file changed, 6 insertions(+), 10 deletions(-)
> >>
> >> Hi Conor,
> >>
> >> Ionela reported an issue with the CPU hotplug and as a fix I need to
> >> move the call to detect_cache_attributes() which I had thought to keep
> >> it there from first but for no reason had moved it to init_cpu_topology().
> >>
> >> Wonder if this fixes the -ENOMEM on RISC-V as this one is called on the
> >> cpu in the secondary CPUs init path while init_cpu_topology executed
> >> detect_cache_attributes() for all possible CPUs much earlier. I think
> >> this might help as the percpu memory might be initialised in this case.
> >>
> >> Anyways give this a try, also test the CPU hotplug and check if nothing
> >> is broken on RISC-V. We noticed this bug only on one platform while
> >>
> >
> > arm64, with next-20220718:
> >
> > ...
> > [ 0.823405] Detected PIPT I-cache on CPU1
> > [ 0.824456] BUG: sleeping function called from invalid context at kernel/locking/semaphore.c:164
> > [ 0.824550] in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 0, name: swapper/1
> > [ 0.824600] preempt_count: 1, expected: 0
> > [ 0.824633] RCU nest depth: 0, expected: 0
> > [ 0.824899] no locks held by swapper/1/0.
> > [ 0.825035] irq event stamp: 0
> > [ 0.825072] hardirqs last enabled at (0): [<0000000000000000>] 0x0
> > [ 0.826017] hardirqs last disabled at (0): [<ffff800008158870>] copy_process+0x5e0/0x18e4
> > [ 0.826123] softirqs last enabled at (0): [<ffff800008158870>] copy_process+0x5e0/0x18e4
> > [ 0.826191] softirqs last disabled at (0): [<0000000000000000>] 0x0
> > [ 0.826764] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.19.0-rc7-next-20220718 #1
> > [ 0.827397] Call trace:
> > [ 0.827456] dump_backtrace.part.0+0xd4/0xe0
> > [ 0.827574] show_stack+0x18/0x50
> > [ 0.827625] dump_stack_lvl+0x9c/0xd8
> > [ 0.827678] dump_stack+0x18/0x34
> > [ 0.827722] __might_resched+0x178/0x220
> > [ 0.827778] __might_sleep+0x48/0x80
> > [ 0.827833] down_timeout+0x2c/0xa0
> > [ 0.827896] acpi_os_wait_semaphore+0x68/0x9c
> > [ 0.827952] acpi_ut_acquire_mutex+0x4c/0xb8
> > [ 0.828008] acpi_get_table+0x38/0xbc
> > [ 0.828059] acpi_find_last_cache_level+0x44/0x130
> > [ 0.828112] init_cache_level+0xb8/0xcc
> > [ 0.828165] detect_cache_attributes+0x240/0x580
> > [ 0.828217] update_siblings_masks+0x28/0x270
> > [ 0.828270] store_cpu_topology+0x64/0x74
> > [ 0.828326] secondary_start_kernel+0xd0/0x150
> > [ 0.828386] __secondary_switched+0xb0/0xb4
> >
> > I know the problem has already been reported, but I think the backtrace
> > above is slightly different.
>
Thanks for the report, I forgot to run with lockdep on ACPI system. This is
trickier. I will take a look at it.
> Aye, I got a different BT on RISC-V + DT - but that should be fixed in
> next-20220718. This is a different problem unfortunately.
Yes, ACPI is bit different flow.
--
Regards,
Sudeep