Re: [PATCH V2] arch_topology: Clear LLC sibling when cacheinfo teardown

From: Sudeep Holla
Date: Thu Mar 16 2023 - 10:51:25 EST


On Thu, Mar 16, 2023 at 10:30:52AM +0000, Song Shuai wrote:
> Sudeep Holla <sudeep.holla@xxxxxxx> wrote on Thu, Mar 16, 2023 at 09:29:
> >
> > On Tue, Mar 14, 2023 at 03:53:45PM +0800, Song Shuai wrote:
> > > The teardown of CPUHP_AP_BASE_CACHEINFO_ONLINE now only invokes
> > > free_cache_attributes() to clear the share_cpu_map of the cacheinfo
> > > list. Meanwhile, the clearing of cpu_topology[].llc_sibling happens
> > > quite late, in the teardown code of the hotplug STARTING section.
> > >
> > > To avoid generating incorrect LLC sibling masks, move the clearing
> > > right after free_cache_attributes().
> > >
> >
> > Technically, in terms of flow/timing, this is correct. However, I would
> > like to know if you are seeing any issues without this change?
> >
> > Technically, if a cpu is hotplugged out, the cacheinfo is reset first
> > and then the topology. Until the cpu is removed, the LLC info in the
> > topology is still valid. Also, I am not sure anything gets scheduled
> > and this LLC info is utilised once the teardown of CPUHP_AP_BASE_CACHEINFO_ONLINE
> > has started.
>
> There is no visible issue in the entire offline process (e.g. echo 0 > online).
>
> However, when I stepped the CPU offline to the state just before
> CACHEINFO_ONLINE on my kernel with CONFIG_CPU_HOTPLUG_STATE_CONTROL
> enabled, the share_cpu_map had been updated but llc_sibling had not,
> which resulted in a trivial issue:
>
> At the end of the stepped hot-unplug, cpuset_hotplug_work is flushed
> and the sched domains are rebuilt, where the **cpu_coregroup_mask** in
> sched_domain_topology sees the incorrect llc_sibling, but the result of
> the rebuild is still correct due to the protection of cpu_active_mask.
>

Wait, I would like to disagree there. While I agree there is an inconsistency
between the cacheinfo cpu_shared_map and llc_sibling in the teardown path,
the latter is still correct, and terming the llc_sibling "incorrect" is wrong.
The cpu is not completely offline yet, and hence llc_sibling represents all
the cpus it shares the LLC with. When the cpu is offlined, the cpu_topology
is removed anyway. So I don't see it as an issue at all. If you follow
__cpu_disable()->remove_cpu_topology(), it gets updated there. If the
sched_domain_topology is not rebuilt after that, then we may have other
issues. What am I missing?

I am not bothered by the cacheinfo cpu_shared_map and cpu_topology
llc_sibling mismatching for a short window during the teardown: until the
cpu is torn down, it is still sharing the LLC with the cpus in llc_sibling,
so it is definitely not wrong to have it in there.

> Stepped hotplugging may not be used in a production environment,
> but the issue exists.

What issue? If it is just the inconsistency, then I am fine ignoring it.
That window is artificial and momentary, and it impacts nothing.

> Even in the entire offline process, it is possible that a future user
> gets the wrong llc_sibling when accessing it concurrently with, or right
> after, the teardown of CACHEINFO_ONLINE.

As I said, even if someone accesses it, it is not wrong information.

--
Regards,
Sudeep