Re: [PATCH V2] arch_topology: Clear LLC sibling when cacheinfo teardown

From: Song Shuai
Date: Mon Mar 20 2023 - 02:20:52 EST


Sudeep Holla <sudeep.holla@xxxxxxx> 于2023年3月16日周四 14:51写道:
>
> On Thu, Mar 16, 2023 at 10:30:52AM +0000, Song Shuai wrote:
> > Sudeep Holla <sudeep.holla@xxxxxxx> 于2023年3月16日周四 09:29写道:
> > >
> > > On Tue, Mar 14, 2023 at 03:53:45PM +0800, Song Shuai wrote:
> > > > The teardown of CPUHP_AP_BASE_CACHEINFO_ONLINE now only invokes
> > > > free_cache_attributes() to clear the shared_cpu_map of the cacheinfo
> > > > list. Meanwhile, cpu_topology[].llc_sibling is cleared quite late,
> > > > in the teardown code of the hotplug STARTING section.
> > > >
> > > > To avoid generating incorrect LLC sibling masks, move its clearing
> > > > right after free_cache_attributes().
> > > >
> > >
> > > Technically in terms of flow/timing this is correct. However I would like
> > > to know if you are seeing any issues without this change ?
> > >
> > > Technically, if a cpu is hotplugged out, the cacheinfo is reset first
> > > and then the topology. Until the cpu is removed, the LLC info in the
> > > topology is still valid. Also I am not sure if anything gets scheduled
> > > that utilises this LLC info once the teardown of CPUHP_AP_BASE_CACHEINFO_ONLINE
> > > has started.
> >
> > There is no visible issue in the entire offline process (e.g. echo 0 > online).
> >
> > However, when I hotplugged out the CPU to the state just before
> > CACHEINFO_ONLINE on my kernel with CONFIG_CPU_HOTPLUG_STATE_CONTROL
> > enabled, the shared_cpu_map had been updated but llc_sibling had not,
> > which would result in a trivial issue:
> >
> > At the end of the stepped hotplug-out, the cpuset_hotplug_work would be
> > flushed and the sched domains would then be rebuilt, where
> > **cpu_coregroup_mask** in sched_domain_topology saw the stale
> > llc_sibling; the result of the rebuild was nevertheless correct due
> > to the protection of cpu_active_mask.
> >
>
> Wait, I would like to disagree there. While I agree there is inconsistency
> between the cacheinfo shared_cpu_map and llc_sibling in the teardown path,
> llc_sibling is still correct, and terming it "incorrect" is wrong.
> The cpu is not yet completely offline, and hence llc_sibling
> represents all the cpus it shares the LLC with. When the cpu is offlined,
> the cpu_topology is removed anyway. So I don't see it as an issue at all.
> If you follow __cpu_disable()->remove_cpu_topology(), it gets updated there.
> If the sched_domain_topology is not rebuilt after that, then we may have
> other issues. What am I missing ?
>
> I am not bothered by the cacheinfo shared_cpu_map and cpu_topology llc_sibling
> mismatch for a short window during the teardown; technically, until the cpu
> is torn down, it is sharing the LLC with its llc_sibling cpus, so it is
> definitely not wrong to have them in there.
>
> > Stepped hotplugging may not be used in production environments, but
> > the issue exists.
>
> What issue ? If it is just inconsistency, then I am fine to ignore. That
> is just artificial and momentary and it impacts nothing.

My original point was to clear llc_sibling right after the clearing of
shared_cpu_map, as you did in commit 3fcbf1c77d08.

And the ~~issue~~ I described above was found when I manually tested
the 'base/cacheinfo:online' hpstate, which can be triggered by the
following commands:

```
# find the hotplug state id of base/cacheinfo:online
hpid=$(sed -n '/cacheinfo/s/:.*//p' /sys/devices/system/cpu/hotplug/states)
# roll cpu2 back to the state just before CACHEINFO_ONLINE
echo $((hpid-1)) > /sys/devices/system/cpu/cpu2/hotplug/target
```
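For what it's worth, the sed invocation above works by matching lines that
contain "cacheinfo" and deleting everything from the first ':' onward, which
leaves only the numeric state id. A minimal standalone illustration on a
sample line (the id 154 is made up for the example; the real value comes
from the live sysfs file):

```
# Sample line in the format of /sys/devices/system/cpu/hotplug/states;
# the state id 154 here is hypothetical.
line="154: base/cacheinfo:online"
# Match lines containing "cacheinfo", strip from the first ':' to the end,
# and print only the remaining state id.
echo "$line" | sed -n '/cacheinfo/s/:.*//p'
# prints: 154
```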

Anyway, the short inconsistency window you explained seems acceptable to me.

>
> > Even in the entire offline process, it's possible that a future user
> > gets the wrong llc_sibling when accessing it concurrently or right
> > after the teardown of CACHEINFO_ONLINE.
>
> As I said, even if someone accesses it, it is not wrong information.
>
> --
> Regards,
> Sudeep



--
Thanks,
Song