Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain

From: Srikar Dronamraju
Date: Wed Jul 29 2020 - 02:21:03 EST


* Valentin Schneider <valentin.schneider@xxxxxxx> [2020-07-28 16:03:11]:

Hi Valentin,

Thanks for looking into the patches.

> On 27/07/20 06:32, Srikar Dronamraju wrote:
> > Add percpu coregroup maps and masks to create coregroup domain.
> > If a coregroup doesn't exist, the coregroup domain will be degenerated
> > in favour of SMT/CACHE domain.
> >
>
> So there's at least one arm64 platform out there with the same "pairs of
> cores share L2" thing (Ampere eMAG), and that lives quite happily with the
> default scheduler topology (SMT/MC/DIE). Each pair of core gets its MC
> domain, and the whole system is covered by DIE.
>
> Now arguably it's not a perfect representation; DIE doesn't have
> SD_SHARE_PKG_RESOURCES so the highest level sd_llc can point to is MC. That
> will impact all callsites using cpus_share_cache(): in the eMAG case, only
> pairs of cores will be seen as sharing cache, even though *all* cores share
> the same L3.
>

Okay, it's good to know that we have a chip which is similar to P9 in
topology.

> I'm trying to paint a picture of what the P9 topology looks like (the one
> you showcase in your cover letter) to see if there are any similarities;
> from what I gather in [1], wikichips and your cover letter, with P9 you can
> have something like this in a single DIE (somewhat unsure about L3 setup;
> it looks to be distributed?)
>
> +---------------------------------------------------------------------+
> |                                 L3                                  |
> +---------------+-+---------------+-+---------------+-+---------------+
> |      L2       | |      L2       | |      L2       | |      L2       |
> +------+-+------+ +------+-+------+ +------+-+------+ +------+-+------+
> |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  |
> +------+ +------+ +------+ +------+ +------+ +------+ +------+ +------+
> |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs|
> +------+ +------+ +------+ +------+ +------+ +------+ +------+ +------+
>
> Which would lead to (ignoring the whole SMT CPU numbering shenanigans)
>
> NUMA     [                                                   ...
> DIE      [                                             ]
> MC       [         ] [         ] [         ] [         ]
> BIGCORE  [         ] [         ] [         ] [         ]
> SMT      [   ] [   ] [   ] [   ] [   ] [   ] [   ] [   ]
>          00-03 04-07 08-11 12-15 16-19 20-23 24-27 28-31  <other node here>
>

What you have summed up is exactly what a P9 topology looks like. I don't
think I could have explained it better than this.

> This however has MC == BIGCORE; what makes it you can have different spans
> for these two domains? If it's not too much to ask, I'd love to have a P9
> topology diagram.
>
> [1]: 20200722081822.GG9290@xxxxxxxxxxxxxxxxxx

At this time the current topology would be good enough, i.e. BIGCORE would
always be equal to MC. However, in the future we could have chips with
fewer or more CPUs in the LLC than in a BIGCORE, or we could have
granular or split L3 caches within a DIE. In such a case BIGCORE != MC.

Also, in the current P9 itself, two neighbouring core-pairs form a quad.
Cache latency within a quad is better than latency to a distant core-pair.
Cache latency within a core-pair is way better than latency within a quad.
So if we have only 4 threads running on a DIE, all of them accessing the same
cache lines, then we could probably benefit if all the tasks were to run
within the quad, aka MC/Coregroup.
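
To make the nesting concrete, here is a rough toy model in Python (not
kernel code). The sizes are assumptions read off the diagram above: 4 CPUs
per core, 2 cores per core-pair, 2 core-pairs per quad, 32 CPUs per DIE.
It also assumes contiguous CPU numbering, which real P9 systems need not
follow. The helper name closest_shared_level is made up for illustration.

```python
# Toy model only: sizes are assumptions taken from the diagram in this
# thread, not values queried from firmware or the kernel.
THREADS_PER_CORE = 4   # one "4 CPUs" box (SMT)
CORES_PER_PAIR = 2     # two cores sharing an L2
PAIRS_PER_QUAD = 2     # two neighbouring core-pairs
CPUS_PER_DIE = 32      # CPUs 00-31 in the diagram

# Levels ordered from smallest span to largest, as (name, CPUs spanned).
LEVELS = [
    ("SMT", THREADS_PER_CORE),
    ("core-pair", THREADS_PER_CORE * CORES_PER_PAIR),
    ("quad", THREADS_PER_CORE * CORES_PER_PAIR * PAIRS_PER_QUAD),
    ("DIE", CPUS_PER_DIE),
]


def closest_shared_level(cpu_a, cpu_b):
    """Return the smallest level whose span contains both CPUs."""
    for name, span in LEVELS:
        # Assumes contiguous numbering: CPUs in the same group share
        # the same integer quotient when divided by the group size.
        if cpu_a // span == cpu_b // span:
            return name
    return "NUMA"
```

With these assumptions, CPUs 0 and 8 first meet at the quad level, while
CPUs 0 and 16 only meet at the DIE level, which is the case where keeping
the 4 tasks inside one quad should help.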

I have found that some latency-sensitive benchmarks benefit from grouping
at the quad level (using kernel hacks, not backed by firmware changes).
Gautham also found similar results in his experiments, but he only used
binding within the stock kernel.

I am not setting SD_SHARE_PKG_RESOURCES in the MC/Coregroup sd_flags, as
the MC domain need not be the LLC domain on Power.
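
As a simplified illustration of what that choice means for sd_llc, the
Python sketch below (again, not kernel code) walks the domain levels
bottom-up and stops at the first one that does not share cache, which is
my understanding of how the scheduler picks the LLC domain. The level
names and the boolean standing in for SD_SHARE_PKG_RESOURCES are
assumptions for illustration, and highest_sharing_level is a made-up
helper name.

```python
def highest_sharing_level(levels):
    """Return the highest contiguous level that shares cache.

    levels is a bottom-up list of (name, shares_cache) pairs, where
    shares_cache stands in for SD_SHARE_PKG_RESOURCES being set.
    Returns None if even the lowest level does not share cache.
    """
    llc = None
    for name, shares_cache in levels:
        if not shares_cache:
            break  # the first level without the flag ends the walk
        llc = name
    return llc


# Hypothetical layouts: MC with and without the flag.
mc_with_flag = [("SMT", True), ("CACHE", True), ("MC", True), ("DIE", False)]
mc_without_flag = [("SMT", True), ("CACHE", True), ("MC", False), ("DIE", False)]
```

In this model, leaving the flag off MC keeps the LLC at CACHE, whereas
setting it would move the LLC up to MC.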

--
Thanks and Regards
Srikar Dronamraju