Re: sysfs topology for arm64 cluster_id

From: Don Dutile
Date: Wed Jan 14 2015 - 11:07:43 EST


On 01/13/2015 07:47 PM, Jon Masters wrote:
Hi Folks,

TLDR: I would like to consider the value of adding something like
"cluster_siblings" or similar in sysfs to describe ARM topology.

A quick question on intended data representation in /sysfs topology
before I ask the team on this end to go down the (wrong?) path. On ARM
systems today, we have a hierarchical CPU topology:

        Socket ---- Coherent Interconnect ---- Socket
          |                                      |
  Cluster0 ... ClusterN                  Cluster0 ... ClusterN
     |            |                         |            |
Core0...CoreN  Core0...CoreN           Core0...CoreN  Core0...CoreN
 |         |    |         |             |         |    |         |
T0..TN  T0..TN T0..TN  T0..TN          T0..TN  T0..TN T0..TN  T0..TN

Where we might (or might not) have threads in individual cores (a la SMT
- it's allowed in the architecture at any rate), and we group cores
together into clusters, usually 2-4 cores in size (though this varies
between implementations, some of which have different but similar
concepts, such as the AppliedMicro Potenza PMD CPU complexes of dual
cores). There are multiple clusters per "socket", and there might be an
arbitrary number of sockets. We'll start to enable NUMA soon.

The existing ARM architectural code understands expressing topology in
terms of the above, but it doesn't quite map these concepts directly in
sysfs (does not expose cluster_ids as an example). Currently, a cpu-map
in DeviceTree can expose hierarchies (including nested clusters) and this
is parsed at boot time to populate scheduler information, as well as the
topology files in sysfs (if that is provided - none of the reference
devicetrees upstream do this today, but some exist). But the cluster
information itself isn't quite exposed (whereas other whacky
architectural concepts such as s390 books are exposed already today).

Anyway. We have a small problem with tools such as those in util-linux
(lscpu) getting confused as a result of translating x86-isms to ARM. For
example, the lscpu utility calculates the number of sockets using the
following computation:

nsockets = desc->ncpus / nthreads / ncores

(number of sockets = total number of online processing elements /
threads within a single core / cores within a single socket)

If you're not careful, you can end up with something like:

# lscpu
Architecture:          aarch64
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             4

Basically, in the top-most diagram, lscpu (& hwloc) are treating Cluster<N>
as socket<N>. I'm curious how the sysfs numa info will be interpreted
when/if that is turned on for arm64.
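
To show where the 4 comes from, here's a minimal sketch of the formula
above, with the values I'm assuming lscpu ends up with on this box
(8 cpus, no SMT, core_siblings covering only a 2-core cluster):

#include <stdio.h>

int main(void)
{
	int ncpus = 8;      /* online processing elements */
	int nthreads = 1;   /* bits set in thread_siblings */
	int ncores = 2;     /* bits set in core_siblings, i.e. one 2-core cluster */

	printf("Socket(s): %d\n", ncpus / nthreads / ncores);  /* prints 4, not 1 */
	return 0;
}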

Now we can argue that the system in question needs an updated cpu-map
(it'll actually be something ACPI-based, but I'm keeping this discussion
to DT to keep that piece out of it, and you can assume I'm booting any
test boxes in further work on this using DeviceTree prior to switching
the result over to ACPI), but either way, util-linux is thinking in an
x86-centric sense of what these files mean. And I think the existing
topology/cpu-map stuff in arm64 is doing the same.

The above values are extracted from the MPIDR:Affx fields and are currently
independent of DT & ACPI.
The Aff1 field is the 'cluster-id' and is being used to associate cpus (via
cpu masks) with their siblings. lscpu & hwloc associate cpu numbers & siblings
to sockets via the above calculation, which doesn't quite show how siblings
enter the equation:

ncores = CPU_COUNT_S(setsize, core_siblings) / nthreads;
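
To make the sibling path concrete, a rough sketch (not lscpu's actual
code) of how a cluster-scoped core_siblings mask turns into the
"Core(s) per socket: 2" above; it assumes a single hex word in the mask
file (i.e. <= 32 cpus) and no SMT:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/* sysfs exports the sibling cpumask as hex, e.g. "03" for cpus 0-1 */
	FILE *f = fopen("/sys/devices/system/cpu/cpu0/topology/core_siblings", "r");
	char buf[64];
	unsigned long mask;
	int nthreads = 1;   /* assumed: thread_siblings has a single bit set */

	if (!f)
		return 1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return 1;
	}
	fclose(f);
	mask = strtoul(buf, NULL, 16);
	printf("Core(s) per socket: %d\n", __builtin_popcountl(mask) / nthreads);
	return 0;
}

On the box above, cpu0's core_siblings only describes its 2-core cluster,
so this (and lscpu's equivalent CPU_COUNT_S() on the parsed mask) comes
out as 2 rather than the 8 cores actually in the package.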

Note: in the arm(32) tree, what was 'socket-id' is 'cluster-id' in arm64;
I believe this 'mapping' (backporting/association) is one root problem
in the arch/arm64/kernel/topology.c code.

Now, a simple fix, albeit one requiring lots of fun, cross-architecture
testing, would be to change lscpu to use the sysfs physical_package_id to
get Socket correct. Yet that won't fix the above 'Core(s) per socket',
because that's being created via the sibling masks, which are generated
from the cluster-id.
This change would also require arm(64) to implement DT & ACPI methods to
map pcpus to sockets (missing at the moment).
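
For reference, the sort of thing I mean, sketched rather than a patch
against lscpu; it assumes physical_package_id is populated with a real
socket number (not -1) and that all cpus are online with contiguous
sysfs directories:

#include <stdio.h>

int main(void)
{
	char path[128];
	int cpu, id, nsockets = 0;
	int seen[256] = { 0 };  /* assumes package ids fall in 0..255 */

	for (cpu = 0; ; cpu++) {
		FILE *f;
		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/topology/physical_package_id",
			 cpu);
		f = fopen(path, "r");
		if (!f)
			break;  /* no more cpus */
		if (fscanf(f, "%d", &id) == 1 && id >= 0 && id < 256 && !seen[id]) {
			seen[id] = 1;
			nsockets++;
		}
		fclose(f);
	}
	printf("Socket(s): %d\n", nsockets);
	return 0;
}

Counting distinct physical_package_id values sidesteps the division
entirely, but as noted it only helps once arm64 actually fills that file
in from DT/ACPI with a socket rather than a cluster.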

And modifying the cluster-id and/or the sibling masks creates non-topology
(non-lscpu, non-hwloc) issues, like breaking GIC init code paths which use
the cluster-id information as well ... some 'empirical data' to note in
case anyone thinks it's just a topology-presentation issue.

Is it not a good idea to expose the cluster details directly in sysfs
and have these utilities understand the possible extra level in the
calculation? Or do we want to just fudge the numbers (as seems to be the
case in some systems I am seeing) to make the x86 model add up?

Short-term, I'm trying to develop a reasonable 'fudge' for lscpu & hwloc
that doesn't impact the (proper) operation of the GIC code.
I haven't dug deep enough yet, but this also requires a check on how the
scheduler uses cpu-cache-sibling affinity when selecting the optimal cpu
to schedule threads on.

Let me know the preferred course...

Jon.
