Re: [PATCH 00/10] Add support for Sub-NUMA cluster (SNC) systems

From: Reinette Chatre
Date: Thu May 02 2024 - 13:01:17 EST


Hi Tony,

On 3/27/2024 1:03 PM, Tony Luck wrote:
> This series is based on top of v6.9-rc1 plus these two patches:
>
> Link: https://lore.kernel.org/all/20240308213846.77075-1-tony.luck@xxxxxxxxx/
>
> The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
> that share an L3 cache into two or more sets. This plays havoc with the
> Resource Director Technology (RDT) monitoring features. Prior to this
> series, Intel advised that SNC and RDT were incompatible.
>
> Some of these CPUs support an MSR that can partition the RMID counters
> in the same way, which allows the monitoring features to be used. The
> caveat is that users must be aware that Linux may migrate tasks more
> frequently between SNC nodes than between "regular" NUMA nodes, so
> reading counters from all SNC nodes may be needed to get a complete
> picture of activity for tasks.
>
> Cache and memory bandwidth allocation features continue to operate at
> the scope of the L3 cache.
>
> This is a new approach triggered by the discussions that started with
> "How can users tell that SNC is enabled?" but then drifted into
> whether users of the legacy interface would really get what they
> expected when reading from monitor files in the mon_L3_* directories.
>
> During that discussion I'd mentioned providing monitor values for both
> the L3 level, and also for each SNC node. That would provide full ABI
> compatibility while also giving the finer grained reporting from each
> SNC node.
>
> Implementation sets up a new rdt_resource to track monitor resources
> with domains for each SNC node. This resource is only used when SNC
> mode is detected.
>
> On SNC systems there is a parent-child relationship between the
> old L3 resource and the new SUBL3 resource. Reading from legacy
> files like mon_data/mon_L3_00/llc_occupancy reads and sums the RMID
> counters from all "child" domains in the SUBL3 resource. E.g. on
> an SNC3 system:
>
> $ grep . mon_L3_01/llc_occupancy mon_L3_01/*/llc_occupancy
> mon_L3_01/llc_occupancy:413097984
> mon_L3_01/mon_SUBL3_03/llc_occupancy:141484032
> mon_L3_01/mon_SUBL3_04/llc_occupancy:135659520
> mon_L3_01/mon_SUBL3_05/llc_occupancy:135954432
>
> So reading mon_L3_01/llc_occupancy shows the total L3 occupancy,
> which is the sum of the cache occupancy on each of the SNC nodes
> that share that L3 cache instance.
>

This is back to the "duplicate a resource" solution that unnecessarily
duplicates an entire resource and splits data structure use between two
resources, each managing one portion. My objections to the foundation
on which this solution is built remain.

This solution continues to build on that foundation in a way that
I find inappropriate. It wedges a new architectural concept
("parent" and "child" resources) into resctrl but does not integrate it.
All resources continue to be managed the same way (not hierarchically)
while this artificial relationship is used only selectively, when
convenient for SNC.
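
For illustration, the kind of construct I am referring to looks
something like the sketch below. The names here are mine, not
necessarily those used in the patch: the resource grows a "parent"
link that only the SNC monitoring path ever follows, while domain
management and everything else keep treating resources as flat peers.

/*
 * My sketch, not the patch's actual code: a "parent" link bolted
 * onto the resource. Only the SNC monitoring path follows it;
 * every other code path still sees a flat list of resources.
 */
struct rdt_resource_sketch {
	int				rid;
	const char			*name;
	struct list_head		mon_domains;
	struct rdt_resource_sketch	*parent; /* NULL except for SUBL3 */
};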

This solution also brings back a concept that was removed and yet
claims that it is "new" (see bab6ee736873 ("x86/resctrl: Merge
mon_capable and mon_enabled")). Having gone through the effort of
cleaning this up, bringing it back deserves a thorough justification.
I do not see one, though, since the concept is used as a workaround so
that resctrl can deal with the unnecessary duplication of the
monitoring resource. The same applies to the rdt_l3_mon_resource
global pointer introduced here, which is another workaround for the
unnecessary duplication. These "tricks" needed to get this solution to
work make it all very difficult to understand and maintain.

Using a "parent" resource to decide if one domain is a "child" of
another domain looks arbitrary to me. Just because the CPU mask of one
domain of one resource is a subset of the CPU mask of another domain of
another resource does not make the domain a child domain of the other.
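
To make that concrete, I understand the relationship boils down to
something like the test below (my sketch; cpumask_subset() is the
existing kernel helper, the function and its use here are my own):

#include <linux/cpumask.h>
#include <linux/resctrl.h>

/*
 * Sketch of the kind of check I understand is used: a SUBL3 domain
 * counts as a "child" of an L3 domain merely because its CPUs happen
 * to be a subset of the L3 domain's CPUs. Overlapping CPU masks are
 * a property of the topology, not an architectural parent/child link.
 */
static bool domain_is_child(struct rdt_domain *sub, struct rdt_domain *l3)
{
	return cpumask_subset(&sub->cpu_mask, &l3->cpu_mask);
}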

It looks to me as though the "info" directory will now display two
monitoring resources to user space? I find that confusing and it is
not clear to me what the data exposed in that way means ... for
example, will both now allow setting the global
max_threshold_occupancy? How should the number of RMIDs be
interpreted?

As I understand it, this rewrite aims to support the new file
hierarchy and to have the higher level files contain the sum of
the files below. I do not think a rewrite is necessary to support
this. I continue to believe that the previous SNC enabling is
appropriate and, as I understand it, support for this new display to
user space can be added to it. For example, could this not be
accomplished by adding one new member to struct rdt_resource that
specifies the scope at which to display the monitor data to user
space? mkdir_mondata_subdir() can use this field to decide how to
create the files ... if the "display" scope is the same as the
monitoring scope then the files are created as they are now; if the
"display" scope is different (required to be a superset) then the
files are created within a directory that matches the "display" scope.
I do not see why it would be required to work from a stored CPU mask
to determine relationships. I believe the id of the "display" scope in
which to create a sub-directory can be determined dynamically at that
time. Similarly, the metadata of the files can be used to indicate
when a sum is required. A sketch of this idea follows.
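
All names below are my own for illustration (the previous series
already has an enum resctrl_scope and a mon_scope member along these
lines; mon_display_scope and the helper are hypothetical):

/*
 * Hypothetical sketch of the alternative: the single monitoring
 * resource gains a "display" scope alongside its monitoring scope.
 * No duplicated resource, no parent/child links.
 */
enum resctrl_scope_sketch {
	SCOPE_L3_CACHE,		/* scope is an L3 cache instance */
	SCOPE_SNC_NODE,		/* scope is an SNC node */
};

struct rdt_resource_sketch {
	enum resctrl_scope_sketch mon_scope;		/* counters read here */
	enum resctrl_scope_sketch mon_display_scope;	/* files shown here */
};

/*
 * mkdir_mondata_subdir() could key off this: when the display scope
 * differs from (and is a superset of) the monitoring scope, create
 * the per-domain files inside a sub-directory named for the display
 * scope and mark the parent directory's files as sums of the
 * children.
 */
static bool mon_files_need_subdir(const struct rdt_resource_sketch *r)
{
	return r->mon_display_scope != r->mon_scope;
}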

Reinette