Re: IPC drop down on AMD epyc 7702P

From: Libo Chen
Date: Wed Apr 30 2025 - 22:47:13 EST


Hi Prateek,

On 4/30/25 04:29, K Prateek Nayak wrote:
> Hello Libo,
>
> On 4/30/2025 4:11 PM, Libo Chen wrote:
>>
>>
>> On 4/30/25 02:13, K Prateek Nayak wrote:
>>> (+ more scheduler folks)
>>>
>>> tl;dr
>>>
>>> JB has a workload that hates aggressive migration on the 2nd Generation
>>> EPYC platform that has a small LLC domain (4C/8T) and very noticeable
>>> C2C latency.
>>>
>>> Based on JB's observation so far, reverting commit 16b0a7a1a0af
>>> ("sched/fair: Ensure tasks spreading in LLC during LB") and commit
>>> c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost
>>> condition") helps the workload. Both those commits allow aggressive
>>> migrations for work conservation except it also increased cache
>>> misses which slows the workload quite a bit.
>>>
>>> "relax_domain_level" helps but cannot be set at runtime and I couldn't
>>> think of any stable / debug interfaces that JB hasn't tried out
>>> already that can help this workload.
>>>
>>> There is a patch towards the end to set "relax_domain_level" at
>>> runtime but given that cpusets did away with this when transitioning
>>> to cgroup-v2, I don't know what the sentiment is around its usage.
>>> Any input / feedback is greatly appreciated.
>>>
>>
>>
>> Hi Prateek,
>>
>> Oh no, not "relax_domain_level" again; it can lead to load imbalance
>> in a variety of ways. We were so glad this one went away with cgroupv2,
>
> I agree it is not pretty. JB also tried strategic pinning and they
> did report that things are better overall but unfortunately, it is
> very hard to deploy across multiple architectures and would also
> require some redesign + testing from their application side.
>

I was stressing more broadly how badly setting "relax_domain_level"
can go wrong if a user doesn't realize that it essentially disables
newidle balancing at the higher levels, so the ability to balance load
across CCXes or NUMA nodes becomes a lot weaker. A subset of CCXes may
consistently end up carrying much more load for a whole bunch of
reasons. That is sometimes hard to spot in testing, but it does show
up in real-world scenarios, especially when users have other odd hacks
in place.
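
For context on how blunt the knob is: if I remember correctly, the
requested level ends up in set_domain_attribute() in
kernel/sched/topology.c, which does roughly the following (paraphrased
from memory, so the details may differ between kernel versions):

static void set_domain_attribute(struct sched_domain *sd,
				 struct sched_domain_attr *attr)
{
	int request;

	if (!attr || attr->relax_domain_level < 0) {
		/* Fall back to the boot-time default, if any. */
		if (default_relax_domain_level < 0)
			return;
		request = default_relax_domain_level;
	} else {
		request = attr->relax_domain_level;
	}

	/* Every domain at or above the requested level loses its
	 * wake/newidle balancing flags. */
	if (sd->level >= request)
		sd->flags &= ~(SD_BALANCE_WAKE | SD_BALANCE_NEWIDLE);
}

So everything at or above that level quietly loses SD_BALANCE_NEWIDLE
(and SD_BALANCE_WAKE), which is exactly the cross-CCX / cross-node
pulling I was worried about.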

>> it tends to be abused by users as an "easy" fix for some urgent perf
>> issues instead of addressing their root causes.
>
> Was there ever a report of a similar issue where migrations for the
> right reasons led to performance degradation as a result of the
> platform architecture? I doubt there is a straightforward way to
> solve this using the current interfaces - at least I haven't found
> one yet.
>

It wasn't due to platform architecture for us but more of an "exotic"
NUMA topology (like a cube: a node is one hop away from 3 neighbors
and two hops away from the other 4), in combination with certain
user-level settings that caused more wakeups in a subset of domains.
If relax_domain_level is left untouched, you get no load imbalance but
the perf is bad. Once you set relax_domain_level to restrict newidle
balancing to the lower domain levels, you actually see better
performance numbers in testing, even though CPU loads are not well
balanced. Until one day you find out the imbalance is so bad that it
slows everything down. Luckily it wasn't too hard to fix from the
application side.
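
For anyone who hasn't touched this since the cgroup-v1 days: the knob
is the per-cpuset cpuset.sched_relax_domain_level file (or the
relax_domain_level= boot parameter for the system-wide default).
A minimal userspace sketch of flipping it, assuming the usual v1
cpuset mount at /sys/fs/cgroup/cpuset and using 2 purely as an example
level (the right number depends on which sched domains the platform
actually builds):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Path and level are assumptions for illustration only. */
	const char *path =
		"/sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level";
	const char *level = "2";	/* domains at level >= 2 lose newidle/wake balancing */
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, level, strlen(level)) < 0)
		perror("write");
	close(fd);
	return 0;
}

There is nothing more to it than that, which is part of why it is so
tempting to reach for as a quick fix.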

I get that it may not be easy to fix from their application side in
this case, but I still think this is too hacky and one may end up
regretting it.

I certainly want to hear what others think about relax_domain_level!

> Perhaps cache-aware scheduling is the way forward to solve this set
> of issues, as Peter highlighted.
>

Hope so! We will start testing that series and provide feedback.


Thanks,
Libo