Re: [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention

From: Chen, Yu C
Date: Mon Jul 21 2025 - 07:24:25 EST


On 7/7/2025 10:35 AM, Pan Deng wrote:
When running a multi-instance FFmpeg workload on an HCC system, significant
contention is observed on the bitmap of `cpupri_vec->cpumask`.

The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS is used as the score.

perf c2c tool reveals:
cpumask (bitmap) cache line of `cpupri_vec->mask`:
- bits are loaded during cpupri_find
- bits are stored during cpupri_set
- cycles per load: ~2.2K to 8.7K

This change splits `cpupri_vec->cpumask` into per-NUMA-node data to
mitigate false sharing.
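
A minimal userspace sketch of the idea described above, not the code from
the patch: the single shared bitmap is replaced by one bitmap per NUMA
node, each aligned to its own cache line, so that an update for a CPU on
one node does not write the line holding another node's bits. All names
here (`vec_sketch`, `node_mask`, `vec_set_cpu`, and so on) are
hypothetical, as are the assumptions of one node per socket, a 64-byte
cache line, and contiguous CPU numbering within a node.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NR_CPUS        480  /* logical CPUs on the SUT described above */
#define NR_NODES       2    /* assumption: one NUMA node per socket */
#define CPUS_PER_NODE  (NR_CPUS / NR_NODES)
#define BITS_PER_WORD  64
#define WORDS_PER_NODE ((CPUS_PER_NODE + BITS_PER_WORD - 1) / BITS_PER_WORD)
#define CACHE_LINE     64   /* assumption: 64-byte cache lines */

/* Hypothetical per-node bitmap, padded out to its own cache line. */
struct node_mask {
	_Alignas(CACHE_LINE) _Atomic uint64_t bits[WORDS_PER_NODE];
};

/* Hypothetical stand-in for one priority vector: one mask per node. */
struct vec_sketch {
	struct node_mask node[NR_NODES];
};

/* Simplification: CPUs are numbered contiguously within each node. */
static int cpu_to_node_sketch(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return cpu / CPUS_PER_NODE;
}

static void vec_init(struct vec_sketch *v)
{
	for (int n = 0; n < NR_NODES; n++)
		for (int w = 0; w < WORDS_PER_NODE; w++)
			atomic_init(&v->node[n].bits[w], 0);
}

/* Store side: only the cache line of the CPU's own node is written. */
static void vec_set_cpu(struct vec_sketch *v, int cpu)
{
	int local = cpu % CPUS_PER_NODE;
	atomic_fetch_or(&v->node[cpu_to_node_sketch(cpu)].bits[local / BITS_PER_WORD],
			UINT64_C(1) << (local % BITS_PER_WORD));
}

static void vec_clear_cpu(struct vec_sketch *v, int cpu)
{
	int local = cpu % CPUS_PER_NODE;
	atomic_fetch_and(&v->node[cpu_to_node_sketch(cpu)].bits[local / BITS_PER_WORD],
			 ~(UINT64_C(1) << (local % BITS_PER_WORD)));
}

static int vec_test_cpu(struct vec_sketch *v, int cpu)
{
	int local = cpu % CPUS_PER_NODE;
	uint64_t word = atomic_load(&v->node[cpu_to_node_sketch(cpu)].bits[local / BITS_PER_WORD]);
	return (int)((word >> (local % BITS_PER_WORD)) & 1);
}

/*
 * Load side: scan the caller's node first, then the remaining nodes.
 * Returns the first set CPU found, or -1 if no bit is set.
 */
static int vec_find_first(struct vec_sketch *v, int start_node)
{
	for (int i = 0; i < NR_NODES; i++) {
		int n = (start_node + i) % NR_NODES;
		for (int w = 0; w < WORDS_PER_NODE; w++) {
			uint64_t word = atomic_load(&v->node[n].bits[w]);
			for (int b = 0; word && b < BITS_PER_WORD; b++)
				if ((word >> b) & 1)
					return n * CPUS_PER_NODE + w * BITS_PER_WORD + b;
		}
	}
	return -1;
}
```

In this sketch a caller would initialize the structure with `vec_init()`,
mark CPUs with `vec_set_cpu()`, and search with `vec_find_first()` starting
from its own node. The trade-off is that a search across all nodes touches
one cache line per node instead of a single shared one.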

As a result:
- FPS improves by ~3.8%
- Kernel cycles% drops from ~20% to ~18.7%
- Cache line contention is mitigated: perf c2c shows cycles per load
dropping from ~2.2K-8.7K to ~0.5K-2.2K


This brings a noticeable improvement for the RT workload. It would be
even more convincing if we could also try a normal (non-RT) task workload
such as schbench or hackbench, to at least confirm that it does not
introduce a regression.

thanks,
Chenyu

Note: CONFIG_CPUMASK_OFFSTACK=n remains unchanged.