RE: [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention

From: Deng, Pan
Date: Tue Jul 22 2025 - 10:52:17 EST



> -----Original Message-----
> From: Chen, Yu C <yu.c.chen@xxxxxxxxx>
> Sent: Monday, July 21, 2025 7:24 PM
> To: Deng, Pan <pan.deng@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; Li, Tianyou <tianyou.li@xxxxxxxxx>;
> tim.c.chen@xxxxxxxxxxxxxxx; peterz@xxxxxxxxxxxxx; mingo@xxxxxxxxxx
> Subject: Re: [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA
> node to reduce contention
>
> On 7/7/2025 10:35 AM, Pan Deng wrote:
> > When running a multi-instance FFmpeg workload on an HCC system,
> > significant contention is observed on the bitmap of `cpupri_vec->cpumask`.
> >
> > The SUT is a 2-socket machine with 240 physical cores and 480 logical
> > CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical
> > cores
> > (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
> > with FIFO scheduling. FPS is used as the score.
> >
> > The perf c2c tool reveals:
> > cpumask (bitmap) cache line of `cpupri_vec->mask`:
> > - bits are loaded during cpupri_find
> > - bits are stored during cpupri_set
> > - cycles per load: ~2.2K to 8.7K
> >
> > This change splits `cpupri_vec->cpumask` into per-NUMA-node data to
> > mitigate false sharing.
> >
> > As a result:
> > - FPS improves by ~3.8%
> > - Kernel cycles% drops from ~20% to ~18.7%
> > - Cache line contention is mitigated; perf c2c shows cycles per load
> > drops from ~2.2K-8.7K to ~0.5K-2.2K
> >
>
> This brings a noticeable improvement for RT workloads, and it would be even
> more convincing if we could also try it with a normal task workload, at least
> to confirm it does not introduce regressions (schbench/hackbench, etc.).
>

Thanks Yu, hackbench and schbench data will be provided later.
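For reference, below is a minimal sketch of the per-NUMA-node split described
in the commit message. The field and helper names (node_mask,
cpupri_vec_set_cpu, cpupri_vec_intersects) are illustrative only, not the
exact code in the patch, and it assumes CONFIG_CPUMASK_OFFSTACK=y with the
per-node masks allocated at init time:

/*
 * Illustrative sketch only -- names are made up for this mail, not taken
 * from the patch.  Uses <linux/cpumask.h>, <linux/topology.h> and
 * <linux/nodemask.h>.
 */
struct cpupri_vec {
	atomic_t	count;
	/* one mask per NUMA node instead of one shared mask */
	cpumask_var_t	node_mask[MAX_NUMNODES];
};

/* writer side (cpupri_set-like path): only dirty the local node's mask */
static inline void cpupri_vec_set_cpu(struct cpupri_vec *vec, int cpu)
{
	cpumask_set_cpu(cpu, vec->node_mask[cpu_to_node(cpu)]);
}

static inline void cpupri_vec_clear_cpu(struct cpupri_vec *vec, int cpu)
{
	cpumask_clear_cpu(cpu, vec->node_mask[cpu_to_node(cpu)]);
}

/* reader side (cpupri_find-like path): scan the per-node masks read-only */
static inline bool cpupri_vec_intersects(struct cpupri_vec *vec,
					 const struct cpumask *search)
{
	int node;

	for_each_node(node)
		if (cpumask_intersects(search, vec->node_mask[node]))
			return true;

	return false;
}

The point is that a store from cpupri_set() on a CPU of node 0 no longer
invalidates the cache line that cpupri_find() on node 1 keeps re-reading;
cross-node traffic only occurs when a reader actually has to examine a
remote node's mask.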


> thanks,
> Chenyu
>
> > Note: CONFIG_CPUMASK_OFFSTACK=n remains unchanged.
> >
>