Re: [RFC/PATCH] sched: Support moving kthreads into cpuset cgroups
From: Xi Wang
Date: Thu May 08 2025 - 18:40:01 EST
On Thu, May 8, 2025 at 12:35 PM Waiman Long <llong@xxxxxxxxxx> wrote:
>
> On 5/8/25 1:51 PM, Xi Wang wrote:
> > I think our problem spaces are different. Perhaps your problems are closer to
> > hard real-time systems but our problems are about improving latency of existing
> > systems while maintaining efficiency (max supported cpu util).
> >
> > For hard real-time systems we sometimes throw cores at the problem and run no
> > more than one thread per cpu. But if we want efficiency we have to share cpus
> > with scheduling policies. Disconnecting the cpu scheduler with isolcpus results
> > in losing too much of the machine capacity. CPU scheduling is needed for both
> > kernel and userspace threads.
> >
> > For our use case we need to move kernel threads away from certain vcpu threads,
> > but other vcpu threads can share cpus with kernel threads. The ratio changes
> > from time to time. Permanently putting aside a few cpus results in a reduction
> > in machine capacity.
> >
> > The PF_NO_SETAFFINITY case is already handled by the patch. These threads will
> > run in root cgroup with affinities just like before.
> >
> > The original justification for the cpuset feature is here and the reasons are
> > still applicable:
> >
> > "The management of large computer systems, with many processors (CPUs), complex
> > memory cache hierarchies and multiple Memory Nodes having non-uniform access
> > times (NUMA) presents additional challenges for the efficient scheduling and
> > memory placement of processes."
> >
> > "But larger systems, which benefit more from careful processor and memory
> > placement to reduce memory access times and contention.."
> >
> > "These subsets, or “soft partitions” must be able to be dynamically adjusted, as
> > the job mix changes, without impacting other concurrently executing jobs."
> >
> > https://docs.kernel.org/admin-guide/cgroup-v1/cpusets.html
> >
> > -Xi
> >
> If you create a cpuset root partition, we are pushing some kthreads
> away from CPUs dedicated to the newly created partition which has its
> own scheduling domain separate from the cgroup root. I do realize that
> the current way of excluding only per cpu kthreads isn't quite right. So
> I send out a new patch to extend to all the PF_NO_SETAFFINITY kthreads.
>
> So instead of putting kthreads into the dedicated cpuset, we still keep
> them in the root cgroup. Instead we can create a separate cpuset
> partition to run the workload without interference from the background
> kthreads. Will that functionality suit your current need?
>
> Cheers,
> Longman
>
It's likely a major improvement over a fixed partition but maybe still not fully
flexible. I am not familiar with cpuset partitions but I wonder if the following
case can be supported:

Starting from:

  16 cpus
  Root has cpu 0-3, 8-15
  Job A has cpu 4-7 exclusive

Kernel threads cannot run on cpu 4-7, which is good.

Now add best-effort Job B, which runs under SCHED_IDLE and rarely enters kernel
mode. Since we expect B to be easily preempted, we allow it to share cpus with A
and with kernel threads to maximize throughput. Is there a layout that supports
the requirements below?

  Job B threads on cpu 0-15
  Job A threads on cpu 4-7
  No kernel threads on cpu 4-7
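
To pin down what I am asking for, the requirements can be written as plain set
arithmetic. The Python below is only a sketch: the names (parse_cpulist,
check_layout, best_effort) are made up for illustration, and it only models cpu
masks. It does not configure cpusets or touch cgroupfs.

```python
# Hypothetical model of the requested layout; all names are illustrative.
ALL_CPUS = set(range(16))  # the 16-cpu machine from the example


def parse_cpulist(text):
    """Expand a cpulist string such as "0-3,8-15" into a set of ints."""
    cpus = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        lo, _, hi = part.partition("-")
        # A single cpu such as "5" has no "-", so hi is empty; reuse lo.
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


job_a = parse_cpulist("4-7")         # latency-sensitive job
best_effort = parse_cpulist("0-15")  # SCHED_IDLE job, may run anywhere
kthreads = ALL_CPUS - job_a          # kernel threads kept off Job A's cpus


def check_layout(job_a, best_effort, kthreads):
    """Return one boolean per requirement in the list above."""
    return {
        "best_effort_spans_all_cpus": best_effort == ALL_CPUS,
        "job_a_confined_to_4_7": job_a == parse_cpulist("4-7"),
        "no_kthreads_on_job_a_cpus": not (kthreads & job_a),
        "best_effort_shares_with_job_a": bool(best_effort & job_a),
    }
```

The point of the last check is that the best-effort job overlaps Job A's cpus
while kernel threads do not, so Job A's cpus cannot be exclusive to Job A alone.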
-Xi