Re: [RFC/PATCH] sched: Support moving kthreads into cpuset cgroups

From: Waiman Long
Date: Thu May 08 2025 - 15:35:17 EST


On 5/8/25 1:51 PM, Xi Wang wrote:
I think our problem spaces are different. Perhaps your problems are closer to
hard real-time systems but our problems are about improving latency of existing
systems while maintaining efficiency (max supported cpu util).

For hard real-time systems we sometimes throw cores at the problem and run no
more than one thread per cpu. But if we want efficiency we have to share cpus
with scheduling policies. Disconnecting the cpu scheduler with isolcpus results
in losing too much of the machine capacity. CPU scheduling is needed for both
kernel and userspace threads.

For our use case we need to move kernel threads away from certain vcpu threads,
but other vcpu threads can share cpus with kernel threads. The ratio changes
from time to time. Permanently putting aside a few cpus results in a reduction
in machine capacity.

The PF_NO_SETAFFINITY case is already handled by the patch. These threads will
run in the root cgroup with the same affinities as before.

The original justifications for the cpuset feature are here, and the reasons
still apply:

"The management of large computer systems, with many processors (CPUs), complex
memory cache hierarchies and multiple Memory Nodes having non-uniform access
times (NUMA) presents additional challenges for the efficient scheduling and
memory placement of processes."

"But larger systems, which benefit more from careful processor and memory
placement to reduce memory access times and contention.."

"These subsets, or “soft partitions” must be able to be dynamically adjusted, as
the job mix changes, without impacting other concurrently executing jobs."

https://docs.kernel.org/admin-guide/cgroup-v1/cpusets.html

-Xi

If you create a cpuset root partition, we push some kthreads away from the CPUs dedicated to the newly created partition, which has its own scheduling domain separate from the cgroup root. I do realize that the current way of excluding only per-CPU kthreads isn't quite right, so I sent out a new patch to extend the exclusion to all PF_NO_SETAFFINITY kthreads.
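
To make it concrete which kthreads fall into that category, here is a minimal userspace sketch (not part of any patch) that checks the task flags reported in /proc/<pid>/stat. The flag values are the ones defined in include/linux/sched.h; the helper names task_flags() and is_pinned_kthread() are hypothetical and only for illustration.

```python
# Flag values as defined in include/linux/sched.h.
PF_KTHREAD = 0x00200000         # task is a kernel thread
PF_NO_SETAFFINITY = 0x04000000  # userland may not change cpus_mask


def task_flags(stat_line: str) -> int:
    """Return the flags field (field 9) of a /proc/<pid>/stat line."""
    # comm (field 2) is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' before tokenizing.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest = [state, ppid, pgrp, session, tty_nr, tpgid, flags, ...]
    return int(rest[6])


def is_pinned_kthread(stat_line: str) -> bool:
    """True for kthreads whose affinity userland cannot change."""
    flags = task_flags(stat_line)
    return bool(flags & PF_KTHREAD) and bool(flags & PF_NO_SETAFFINITY)
```

For a live task, pass in the contents of /proc/<pid>/stat, e.g. `is_pinned_kthread(open("/proc/2/stat").read())`.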

So rather than putting kthreads into the dedicated cpuset, we still keep them in the root cgroup. We can then create a separate cpuset partition to run the workload without interference from the background kthreads. Will that functionality suit your current need?
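
For reference, setting up such a partition could look roughly like the sketch below. It assumes cgroup v2 mounted at /sys/fs/cgroup and root privileges; the cgroup name "workload" and the CPU list "8-15" are made-up examples, and create_cpuset_partition() is a hypothetical helper. Which kthreads actually get moved off those CPUs depends on the kernel version and the patches under discussion.

```python
from pathlib import Path


def create_cpuset_partition(name, cpus, cgroup_root="/sys/fs/cgroup"):
    """Create a child cgroup and turn it into a cpuset partition root."""
    root = Path(cgroup_root)

    # Make the cpuset controller available to children of the root cgroup.
    (root / "cgroup.subtree_control").write_text("+cpuset")

    # Create the child cgroup that will own the dedicated CPUs.
    part = root / name
    part.mkdir(exist_ok=True)

    # Assign the CPUs, then request a partition root so the child gets
    # its own scheduling domain separate from the cgroup root.
    (part / "cpuset.cpus").write_text(cpus)
    (part / "cpuset.cpus.partition").write_text("root")
    return part
```

Example (hypothetical values): `create_cpuset_partition("workload", "8-15")`, then write the workload's PIDs to workload/cgroup.procs. Reading cpuset.cpus.partition back afterwards tells you whether the partition is valid.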

Cheers,
Longman