AFAIU, the kthread_bind_mask() or the kthread_bin_cpu() functions will set PF_NO_SETAFFINITY.

Acked-by: Frederic Weisbecker <frederic@xxxxxxxxxx>

On Thu, May 08, 2025 at 03:24:13PM -0400, Waiman Long wrote:
Commit ec5fbdfb99d1 ("cgroup/cpuset: Enable update_tasks_cpumask()
on top_cpuset") enabled us to pull CPUs dedicated to child partitions
from tasks in top_cpuset by ignoring per cpu kthreads. However, there
can be other kthreads that are not per cpu but have PF_NO_SETAFFINITY
flag set to indicate that we shouldn't mess with their CPU affinity.
For other kthreads, their affinity will be changed to skip CPUs dedicated
to child partitions whether it is an isolating or a scheduling one.
As all the per cpu kthreads have PF_NO_SETAFFINITY set, the
PF_NO_SETAFFINITY tasks are essentially a superset of per cpu kthreads.
Fix this issue by dropping the kthread_is_per_cpu() check and checking
the PF_NO_SETAFFINITY flag instead.
Fixes: ec5fbdfb99d1 ("cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset")
Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
---
kernel/cgroup/cpuset.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index d0143b3dce47..967603300ee3 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1130,9 +1130,11 @@ void cpuset_update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
 		if (top_cs) {
 			/*
-			 * Percpu kthreads in top_cpuset are ignored
+			 * PF_NO_SETAFFINITY tasks are ignored.
+			 * All per cpu kthreads should have PF_NO_SETAFFINITY
+			 * flag set, see kthread_set_per_cpu().
 			 */
-			if (kthread_is_per_cpu(task))
+			if (task->flags & PF_NO_SETAFFINITY)
 				continue;
 			cpumask_andnot(new_cpus, possible_mask, subpartitions_cpus);
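
For reference, the superset argument in the changelog can be sketched in
userspace C. This is only an illustration: struct demo_task, the
DEMO_PF_NO_SETAFFINITY value, and both helper names are hypothetical
stand-ins, not kernel definitions. It assumes, as the changelog states,
that every per cpu kthread also carries the flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for the kernel flag; the value is arbitrary. */
#define DEMO_PF_NO_SETAFFINITY 0x1u

/* Hypothetical, heavily reduced model of a task. */
struct demo_task {
	unsigned int flags;
	bool per_cpu_kthread;	/* what kthread_is_per_cpu() would report */
};

/* Old behaviour: only per cpu kthreads are skipped. */
static bool demo_skip_old(const struct demo_task *t)
{
	assert(t != NULL);
	return t->per_cpu_kthread;
}

/* New behaviour: any task with the no-setaffinity flag is skipped. */
static bool demo_skip_new(const struct demo_task *t)
{
	assert(t != NULL);
	return (t->flags & DEMO_PF_NO_SETAFFINITY) != 0;
}
```

A kthread that is bound but not per cpu (flag set, per_cpu_kthread false)
is the case the two checks disagree on: the old check would rewrite its
affinity, the new one leaves it alone.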
But this makes me realize I overlooked this when I introduced the centralized
affinity for unbound kthreads.
cpuset_update_tasks_cpumask() seems to blindly set affinity based on subpartitions_cpus,
while unbound kthreads might have their own preferences (per-node or arbitrary cpumasks).
So I need to make that go through the kthread API.
Most users that want isolated CPUs will set both isolcpus and nohz_full to the same set of CPUs. I do see that RH OpenShift can set nohz_full for a collection of CPUs that may be dynamically isolated later on via cpuset partition.
It seems that subpartitions_cpus doesn't contain nohz_full= CPUs.
But it excludes isolcpus=. And it's usually sane to assume that
nohz_full= CPUs are isolated.
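
To make the concern concrete, here is a rough userspace sketch of the mask
arithmetic, using 64-bit words in place of struct cpumask. The helper names
and the example masks are hypothetical. It only mirrors the
cpumask_andnot(new_cpus, possible_mask, subpartitions_cpus) call from the
patch context and assumes nohz_full= CPUs are not tracked in
subpartitions_cpus.

```c
#include <assert.h>
#include <stdint.h>

/* Same semantics as cpumask_andnot(dst, src1, src2): dst = src1 & ~src2. */
static uint64_t demo_andnot(uint64_t src1, uint64_t src2)
{
	return src1 & ~src2;
}

/*
 * Affinity a top_cpuset task would get: every possible CPU except those
 * handed to child partitions. Nothing here knows about nohz_full=.
 */
static uint64_t demo_top_cpuset_affinity(uint64_t possible,
					 uint64_t subpartitions)
{
	/* Partition CPUs are expected to be a subset of possible CPUs. */
	assert((subpartitions & ~possible) == 0);
	return demo_andnot(possible, subpartitions);
}
```

With possible = 0xFF, subpartitions = 0x30 (CPUs 4-5) and a nohz_full= set
of 0xC0 (CPUs 6-7), the result is 0xCF: CPUs 6-7 stay in the mask, so an
unbound kthread could still be affined to them.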
I think I can just rename update_unbound_workqueue_cpumask()
to update_unbound_kthreads_cpumask() and then handle unbound
kthreads from there along with workqueues. And then completely
ignore kthreads from cpuset_update_tasks_cpumask().
Let me think about it (but feel free to apply the current patch meanwhile).
Thanks.