[RFC PATCH 0/3] sched: add ability to throttle sched_yield() calls to reduce contention

From: Kuba Piecuch
Date: Fri Aug 08 2025 - 16:03:40 EST


Problem statement
=================

Calls to sched_yield() can touch data shared with other threads.
Because of this, userspace threads could generate high levels of contention
by calling sched_yield() in a tight loop from multiple cores.

For example, if cputimer is enabled for a process (e.g. through
setitimer(ITIMER_PROF, ...)), all threads of that process
will do an atomic add on the per-process field
p->signal->cputimer.cputime_atomic.sum_exec_runtime inside
account_group_exec_runtime(), which is called from update_curr().
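
For illustration, the accounting helper looks roughly like this (simplified
sketch, close to but not a verbatim copy of the kernel source):

static inline void account_group_exec_runtime(struct task_struct *tsk,
					      unsigned long long ns)
{
	struct thread_group_cputimer *cputimer = get_running_cputimer(tsk);

	if (!cputimer)
		return;

	/* Shared per-process cacheline; every running thread adds to it. */
	atomic64_add(ns, &cputimer->cputime_atomic.sum_exec_runtime);
}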

Currently, calling sched_yield() will always call update_curr() at least
once in schedule(), and potentially one more time in yield_task_fair().
Thus, userspace threads can generate quite a lot of contention for the
cacheline containing cputime_atomic.sum_exec_runtime if multiple threads of
a process call sched_yield() in a tight loop.

At Google, we suspect that this contention led to a full machine lockup in
at least one instance, with ~50% of CPU cycles spent in the atomic add
inside account_group_exec_runtime() according to
`perf record -a -e cycles`.


Proposed solution
=================

To alleviate the contention, this patchset introduces the ability to limit
how frequently a thread is allowed to yield. It adds a new sched debugfs
knob called yield_interval_ns. A thread is allowed to yield at most once
every yield_interval_ns nanoseconds. Subsequent calls to sched_yield()
within the interval simply return without calling schedule().

The default value of the knob is 0, meaning the throttling feature is
disabled by default.
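
To make the mechanism concrete, here is a rough sketch of the check on the
sched_yield() path (illustrative only, not the actual patch; the names
sysctl_sched_yield_interval_ns and last_yield_ns are made up for this
example):

static bool yield_is_throttled(struct task_struct *p)
{
	u64 interval = READ_ONCE(sysctl_sched_yield_interval_ns);
	u64 now;

	if (!interval)
		return false;		/* interval == 0: throttling disabled */

	now = ktime_get_ns();
	if (now - p->last_yield_ns < interval)
		return true;		/* still inside the interval */

	p->last_yield_ns = now;
	return false;
}

When the check returns true, sched_yield() would simply return 0 without
calling schedule(), as described above.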


Performance
===========

To test the impact on performance and contention, we used a benchmark
consisting of a process with a profiling timer enabled and N threads
sequentially assigned to logical cores, with 2 threads per core. Each
thread calls sched_yield() in a tight loop. We measured the total number
of unthrottled sched_yield() calls made by all threads within a fixed time.
In addition, we recorded the benchmark runs with
`perf record -a -g -e cycles`. From the perf data we determined the
percentage of CPU time spent in the problematic atomic add instruction and
used that as our measure of contention.
We ran the benchmark on an Intel Emerald Rapids CPU with 60 physical cores.
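
For reference, a minimal userspace reproduction of this setup might look
like the following (simplified sketch, not the exact harness we used; CPU
pinning and the unthrottled-call accounting are omitted):

#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

static atomic_long total_yields;
static volatile int stop;

static void *yield_loop(void *arg)
{
	long n = 0;

	while (!stop) {
		sched_yield();
		n++;
	}
	atomic_fetch_add(&total_yields, n);
	return NULL;
}

int main(int argc, char **argv)
{
	int nthreads = argc > 1 ? atoi(argv[1]) : 80;
	struct itimerval it = {
		.it_interval = { .tv_usec = 10000 },
		.it_value    = { .tv_usec = 10000 },
	};
	pthread_t *tids = calloc(nthreads, sizeof(*tids));

	/* Enable the process-wide profiling timer so that update_curr()
	 * reaches account_group_exec_runtime() for every thread. */
	signal(SIGPROF, SIG_IGN);
	setitimer(ITIMER_PROF, &it, NULL);

	for (int i = 0; i < nthreads; i++)
		pthread_create(&tids[i], NULL, yield_loop, NULL);

	sleep(10);
	stop = 1;
	for (int i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);

	printf("total sched_yield() calls: %ld\n", atomic_load(&total_yields));
	free(tids);
	return 0;
}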

With throttling disabled, there was no measurable performance impact on
sched_yield(). Setting the interval to 1ns, which exercises the throttling
code but doesn't actually throttle any calls to sched_yield(), results in a
1-3% penalty for sched_yield() at low thread counts; the penalty quickly
disappears as the thread count grows and contention becomes the dominant
factor.

With throttling disabled, CPU time spent in the atomic add instruction
for N=80 threads is roughly 80%. Setting yield_interval_ns to 10000 (10us)
reduces that to 1-2%, but the total number of unthrottled sched_yield()
calls also decreases by ~60%.


Alternatives considered
=======================

An alternative we considered was to make the cputime accounting more
scalable by accumulating a thread's cputime locally in task_struct and
flushing it to the process-wide cputime when it reaches some threshold
value or when the thread is taken off the CPU. However, we determined that
the implementation was too intrusive relative to the benefit it provided.
It also wouldn't address other potential points of contention on the
sched_yield() path.
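
For reference, a rough sketch of what that batching could look like (the
field name and threshold below are made up for illustration; this is not
the implementation we prototyped):

static void account_group_exec_runtime_batched(struct task_struct *tsk, u64 ns)
{
	/* pending_group_runtime would be a new, hypothetical task_struct field */
	tsk->pending_group_runtime += ns;

	if (tsk->pending_group_runtime < NSEC_PER_MSEC)
		return;

	atomic64_add(tsk->pending_group_runtime,
		     &tsk->signal->cputimer.cputime_atomic.sum_exec_runtime);
	tsk->pending_group_runtime = 0;
}

Any remaining pending runtime would also need to be flushed when the thread
is taken off the CPU.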


Kuba Piecuch (3):
sched: add bool return value to sched_class::yield_task()
sched/fair: don't schedule() in yield if nr_running == 1
sched/fair: add debugfs knob for yield throttling

include/linux/sched.h | 2 ++
kernel/sched/core.c | 1 +
kernel/sched/deadline.c | 4 +++-
kernel/sched/debug.c | 2 ++
kernel/sched/ext.c | 4 +++-
kernel/sched/fair.c | 35 +++++++++++++++++++++++++++++++++--
kernel/sched/rt.c | 3 ++-
kernel/sched/sched.h | 4 +++-
kernel/sched/stop_task.c | 2 +-
kernel/sched/syscalls.c | 9 ++++++++-
10 files changed, 58 insertions(+), 8 deletions(-)

--
2.51.0.rc0.155.g4a0f42376b-goog