Re: [PATCH v3] eventpoll: Fix priority inversion problem

From: K Prateek Nayak
Date: Mon Jun 30 2025 - 11:16:25 EST


Hello Nam,

On 5/27/2025 2:38 PM, Nam Cao wrote:
The ready event list of an epoll object is protected by a read-write
semaphore:

- The consumer (waiter) acquires the write lock and takes items.
- The producer (waker) takes the read lock and adds items.

The point of this design is to let epoll scale well with a large number
of producers, as multiple producers can hold the read lock at the same
time.

Unfortunately, this implementation may cause a scheduling priority
inversion problem. Suppose the consumer has a higher scheduling priority
than the producer. The consumer needs to acquire the write lock, but may be
blocked by the producer holding the read lock. Since a read-write semaphore
does not support priority-boosting for the readers (even with
CONFIG_PREEMPT_RT=y), we have a case of priority inversion: a higher
priority consumer is blocked by a lower priority producer. This problem was
reported in [1].

Furthermore, this could also cause a stall problem, as described in [2].

To fix this problem, make the event list half-lockless:

- The consumer acquires a mutex (ep->mtx) and takes items.
- The producer locklessly adds items to the list.
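
As a rough illustration of the half-lockless idea above, here is a minimal
userspace sketch using C11 atomics. It is not the patch's code: struct node,
struct lockless_list, list_push() and list_take_all() are hypothetical names,
loosely modeled on a lockless singly linked list. Producers push with a
compare-and-swap loop, and the consumer detaches everything with a single
atomic exchange.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's types. */
struct node {
    struct node *next;
    int value;
};

struct lockless_list {
    _Atomic(struct node *) head;
};

/*
 * Producer side: push one node with a compare-and-swap loop, taking no lock.
 * Returns 1 if the list was empty before the push, 0 otherwise.
 */
static int list_push(struct lockless_list *l, struct node *n)
{
    struct node *first = atomic_load_explicit(&l->head, memory_order_relaxed);

    do {
        /* On CAS failure, 'first' is reloaded with the current head. */
        n->next = first;
    } while (!atomic_compare_exchange_weak_explicit(&l->head, &first, n,
                                                    memory_order_release,
                                                    memory_order_relaxed));
    return first == NULL;
}

/*
 * Consumer side: detach the whole list in one atomic exchange.
 * Assumption: the caller serializes consumers (for example with a mutex),
 * which this sketch leaves out. Nodes come back newest-first.
 */
static struct node *list_take_all(struct lockless_list *l)
{
    return atomic_exchange_explicit(&l->head, (struct node *)NULL,
                                    memory_order_acquire);
}
```

With this shape a producer never waits for the consumer, so a low-priority
producer cannot hold up a high-priority consumer by holding a lock.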

Performance is not the main goal of this patch, but as the producer can now
add items without waiting for the consumer to release the lock, a
performance improvement is observed using the stress test from
https://github.com/rouming/test-tools/blob/master/stress-epoll.c. This is
the same test that justified using a read-write semaphore in the past.

Testing using 12 x86_64 CPUs:

            Before      After    Diff
threads  events/ms  events/ms
      8       6932      19753   +185%
     16       7820      27923   +257%
     32       7648      35164   +360%
     64       9677      37780   +290%
    128      11166      38174   +242%

Testing using 1 riscv64 CPU (averaged over 10 runs, as the numbers are
noisy):

            Before      After    Diff
threads  events/ms  events/ms
      1         73        129    +77%
      2        151        216    +43%
      4        216        364    +69%
      8        234        382    +63%
     16        251        392    +56%


I gave this patch a spin on top of tip:sched/core (PREEMPT_RT) with
Jan's reproducer from
https://lore.kernel.org/all/7483d3ae-5846-4067-b9f7-390a614ba408@xxxxxxxxxxx/.

On tip:sched/core, I see a hang a few seconds into the run and an RCU
stall about a minute later when I pin epoll-stall and epoll-stall-writer
on the same CPU as the bandwidth timer on a 2-vCPU VM. (I'm using a printk
to log the CPU where the timer was started in pinned mode.)

With this series, I haven't seen any stalls yet over multiple short
runs (~10 min) and even a longer run (~3 hrs).

Feel free to include:

Tested-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>

Reported-by: Frederic Weisbecker <frederic@xxxxxxxxxx>
Closes: https://lore.kernel.org/linux-rt-users/20210825132754.GA895675@lothringen/ [1]
Reported-by: Valentin Schneider <vschneid@xxxxxxxxxx>
Closes: https://lore.kernel.org/linux-rt-users/xhsmhttqvnall.mognet@xxxxxxxxxxxxxxxxxxx/ [2]
Signed-off-by: Nam Cao <namcao@xxxxxxxxxxxxx>
---
v3:
- get rid of the "link_used" and "ready" flags. They are hard to
understand and unnecessary
- get rid of the obsolete lockdep_assert_irqs_enabled()
- Add lockdep_assert_held(&ep->mtx)
- rewrite some comments
v2:
- rename link_locked -> link_used
- replace xchg() with smp_store_release() when applicable
- make sure llist_node is in clean state when not on a list
- remove now-unused list_add_tail_lockless()

--
Thanks and Regards,
Prateek