[PATCH v10 0/7] Preparatory changes for Proxy Execution v10

From: John Stultz
Date: Tue May 07 2024 - 00:55:10 EST


As mentioned a few times previously[1], after earlier
submissions of the Proxy Execution series didn’t get much in the
way of feedback, it was noted that the patch series was getting
a bit unwieldy to review. Qais suggested I break out just the
cleanups/preparatory components of the patch series and submit
them on their own in the hope we can start to merge the less
complex bits and discussion can focus on the more complicated
portions afterwards. This so far has not been very successful,
with the submission & RESEND of the v8 & v9 preparatory changes
not getting all that much in the way of review or feedback.

For v10 of this series, I’m again only submitting those early
cleanup/preparatory changes here. However, please let me know if
there is any way to make reviewing the series easier to move
this forward.

In the meantime, I’ve continued to put effort into the full
series, mostly focused on polishing the series for correctness.

Unfortunately, one issue I found ended up taking a while to
determine that it was actually a problem in mainline (the RT_PUSH_IPI
feature broke the RT scheduling invariant - after disabling it
I don’t see problems with mainline or with proxy-exec). But going
through the analysis process was helpful, and I’ve made some
tweaks to Metin’s patch for trace events to make it easier to
follow the proxy behavior using ftrace & perfetto. Doing
this also helped find a case where, when we were proxy-migrating
current, we first scheduled idle but didn’t preserve the
needs_resched flag, needlessly delaying things.

If you are interested, the full v10 series can be found here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v10-6.9-rc7
https://github.com/johnstultz-work/linux-dev.git proxy-exec-v10-6.9-rc7


New in v10 (in the preparatory patches submitted here)
---------
* Moved preempt_enable lower, close to the unlock, as
suggested by Valentin

* Added additional preempt_disable coverage around the wake_q
calls as again noted by Valentin

* Handle null lock ptr in __mutex_owner, to simplify later code,
as suggested by Metin Kaya

* Changed do_push_task to move_queued_task_locked as suggested
by Valentin

* Use rq_selected in push_rt_task & get_push_task

* Added Reviewed-by tags
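
For illustration only, the null-lock handling in __mutex_owner can be
modeled in userspace roughly as below. This is a sketch, not the kernel
code: the model_* types and helpers and MODEL_MUTEX_FLAGS are
hypothetical names, and it assumes the owner word holds a task pointer
with flag bits packed into its low bits.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins; these are not the kernel definitions. */
struct model_task {
	_Alignas(8) int pid;	/* 8-byte alignment keeps the low 3 pointer bits free */
};

struct model_mutex {
	atomic_uintptr_t owner;	/* task pointer with flag bits in the low bits */
};

/* Assumed flag mask for this model: the low three bits of the owner word. */
#define MODEL_MUTEX_FLAGS ((uintptr_t)0x07)

/* Return the owning task, or NULL if the lock is unowned or @lock is NULL. */
static inline struct model_task *model_mutex_owner(struct model_mutex *lock)
{
	if (!lock)
		return NULL;
	return (struct model_task *)(atomic_load(&lock->owner) & ~MODEL_MUTEX_FLAGS);
}

/* Record @task as the owner, keeping @flags in the low bits. */
static inline void model_mutex_set_owner(struct model_mutex *lock,
					 struct model_task *task,
					 uintptr_t flags)
{
	/* The task pointer must not overlap the flag bits. */
	assert(((uintptr_t)task & MODEL_MUTEX_FLAGS) == 0);
	atomic_store(&lock->owner, (uintptr_t)task | (flags & MODEL_MUTEX_FLAGS));
}
```

With the check inside the helper, a caller does not need to test for a
NULL lock itself before asking for the owner, which is presumably the
kind of simplification to later code that the changelog entry refers to.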

New in v10 (in the rest of the series)
---------
* Tweak so that if find_proxy_task returns idle, we should
always preserve needs_resched

* Drop WARN_ON(task_is_blocked(p)) in ttwu current case

* Add more details to the traceevents (owner task for proxy
migrations, and prev, selected and next for task selection)
so it’s easier to understand the proxy behavior.

* Simplify logic to task_queued_on_rq as suggested by Metin

* Rework from do_push_task usage to move_queued_task_locked

* Further cleanups suggested by Metin


Performance:
---------
K Prateek Nayak provided some feedback on the full v8 series
here[2]. Given the potential extra overhead of doing rq
migrations/return migrations/etc for the proxy case, it’s not
completely surprising a few of K Prateek’s test cases saw ~3-5%
regressions, but I’m hoping to look into this soon to see if we
can reduce those further.


Issues still to address:
---------
* The chain migration functionality needs further iterations and
better validation to ensure it truly maintains the RT/DL load
balancing invariants.

* CFS load balancing. There was concern that blocked tasks may
carry forward load (PELT) to the lock owner's CPU, so the CPU
may look like it is overloaded. Needs investigation.

* The sleeping owner handling (where we deactivate waiting tasks
and enqueue them onto a list, then reactivate them when the
owner wakes up) doesn’t feel great. This is in part because
when we want to activate tasks, we’re already holding a
task.pi_lock and a rq_lock, just not the locks for the task
we’re activating, nor the rq we’re enqueuing it onto. So there
has to be a bit of lock juggling to drop and acquire the right
locks (in the right order). It feels like there’s got to be a
better way. Also needs some rework to get rid of the
recursion.
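
To make the lock juggling above concrete, here is one common pattern
for taking a second lock while already holding one, sketched in
userspace with pthreads. It is not necessarily what the series does:
model_lock_second is a hypothetical helper, and ordering locks by
address is an assumption made purely for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: the caller holds @held and also needs @wanted.
 * Locks are ordered by address (an assumption of this sketch).
 *
 * Returns true if @held had to be dropped and retaken, in which case
 * the caller must revalidate any state it read under @held.
 */
static bool model_lock_second(pthread_mutex_t *held, pthread_mutex_t *wanted)
{
	assert(held && wanted);

	/* Same lock: nothing more to take. */
	if (held == wanted)
		return false;

	/* Already in order: just take the second lock. */
	if ((uintptr_t)held < (uintptr_t)wanted) {
		pthread_mutex_lock(wanted);
		return false;
	}

	/* Wrong order: drop the held lock, then retake both in order. */
	pthread_mutex_unlock(held);
	pthread_mutex_lock(wanted);
	pthread_mutex_lock(held);
	return true;
}
```

The awkward part is the true return: once the first lock has been
dropped, whatever was checked under it may no longer hold, so the
caller has to recheck before going on.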


Credit/Disclaimer:
---------
As mentioned previously, this Proxy Execution series has a long
history:

It was first described in a paper[3] by Watkins, Straub, and Niehaus,
then came patches from Peter Zijlstra, extended with lots of work by
Juri Lelli, Valentin Schneider, and Connor O'Brien. (And thank
you to Steven Rostedt for providing additional details here!)

So again, many thanks to those above, as all the credit for this
series really is due to them - while the mistakes are likely
mine.

Thanks so much!
-john

[1] https://lore.kernel.org/lkml/20240401234439.834544-1-jstultz@xxxxxxxxxx/
[2] https://lore.kernel.org/lkml/c26251d2-e1bf-e5c7-0636-12ad886e1ea8@xxxxxxx/
[3] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf

Cc: Joel Fernandes <joelaf@xxxxxxxxxx>
Cc: Qais Yousef <qyousef@xxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
Cc: Valentin Schneider <vschneid@xxxxxxxxxx>
Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
Cc: Ben Segall <bsegall@xxxxxxxxxx>
Cc: Zimuzo Ezeozue <zezeozue@xxxxxxxxxx>
Cc: Youssef Esmat <youssefesmat@xxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
Cc: Will Deacon <will@xxxxxxxxxx>
Cc: Waiman Long <longman@xxxxxxxxxx>
Cc: Boqun Feng <boqun.feng@xxxxxxxxx>
Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxx>
Cc: Metin Kaya <Metin.Kaya@xxxxxxx>
Cc: Xuewen Yan <xuewen.yan94@xxxxxxxxx>
Cc: K Prateek Nayak <kprateek.nayak@xxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: kernel-team@xxxxxxxxxxx


Connor O'Brien (2):
  sched: Add move_queued_task_locked helper
  sched: Consolidate pick_*_task to task_is_pushable helper

John Stultz (1):
  sched: Split out __schedule() deactivate task logic into a helper

Juri Lelli (2):
  locking/mutex: Make mutex::wait_lock irq safe
  locking/mutex: Expose __mutex_owner()

Peter Zijlstra (2):
  locking/mutex: Remove wakeups from under mutex::wait_lock
  sched: Split scheduler and execution contexts

 kernel/locking/mutex.c       |  60 +++++++----------
 kernel/locking/mutex.h       |  27 ++++++++
 kernel/locking/rtmutex.c     |  30 ++++++---
 kernel/locking/rwbase_rt.c   |   8 ++-
 kernel/locking/rwsem.c       |   4 +-
 kernel/locking/spinlock_rt.c |   3 +-
 kernel/locking/ww_mutex.h    |  49 ++++++++------
 kernel/sched/core.c          | 122 +++++++++++++++++++++--------------
 kernel/sched/deadline.c      |  53 ++++++---------
 kernel/sched/fair.c          |  18 +++---
 kernel/sched/rt.c            |  61 +++++++-----------
 kernel/sched/sched.h         |  48 +++++++++++++-
 12 files changed, 282 insertions(+), 201 deletions(-)

--
2.45.0.rc1.225.g2a3ae87e7f-goog