Re: [PATCH] dl_server: Reset DL server params when rd changes
From: Waiman Long
Date: Thu Nov 07 2024 - 23:40:44 EST
On 11/6/24 1:05 PM, Waiman Long wrote:
On 11/6/24 11:08 AM, Juri Lelli wrote:
On 04/11/24 17:41, Joel Fernandes wrote:
On Mon, Nov 04, 2024 at 11:54:36AM +0100, Juri Lelli wrote:
...
I added a printk in __dl_server_attach_root, which is called after the
dynamic rd is built to transfer bandwidth to it.
__dl_server_attach_root came with d741f297bceaf ("sched/fair: Fair
server interface"); do you have this change in your backport?
You nailed it! Our 5.15 backport appears to be slightly older and is
missing this from topology.c, as you mentioned. Thanks for clarifying!
/*
 * Because the rq is not a task, dl_add_task_root_domain() did not
 * move the fair server bw to the rd if it already started.
 * Add it now.
 */
if (rq->fair_server.dl_server)
	__dl_server_attach_root(&rq->fair_server, rq);
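For reference, here is a rough sketch of what __dl_server_attach_root()
does (not the exact upstream code; helpers like dl_bw_of(), dl_bw_cpus()
and __dl_add() are recalled from memory): it simply accounts the fair
server's bandwidth in the dl_bw of the root domain the rq has just been
attached to.

/* Rough sketch, not the upstream implementation. */
static void dl_server_attach_root_sketch(struct sched_dl_entity *dl_se,
					 struct rq *rq)
{
	struct dl_bw *dl_b = dl_bw_of(cpu_of(rq));
	int cpus = dl_bw_cpus(cpu_of(rq));

	raw_spin_lock(&dl_b->lock);
	if (cpus)
		/* Add the server's bandwidth to the new rd's accounting. */
		__dl_add(dl_b, dl_se->dl_bw, cpus);
	raw_spin_unlock(&dl_b->lock);
}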
So if rd changes during boot initialization, the correct dl_bw has to be
updated AFAICS. Also, if cpusets are used, the rd for a CPU may change.
cpuset changes are something that I still need to double check. Will do.
Sounds good, that would be good to verify.
So, I played a little bit with it and came up with a simple set of ops
that points out an issue (default Fedora Server install):
echo Y >/sys/kernel/debug/sched/verbose
echo +cpuset >/sys/fs/cgroup/cgroup.subtree_control
echo 0-7 > /sys/fs/cgroup/user.slice/cpuset.cpus
echo 6-7 > /sys/fs/cgroup/user.slice/cpuset.cpus.exclusive
echo root >/sys/fs/cgroup/user.slice/cpuset.cpus.partition
The domains are rebuilt correctly, but we end up with a null total_bw.
The conditional call above correctly takes care of adding back the
per-rq dl_server bandwidth when we pass from the single domain to the
two exclusive ones, but I noticed that we go through
partition_sched_domains_locked() twice for a single write of 'root',
and the second pass, since it's not actually destroying/rebuilding
anything, resets total_bw without adding the dl_server contribution
back.
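To make the failure mode concrete, a rough sketch (from memory, not the
exact upstream code) of what happens on that second pass when the old
and new domains match: the root domain is kept but its dl_bw is
cleared, and only real DL tasks are re-accounted afterwards, so the
per-rq fair server bandwidth is lost.

static void matched_domain_sketch(struct root_domain *rd)
{
	unsigned long flags;

	/* dl_clear_root_domain()-style reset of the kept rd. */
	raw_spin_lock_irqsave(&rd->dl_bw.lock, flags);
	rd->dl_bw.total_bw = 0;
	raw_spin_unlock_irqrestore(&rd->dl_bw.lock, flags);

	/*
	 * DL tasks get re-accounted via dl_add_task_root_domain(),
	 * but nothing adds rq->fair_server's bandwidth back here,
	 * so total_bw stays null.
	 */
}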
Now, I'm not completely sure why we need to go through
partition_sched_domains_locked() twice, as we have (it also looks like
a pattern from other call paths):

update_prstate()
  -> update_cpumasks_hier()
       -> rebuild_sched_domains_locked()   <- right at the end
  -> update_partition_sd_lb()
       -> rebuild_sched_domains_locked()   <- right after the above call
Removing the first call does indeed fix the issue and the domains look
OK, but I'm pretty sure I'm missing all sorts of details and corner
cases. Waiman (now Cc-ed), maybe you can help here with understanding
why the two back-to-back calls are needed?
Thanks for letting me know about this case.
I am aware that rebuild_sched_domains_locked() can be called more than
once. I have addressed the hotplug case, but it can happen in some
other corner cases as well. The problem with multiple
rebuild_sched_domains_locked() calls is that the intermediate ones may
run while the internal state is not yet consistent. I am going to work
on a fix for this by making sure that rebuild_sched_domains_locked()
is only called once.
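Roughly, the idea would be something along these lines (purely
illustrative; the flag and helper names below are made up): the
intermediate cpuset updates only mark that a rebuild is needed, and
the actual rebuild_sched_domains_locked() call happens once at the
end of the operation, when the cpuset state is consistent.

static bool sd_rebuild_pending;		/* hypothetical flag */

static void mark_sched_domains_dirty(void)
{
	sd_rebuild_pending = true;	/* defer, do not rebuild yet */
}

static void flush_sched_domains_rebuild(void)
{
	if (sd_rebuild_pending) {
		sd_rebuild_pending = false;
		rebuild_sched_domains_locked();	/* single rebuild */
	}
}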
I am working on a set of cpuset patches to eliminate redundant
rebuild_sched_domains_locked() calls. However, my cpuset test script
fails after the change due to the presence of test cases where the only
CPU in a 1-cpu partition is being offlined. So I sent out a
sched/deadline patch [1] to work around this particular corner case.
[1]
https://lore.kernel.org/lkml/20241108042924.520458-1-longman@xxxxxxxxxx/T/#u
Apparently, the null total_bw bug caused by multiple
rebuild_sched_domains_locked() calls masks this problem.
Anyway, I should be able to post the cpuset patch series next week after
further testing. Please review my sched/deadline patch to see if you are
OK with this minor change.
Cheers,
Longman