Re: [PATCH] sched/rt: Don't reschedule a throttled task even if it's higher priority

From: Doug Anderson
Date: Tue Nov 30 2021 - 11:30:29 EST


Hi,

On Mon, Nov 15, 2021 at 5:03 PM Douglas Anderson <dianders@xxxxxxxxxxxx> wrote:
>
> While testing RT_GROUP_SCHED, I found that my system would go bonkers
> if my test RT tasks ever got throttled (even if my test RT tasks were
> set to only get a tiny slice of CPU time). Specifically I found that
> whenever my test RT tasks were throttled that all other RT tasks in
> the system were being starved (!!). Several important RT tasks in the
> kernel were suddenly getting almost no timeslices and my system became
> unusable.
>
> After some experimentation, I determined that this behavior only
> happened when I gave my test RT tasks a high priority. If I gave my
> test RT tasks a low priority then they were throttled as expected and
> nothing was starved.
>
> I managed to come up with a test case that hopefully anyone can run to
> demonstrate the problem. The test case uses shell commands and python
> but certainly you could reproduce in other ways:
>
> echo "Allow 20 ms more of RT at system and top cgroup"
> old_rt=$(cat /proc/sys/kernel/sched_rt_runtime_us)
> echo $((old_rt + 20000)) > /proc/sys/kernel/sched_rt_runtime_us
> old_rt=$(cat /sys/fs/cgroup/cpu/cpu.rt_runtime_us)
> echo $((old_rt + 20000)) > /sys/fs/cgroup/cpu/cpu.rt_runtime_us
>
> echo "Give 10 ms each to spinny and printy groups"
> mkdir /sys/fs/cgroup/cpu/spinny
> echo 10000 > /sys/fs/cgroup/cpu/spinny/cpu.rt_runtime_us
> mkdir /sys/fs/cgroup/cpu/printy
> echo 10000 > /sys/fs/cgroup/cpu/printy/cpu.rt_runtime_us
>
> echo "Fork off a printy thing to be a nice RT citizen"
> echo "Prints once per second. Priority only 1."
> python -c "import time;
> last_time = time.time()
> while True:
> time.sleep(1)
> now_time = time.time()
> print('Time fies %f' % (now_time - last_time))
> last_time = now_time" &
> pid=$!
> echo "Give python a few seconds to get started"
> sleep 3
> echo $pid >> /sys/fs/cgroup/cpu/printy/tasks
> chrt -p -f 1 $pid
>
> echo "Sleep to observe that everything is peachy"
> sleep 3
>
> echo "Fork off a bunch of evil spinny things"
> echo "Chews CPU time. Priority 99."
> for i in $(seq 13); do
> python -c "while True: pass"&
> pid=$!
> echo $pid >> /sys/fs/cgroup/cpu/spinny/tasks
> chrt -p -f 99 $pid
> done
>
> echo "Huh? Almost no more prints?"
>
> I believe that the problem is an "if" test that's been in
> push_rt_task() forever where we will just reschedule the current task
> if it's higher priority than the next one. If I just remove that
> special case then everything works for me. I tried making it
> conditional on just `!rq->rt.rt_throttled` but for whatever reason
> that wasn't enough. The `if` test looks like an unlikely special case
> optimization and it seems like things ought to be fine without it.
>
> Signed-off-by: Douglas Anderson <dianders@xxxxxxxxxxxx>
> ---
> I know less than zero about the scheduler (so if I told you something,
> it's better than 50% chance the the opposite is true!). Here I'm
> asserting that we totally don't need this special case and the system
> will be fine without it, but I actually don't have any data to back
> that up. If nothing else, hopefully my test case in the commit message
> would let someone else reproduce and see what I'm talking about and
> can come up with a better fix.
>
> kernel/sched/rt.c | 10 ----------
> 1 file changed, 10 deletions(-)

I know this isn't crazy urgent and I'm happy to sit and twiddle my
thumbs a bit longer if everyone is still sleepy from tryptophan, but
I'm curious if anyone had a chance to look at this. Can anyone confirm
that my script reproduces for them on something other than my system?
Does my patch seem sane?

Thanks!

-Doug