pick_next_task() picking the wrong task [v4.9.163]

From: Radu Rendec
Date: Fri Mar 22 2019 - 17:58:15 EST


Hi Everyone,

I believe I'm seeing a weird behavior of pick_next_task() where it
chooses a lower priority task over a higher priority one. The scheduling
class of the two tasks is also different ('fair' vs. 'rt'). The culprit
seems to be the optimization at the beginning of the function, where
fair_sched_class.pick_next_task() is called directly. I'm running
v4.9.163, but that piece of code is very similar in recent kernels.

My use case is quite simple: I have a real-time thread that is woken up
by a GPIO hardware interrupt. The thread sleeps most of the time in
poll(), waiting for gpio_sysfs_irq() to wake it. The latency between the
interrupt and the thread being woken up/scheduled is very important for
the application. Note that I backported my own commit 03c0a9208bb1, so
the thread is always woken up synchronously from HW interrupt context.
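
For reference, the user-space side of this setup can be sketched roughly as follows. This is a minimal sketch, assuming the legacy sysfs GPIO interface, where an edge on an edge-enabled "value" file is reported to poll() as POLLPRI | POLLERR. The helper name wait_for_gpio_edge() is hypothetical, and the file descriptor is assumed to be already open on the "value" file.

```c
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: block until the sysfs GPIO "value" file behind fd
 * signals an edge, or until timeout_ms expires (-1 waits forever).
 * Returns 0 on timeout, -1 on error, otherwise the result of read().
 */
static ssize_t wait_for_gpio_edge(int fd, int timeout_ms, char *buf, size_t len)
{
    struct pollfd pfd = { .fd = fd, .events = POLLPRI | POLLERR };
    int ret;

    /* Restart the wait if a signal interrupts poll(). */
    do {
        ret = poll(&pfd, 1, timeout_ms);
    } while (ret < 0 && errno == EINTR);

    if (ret <= 0)
        return ret; /* 0: timed out, -1: poll() failed */

    /* A sysfs attribute has to be re-read from offset 0 after each event. */
    if (lseek(fd, 0, SEEK_SET) < 0)
        return -1;

    return read(fd, buf, len);
}
```

The latency of interest is the time between the interrupt firing and poll() returning in this thread.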

Most of the time things work as expected, but sometimes the scheduler
picks kworker and even the idle task before my real-time thread. I used
the trace infrastructure to figure out what happens and I'm including a
snippet below (I apologize for the wide lines).

<idle>-0 [000] d.h2 161.202970: gpio_sysfs_irq <-__handle_irq_event_percpu
<idle>-0 [000] d.h2 161.202981: kernfs_notify <-gpio_sysfs_irq
<idle>-0 [000] d.h4 161.202998: sched_waking: comm=irqWorker pid=1141 prio=9 target_cpu=000
<idle>-0 [000] d.h5 161.203025: sched_wakeup: comm=irqWorker pid=1141 prio=9 target_cpu=000
<idle>-0 [000] d.h3 161.203047: workqueue_queue_work: work struct=806506b8 function=kernfs_notify_workfn workqueue=8f5dae60 req_cpu=1 cpu=0
<idle>-0 [000] d.h3 161.203049: workqueue_activate_work: work struct 806506b8
<idle>-0 [000] d.h4 161.203061: sched_waking: comm=kworker/0:1 pid=134 prio=120 target_cpu=000
<idle>-0 [000] d.h5 161.203083: sched_wakeup: comm=kworker/0:1 pid=134 prio=120 target_cpu=000
<idle>-0 [000] d..2 161.203201: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=134 next_prio=120
kworker/0:1-134 [000] .... 161.203222: workqueue_execute_start: work struct 806506b8: function kernfs_notify_workfn
kworker/0:1-134 [000] ...1 161.203286: schedule <-worker_thread
kworker/0:1-134 [000] d..2 161.203329: sched_switch: prev_comm=kworker/0:1 prev_pid=134 prev_prio=120 prev_state=S ==> next_comm=swapper next_pid=0 next_prio=120
<idle>-0 [000] .n.1 161.230287: schedule <-schedule_preempt_disabled
<idle>-0 [000] d..2 161.230310: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R+ ==> next_comm=irqWorker next_pid=1141 next_prio=9
irqWorker-1141 [000] d..3 161.230316: finish_task_switch <-schedule

The system is a Freescale MPC8378 (PowerPC, single processor).

I instrumented pick_next_task() with trace_printk() and I am sure that
every time the wrong task is picked, flow goes through the optimization
path and idle_sched_class.pick_next_task() is called directly. When the
right task is eventually picked, flow goes through the bottom block that
iterates over all scheduling classes. This probably makes sense: when
the scheduler runs in the context of the idle task, prev->sched_class is
no longer fair_sched_class, so the bottom block with the full iteration
is used. Note that in v4.9.163 the optimization path is taken only when
prev->sched_class is fair_sched_class, whereas in recent kernels it is
taken for both fair_sched_class and idle_sched_class.
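
The control flow described above can be modeled in user space roughly as follows. This is a simplified sketch of the v4.9 logic, not the kernel code: the toy_rq structure, the class enum and toy_pick_next_task() are all made-up names, and the RETRY_TASK handling is left out. One detail the model does keep is that the fast path in v4.9 also requires rq->nr_running == rq->cfs.h_nr_running, in addition to prev->sched_class being fair_sched_class.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the scheduling classes, highest priority first. */
enum sched_class_id { CLASS_RT, CLASS_FAIR, CLASS_IDLE, NR_CLASSES };

/* Toy run queue: only the counters that matter for the fast-path check. */
struct toy_rq {
    unsigned int nr_running;       /* all runnable tasks on this CPU */
    unsigned int cfs_h_nr_running; /* runnable tasks in the fair class */
    unsigned int rt_nr_running;    /* runnable tasks in the rt class */
};

struct pick_result {
    enum sched_class_id picked; /* class of the task that was picked */
    bool fast_path;             /* true if the optimization was taken */
};

static bool class_has_task(const struct toy_rq *rq, enum sched_class_id c)
{
    switch (c) {
    case CLASS_RT:
        return rq->rt_nr_running > 0;
    case CLASS_FAIR:
        return rq->cfs_h_nr_running > 0;
    default:
        return true; /* the idle class always has a runnable task */
    }
}

static struct pick_result toy_pick_next_task(const struct toy_rq *rq,
                                             enum sched_class_id prev_class)
{
    struct pick_result r = { CLASS_IDLE, false };

    /* Optimization: if everything runnable is fair, skip the class walk. */
    if (prev_class == CLASS_FAIR &&
        rq->nr_running == rq->cfs_h_nr_running) {
        r.fast_path = true;
        /* fair first; if it has nothing to run, fall straight to idle */
        r.picked = class_has_task(rq, CLASS_FAIR) ? CLASS_FAIR : CLASS_IDLE;
        return r;
    }

    /* Bottom block: walk all classes from highest to lowest priority. */
    for (int c = CLASS_RT; c < NR_CLASSES; c++) {
        if (class_has_task(rq, (enum sched_class_id)c)) {
            r.picked = (enum sched_class_id)c;
            return r;
        }
    }

    assert(!"unreachable: the idle class always has a task");
    return r;
}
```

In this model, a runnable rt task that is counted in nr_running makes the fast-path condition false, so the full iteration runs and picks it. The model cannot tell what the real counters looked like at the time of the trace; it only makes it easy to check which path a given run-queue state would take.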

Any help or feedback would be much appreciated. In the meantime, I will
experiment with commenting out the optimization (at the expense of a
slower scheduler, of course).

Best regards,
Radu Rendec