Latency spikes from within the scheduler.

From: Sebastian Andrzej Siewior
Date: Fri Nov 04 2022 - 07:47:49 EST


Hi,

I've been playing with v6.1-rc3-rt2 on a 256-CPU machine. Some of my
findings:
- It can take some time to grab the RQ lock, that is, either to get on
the CPU _or_ to leave it. For instance, a 1ms example:
| cyclictest-4619 [011] d...2.. 1128.288674: idle_cpu <-find_busiest_group
| cyclictest-4619 [011] d...2.. 1128.288674: idle_cpu <-find_busiest_group
| cyclictest-4619 [011] d...2.. 1128.288675: raw_spin_rq_lock_nested <-load_balance
| cyclictest-4619 [011] d...3.. 1128.288675: _raw_spin_lock <-raw_spin_rq_lock_nested
| cyclictest-4619 [011] d...4.. 1128.288676: queued_spin_lock_slowpath <-raw_spin_rq_lock_nested
| cyclictest-4619 [011] d...3.. 1128.289688: update_rq_clock <-load_balance
| cyclictest-4619 [011] d...3.. 1128.289688: can_migrate_task <-load_balance
| cyclictest-4619 [011] d...3.. 1128.289688: kthread_is_per_cpu <-can_migrate_task

https://breakpoint.cc/traces-zen3/trace-load_balance_rq_lock-11.txt.zst [77KiB]
https://breakpoint.cc/traces-zen3/trace-load_balance_rq_lock.txt.zst [34MiB]
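Not kernel code, but the queueing effect is easy to model in userspace.
Here is a minimal sketch (plain pthread spinlock, made-up thread count
and hold time, none of the kernel's queued-spinlock details) of how the
worst-case wait for one contended lock grows with the number of waiters,
which is what load_balance() runs into here:

#include <inttypes.h>
#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define NTHREADS 16     /* raise towards the CPU count to see it grow */
#define ITERS    1000

static pthread_spinlock_t lock;
static uint64_t worst[NTHREADS];

static uint64_t now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

static void *worker(void *arg)
{
        long id = (long)arg;
        int i;

        for (i = 0; i < ITERS; i++) {
                uint64_t t0 = now_ns(), wait;

                pthread_spin_lock(&lock);
                wait = now_ns() - t0;
                if (wait > worst[id])
                        worst[id] = wait;
                /* stand-in for the work done under the RQ lock */
                for (volatile int j = 0; j < 200; j++)
                        ;
                pthread_spin_unlock(&lock);
        }
        return NULL;
}

int main(void)
{
        pthread_t th[NTHREADS];
        uint64_t max = 0;
        long i;

        pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
        for (i = 0; i < NTHREADS; i++)
                pthread_create(&th[i], NULL, worker, (void *)i);
        for (i = 0; i < NTHREADS; i++) {
                pthread_join(th[i], NULL);
                if (worst[i] > max)
                        max = worst[i];
        }
        printf("worst lock wait: %" PRIu64 " ns\n", max);
        return 0;
}

Build with gcc -O2 -pthread. The worst case is roughly (waiters - 1)
times the hold time, which is how a critical section of a few us can
turn into a 1ms stall once enough CPUs queue up on the same RQ lock.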

- A CPU can be attacked by rto_push_irq_work_func(). A few invocations
are probably okay, but doing it 86 times sums up to ~400us when it is
called from flush_smp_call_function_queue() (after waiting for the RQ
lock):
| <idle>-0 [032] d...2.. 4069.605763: irq_enter_rcu <-sysvec_apic_timer_interrupt
| <idle>-0 [032] dn.h6.. 4069.605929: queued_spin_lock_slowpath <-enqueue_task_rt

| <idle>-0 [032] d..h6.. 4069.605766: queued_spin_lock_slowpath <-try_to_wake_up
| <idle>-0 [032] dn.h5.. 4069.605923: update_rq_clock <-try_to_wake_up

| <idle>-0 [032] dn.h6.. 4069.605929: queued_spin_lock_slowpath <-enqueue_task_rt
| <idle>-0 [032] dn.h5.. 4069.606167: enqueue_top_rt_rq <-enqueue_task_rt
| <idle>-0 [032] dn.h4.. 4069.606169: sched_wakeup: comm=cyclictest pid=8045 prio=9 target_cpu=032

| <idle>-0 [032] dn.h2.. 4069.606171: irq_exit_rcu <-sysvec_apic_timer_interrupt
| <idle>-0 [032] .n..2.. 4069.606171: arch_cpu_idle_exit <-do_idle
| <idle>-0 [032] .n..2.. 4069.606171: flush_smp_call_function_queue <-do_idle
| <idle>-0 [032] dn..2.. 4069.606172: __flush_smp_call_function_queue <-flush_smp_call_function_queue
| <idle>-0 [032] dn..2.. 4069.606173: rto_push_irq_work_func <-irq_work_single

| <idle>-0 [032] dn.h2.. 4069.606578: irq_exit_rcu <-sysvec_irq_work
| <idle>-0 [032] .n..2.. 4069.606578: schedule_idle <-do_idle
| <idle>-0 [032] dn..2.. 4069.606579: rcu_note_context_switch <-__schedule

https://breakpoint.cc/traces-zen3/trace-rto_push_irq_work_func-32.txt.zst [102KiB]
https://breakpoint.cc/traces-zen3/trace-rto_push_irq_work_func.txt.zst [49MiB]
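To sanity-check the arithmetic: the ~5us per invocation below is only my
estimate from the deltas in this trace, but 86 invocations of a handler
of that size do land in the ~400us ballpark:

#include <stdio.h>
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* stand-in for one rto_push_irq_work_func() run: take the lock, look
 * for a pushable task, requeue the irq work -- modelled here as a
 * fixed ~5us of busy work */
static void handler_once(void)
{
        uint64_t t0 = now_ns();

        while (now_ns() - t0 < 5000)
                ;
}

int main(void)
{
        uint64_t t0 = now_ns();
        int i;

        for (i = 0; i < 86; i++)        /* 86 hits as seen in the trace */
                handler_once();
        printf("86 invocations: %.1f us\n", (now_ns() - t0) / 1000.0);
        return 0;
}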

- I also managed to run into sched_rt_period_timer():
| <idle>-0 [027] dn.h1.. 630.600139: hrtimer_expire_entry: hrtimer=00000000602b5d49 function=sched_rt_period_timer now=630600137263
| <idle>-0 [027] dn.h2.. 630.600139: sched_rt_period_timer <-__hrtimer_run_queues
| <idle>-0 [027] dn.h2.. 630.600139: _raw_spin_lock <-sched_rt_period_timer
| <idle>-0 [027] dn.h3.. 630.600140: queued_spin_lock_slowpath <-sched_rt_period_timer
| <idle>-0 [027] dn.h3.. 630.600326: ktime_get <-sched_rt_period_timer

| <idle>-0 [027] dn.h3.. 630.606825: hrtimer_forward <-sched_rt_period_timer
| <idle>-0 [027] dn.h2.. 630.606826: _raw_spin_lock_irq <-__hrtimer_run_queues
| <idle>-0 [027] dn.h3.. 630.606826: enqueue_hrtimer <-__hrtimer_run_queues
| <idle>-0 [027] dn.h2.. 630.606826: hrtimer_start: hrtimer=00000000602b5d49 function=sched_rt_period_timer expires=631600000000 softexpires=631600000000 mode=ABS

That is almost 7ms of runtime just for that one hrtimer, which merely
ensures that RT tasks do not run longer than they are allowed to.

https://breakpoint.cc/traces-zen3/trace-sched_rt_period_timer-27.txt.zst [101K]
https://breakpoint.cc/traces-zen3/trace-sched_rt_period_timer.txt.zst [48MiB]
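The shape of the problem: the period timer has to visit every runqueue
in the root domain and takes each rq's lock in turn. A userspace toy
model (spinlocks plus contender threads standing in for busy CPUs; NRQ,
the hold times and everything else here are made up for illustration)
shows how the walk time scales with the number of contended runqueues:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define NRQ 16  /* "runqueues"; the box above has 256, keep this <= nproc */

static pthread_spinlock_t rq_lock[NRQ];
static atomic_int stop;

static uint64_t now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* each contender hammers "its" rq lock, like a busy CPU would */
static void *contender(void *arg)
{
        pthread_spinlock_t *lock = arg;

        while (!atomic_load(&stop)) {
                pthread_spin_lock(lock);
                for (volatile int j = 0; j < 500; j++)
                        ;
                pthread_spin_unlock(lock);
        }
        return NULL;
}

int main(void)
{
        pthread_t th[NRQ];
        uint64_t t0;
        int i;

        for (i = 0; i < NRQ; i++) {
                pthread_spin_init(&rq_lock[i], PTHREAD_PROCESS_PRIVATE);
                pthread_create(&th[i], NULL, contender, &rq_lock[i]);
        }
        /* the "period timer": visit every runqueue, lock each one */
        t0 = now_ns();
        for (i = 0; i < NRQ; i++) {
                pthread_spin_lock(&rq_lock[i]);
                /* stand-in for refreshing that rt_rq's runtime budget */
                pthread_spin_unlock(&rq_lock[i]);
        }
        printf("walked %d contended locks in %.1f us\n",
               NRQ, (now_ns() - t0) / 1000.0);
        atomic_store(&stop, 1);
        for (i = 0; i < NRQ; i++)
                pthread_join(th[i], NULL);
        return 0;
}

Scaled up to 256 runqueues, with each acquisition costing tens of
microseconds under contention, ~7ms for a single expiry is not
surprising.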

The -XX trace contains just the one CPU in question, the other file is
the complete trace. I also have them as an all-in-one archive:
https://breakpoint.cc/traces-zen3/trace-all-in-one.tar.zst

And the dot-config:
https://breakpoint.cc/traces-zen3/dot-config.tar.zst

The first two traces were recorded with CONFIG-PERIODIC, the last one
with CONFIG-NO_HZ.

Sebastian