Re: BUG: rcu_sched detected stalls on CPUs

From: Paul E. McKenney
Date: Mon May 08 2017 - 11:02:05 EST


On Mon, May 08, 2017 at 06:24:23AM -0700, Paul E. McKenney wrote:
> On Mon, May 08, 2017 at 09:43:15AM +0300, mitko@xxxxxxxxxx wrote:
> > On 2017-01-20 17:19, Steven Rostedt wrote:

[ . . . ]

> > Hello,
> > yesterday I change to new kernel: 4.9.26 and still no effect.
> >
> > I trying to figure out what I need to buy because I read on google a
> > lots of posts about that problem,
> > some ppl suggest is BIOS firmware bug, some ppl tell that when they
> > swap CPU problem is resolved.
> > May be problem was started when I first time boot 4.x kernel and
> > there have cpu microcode updates.
> > Im 3.x kernels this feature was not in kernel, but I don't know,
> > only guess.
> >
> > Can some one point me clearly: You need to change this one and
> > problem will be solved ?
>
> Looking back at your earlier email, I see the following:
>
> [Wed Jan 4 10:19:12 2017] rcu_sched kthread starved for 60004 jiffies! g61967937 c61967936 f0x0 s3 ->state=0x1
>
> That indicates that the task "rcu_sched" isn't being allowed to run.
> The "->state=0x1" indicates that this task is currently asleep and the
> "s3" indicates that it is looking to start a grace period, but has not
> progressed through this task. My guess is that your system get very
> busy, and that rcu_sched's normal scheduling priority is not sufficient
> for it to be allowed to run.
>
> I therefore suggest that you try raising the rcu_sched task's priority.
>
> You can find this task like this:
>
> ps -ef | grep rcu_sched
>
> On my laptop, I get this:
>
> $ ps -ef | grep rcu_sched
> root 7 2 0 Apr11 ? 00:19:19 [rcu_sched]
> paulmck 18307 22926 0 06:11 pts/35 00:00:00 grep --color=auto rcu_sche
>
> You can use the chrt command to set the priority. For example, given that
> on my laptop, rcu_sched is PID 7:
>
> sudo chrt -f -p 1 7
>
> You can double-check as follows:
>
> $ chrt -p 7
> pid 7's current scheduling policy: SCHED_FIFO
> pid 7's current scheduling priority: 1
>
> Does that help?

And I should add that this assumes that the interference is from
within this same OS instance. If you are instead experiencing massive
interference due to other guest OSes on this same system, well, we are
working on it.

Thanx, Paul

> > [239940.067938] clocksource: timekeeping watchdog on CPU26: Marking
> > clocksource 'tsc' as unstable because the skew is too large:
> > [239940.067943] clocksource: 'hpet' wd_now:
> > ecb521ce wd_last: ca45912d mask: ffffffff
> > [239940.067946] clocksource: 'tsc' cs_now:
> > 2d12df5f88d08 cs_last: 29fffbe2d36d6 mask: ffffffffffffffff
> > [239940.068357] clocksource: Switched to clocksource hpet
> > [240000.066457] INFO: rcu_sched detected stalls on CPUs/tasks:
> > [240000.066488] 2-...: (38 GPs behind) idle=f1a/0/0
> > softirq=2417028/2417028 fqs=0
> > [240000.066491] 3-...: (27 GPs behind) idle=0bc/0/0
> > softirq=3076046/3076047 fqs=0
> > [240000.066494] 4-...: (1006 GPs behind) idle=308/0/0
> > softirq=1474922/1474922 fqs=0
> > [240000.066497] 5-...: (8034 GPs behind) idle=7b4/0/0
> > softirq=69165/69166 fqs=0
> > [240000.066499] 6-...: (1665 GPs behind) idle=716/0/0
> > softirq=16465/16465 fqs=0
> > [240000.066502] 7-...: (8034 GPs behind) idle=bd0/0/0
> > softirq=15923/15924 fqs=0
> > [240000.066505] 8-...: (1665 GPs behind) idle=a62/0/0
> > softirq=93861/93861 fqs=0
> > [240000.066507] 9-...: (1258 GPs behind) idle=a8c/0/0
> > softirq=21608/21608 fqs=0
> > [240000.066510] 11-...: (1665 GPs behind) idle=8e6/0/0
> > softirq=18844/18844 fqs=0
> > [240000.066512] 13-...: (1665 GPs behind) idle=ab8/0/0
> > softirq=21278/21278 fqs=0
> > [240000.066515] 14-...: (4854 GPs behind) idle=578/0/0
> > softirq=25624/25626 fqs=0
> > [240000.066517] 15-...: (8034 GPs behind) idle=f5e/0/0
> > softirq=18114/18115 fqs=0
> > [240000.066520] 16-...: (1665 GPs behind) idle=b66/0/0
> > softirq=19511/19511 fqs=0
> > [240000.066528] 17-...: (0 ticks this GP) idle=e18/0/0
> > softirq=16163/16163 fqs=0
> > [240000.066531] 18-...: (1665 GPs behind) idle=a18/0/0
> > softirq=16424/16424 fqs=0
> > [240000.066533] 19-...: (1665 GPs behind) idle=944/0/0
> > softirq=17920/17920 fqs=0
> > [240000.066549] 20-...: (20991 GPs behind) idle=750/0/0
> > softirq=17549/17552 fqs=0
> > [240000.066552] 21-...: (0 ticks this GP) idle=9e0/0/0
> > softirq=18937/18937 fqs=0
> > [240000.066555] 22-...: (14519 GPs behind) idle=95e/0/0
> > softirq=16122/16124 fqs=0
> > [240000.066557] 23-...: (1665 GPs behind) idle=bbc/0/0
> > softirq=16316/16316 fqs=0
> > [240000.066560] 24-...: (0 ticks this GP) idle=c56/0/0
> > softirq=20070/20070 fqs=0
> > [240000.066563] 27-...: (8034 GPs behind) idle=bdc/0/0
> > softirq=18291/18292 fqs=0
> > [240000.066566] 28-...: (8034 GPs behind) idle=4f8/0/0
> > softirq=22313/22314 fqs=0
> > [240000.066568] 29-...: (3443 GPs behind) idle=22a/0/0
> > softirq=24975/24976 fqs=0
> > [240000.066571] 30-...: (1665 GPs behind) idle=940/0/0
> > softirq=89382/89382 fqs=0
> > [240000.066572] (detected by 10, t=60002 jiffies, g=4843721,
> > c=4843720, q=659)
> > [240000.066578] Task dump for CPU 2:
> > [240000.066581] swapper/2 R running task 0 0
> > 1 0x00200000
> > [240000.066588] 0000000000000083 ffffffffa5cecd79 ffff8aca3eea1e00
> > ffffffffa64cab40
> > [240000.066593] 0000000000000005 ffffffffa64cad38 ffffa96c80073eb0
> > ffffffffa5ba9d93
> > [240000.066597] ffff8aca3eea1e00 000000000e94fec6 ffff8aca38535940
> > 0000000000000002
> > [240000.066601] Call Trace:
> > [240000.066617] [<ffffffffa5cecd79>] ? intel_idle+0x109/0x110
> > [240000.066625] [<ffffffffa5ba9d93>] ? cpuidle_enter_state+0x103/0x2b0
> > [240000.066629] [<ffffffffa5ba9f77>] ? cpuidle_enter+0x17/0x20
> > [240000.066638] [<ffffffffa50d5d73>] ? call_cpuidle+0x23/0x40
> > [240000.066641] [<ffffffffa50d5fd7>] ? cpu_startup_entry+0x147/0x220
> > [240000.066647] [<ffffffffa503ecfd>] ? start_secondary+0x12d/0x140
> > [240000.066650] Task dump for CPU 3:
> > [240000.066651] swapper/3 R running task 0 0
> > 1 0x00200000
> >
> > Regards,
> > Mitko
> >