Re: [syzbot] INFO: rcu detected stall in syscall_exit_to_user_mode

From: Thomas Gleixner
Date: Tue Sep 14 2021 - 10:58:18 EST


On Tue, Sep 14 2021 at 20:37, Hillf Danton wrote:

> On Mon, 13 Sep 2021 12:28:14 +0200 Thomas Gleixner wrote:
>>On Tue, Aug 31 2021 at 15:45, Hillf Danton wrote:
>>> On Mon, 30 Aug 2021 12:58:58 +0200 Dmitry Vyukov wrote:
>>>>> ieee80211_iterate_active_interfaces_atomic+0x70/0x180 net/mac80211/util.c:829
>>>>> mac80211_hwsim_beacon+0xd5/0x1a0 drivers/net/wireless/mac80211_hwsim.c:1861
>>>>> __run_hrtimer kernel/time/hrtimer.c:1537 [inline]
>>>>> __hrtimer_run_queues+0x609/0xe50 kernel/time/hrtimer.c:1601
>>>>> hrtimer_run_softirq+0x17b/0x360 kernel/time/hrtimer.c:1618
>>>>> __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
>>>
>>> Add debug info only to help kasan catch the timer running longer than 2 ticks.
>>>
>>> Is it anything in the right direction, tglx?
>>
>>Not really. As Dmitry pointed out this seems to be related to
>
> Thanks for taking a look.
>
>>mac80211_hwsim and if you look at the above stacktrace then how is
>>adding something to the timer wheel helpful?
>
> Given the stall was printed on CPU1 while the supposedly offending timer was
> expiring on CPU0, what was proposed is the lame debug info only for kasan to
> catch the timer red handed.
>
> It is more appreciated if the tglx dude would likely spend a couple of minutes
> giving us a lesson on the expertises needed for collecting evidence that any
> timer runs longer than two ticks. It helps beyond the extent of kasan.

That tglx dude already picked the relevant part of the stack trace (see
also above):

>>>>> ieee80211_iterate_active_interfaces_atomic+0x70/0x180 net/mac80211/util.c:829
>>>>> mac80211_hwsim_beacon+0xd5/0x1a0 drivers/net/wireless/mac80211_hwsim.c:1861
>>>>> __run_hrtimer kernel/time/hrtimer.c:1537 [inline]
>>>>> __hrtimer_run_queues+0x609/0xe50 kernel/time/hrtimer.c:1601
>>>>> hrtimer_run_softirq+0x17b/0x360 kernel/time/hrtimer.c:1618
>>>>> __do_softirq+0x29b/0x9c2 kernel/softirq.c:558

and then asked the question how a timer wheel timer runtime check
helps. He just omitted the appendix "if the timer in question is a
hrtimer" as he assumed that this is pretty obvious from the stack trace.

Aside of that if the wireless timer callback runs in an endless loop,
what is a runtime detection of that in the hrtimer softirq invocation
helping to decode the problem if the stall detector catches it when it
hangs there?

Now that mac80211 hrtimer callback might actually be not the real
problem. It's certainly containing a bunch of loops, but I couldn't find
an endless loop there during a cursory inspection.

But that callback does rearm the hrtimer and that made me look at
hrtimer_run_queues() which might be the reason for the endless loop as
it only terminates when there is no timer to expire anymore.

Now what happens when the mac80211 callback rearms the timer so it
expires immediately again:

hrtimer_forward(&data->beacon_timer, hrtimer_get_expires(timer),
ns_to_ktime(bcn_int * NSEC_PER_USEC));

bcn is a user space controlled value. Now lets assume that bcn_int is <=1,
which would certainly cause the loop in hrtimer_run_queues() to keeping
looping forever.

That should be easy to verify by implementing a simple test which
reschedules a hrtimer from the callback with a expiry time close to now.

Not today as I'm about to head home to fire up the pizza oven.

Thanks,

tglx