Re: rcu: WARNING in rcu_seq_end

From: Paul E. McKenney
Date: Mon Mar 06 2017 - 19:13:21 EST


On Mon, Mar 06, 2017 at 11:11:23AM +0100, Dmitry Vyukov wrote:
> On Mon, Mar 6, 2017 at 11:07 AM, Paul E. McKenney
> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> > On Mon, Mar 06, 2017 at 10:24:24AM +0100, Dmitry Vyukov wrote:
> >> On Sun, Mar 5, 2017 at 7:47 PM, Paul E. McKenney
> >> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> >> > On Sun, Mar 05, 2017 at 11:50:39AM +0100, Dmitry Vyukov wrote:
> >> >> On Sat, Mar 4, 2017 at 9:40 PM, Paul E. McKenney
> >> >> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> >> >> > On Sat, Mar 04, 2017 at 05:01:19PM +0100, Dmitry Vyukov wrote:
> >> >> >> Hello,
> >> >> >>
> >> >> >> Paul, you wanted bugs in rcu.
> >> >> >
> >> >> > Well, whether I want them or not, I must deal with them. ;-)
> >> >> >
> >> >> >> I've got this WARNING while running syzkaller fuzzer on
> >> >> >> 86292b33d4b79ee03e2f43ea0381ef85f077c760:
> >> >> >>
> >> >> >> ------------[ cut here ]------------
> >> >> >> WARNING: CPU: 0 PID: 4832 at kernel/rcu/tree.c:3533
> >> >> >> rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533
> >> >> >> Kernel panic - not syncing: panic_on_warn set ...
> >> >> >> CPU: 0 PID: 4832 Comm: kworker/0:3 Not tainted 4.10.0+ #276
> >> >> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> >> >> >> Workqueue: events wait_rcu_exp_gp
> >> >> >> Call Trace:
> >> >> >> __dump_stack lib/dump_stack.c:15 [inline]
> >> >> >> dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
> >> >> >> panic+0x1fb/0x412 kernel/panic.c:179
> >> >> >> __warn+0x1c4/0x1e0 kernel/panic.c:540
> >> >> >> warn_slowpath_null+0x2c/0x40 kernel/panic.c:583
> >> >> >> rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533
> >> >> >> rcu_exp_gp_seq_end kernel/rcu/tree_exp.h:36 [inline]
> >> >> >> rcu_exp_wait_wake+0x8a9/0x1330 kernel/rcu/tree_exp.h:517
> >> >> >> rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:559 [inline]
> >> >> >> wait_rcu_exp_gp+0x83/0xc0 kernel/rcu/tree_exp.h:570
> >> >> >> process_one_work+0xc06/0x1c20 kernel/workqueue.c:2096
> >> >> >> worker_thread+0x223/0x19c0 kernel/workqueue.c:2230
> >> >> >> kthread+0x326/0x3f0 kernel/kthread.c:227
> >> >> >> ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430
> >> >> >> Dumping ftrace buffer:
> >> >> >> (ftrace buffer empty)
> >> >> >> Kernel Offset: disabled
> >> >> >> Rebooting in 86400 seconds..
> >> >> >>
> >> >> >>
> >> >> >> Not reproducible. But looking at the code, shouldn't it be:
> >> >> >>
> >> >> >> static void rcu_seq_end(unsigned long *sp)
> >> >> >> {
> >> >> >> smp_mb(); /* Ensure update-side operation before counter increment. */
> >> >> >> + WARN_ON_ONCE(!(*sp & 0x1));
> >> >> >> WRITE_ONCE(*sp, *sp + 1);
> >> >> >> - WARN_ON_ONCE(*sp & 0x1);
> >> >> >> }
> >> >> >>
> >> >> >> ?
> >> >> >>
> >> >> >> Otherwise wait_event in _synchronize_rcu_expedited can return as soon
> >> >> >> as WRITE_ONCE(*sp, *sp + 1) finishes. As far as I understand this
> >> >> >> consequently can allow start of next grace periods. Which in turn can
> >> >> >> make the warning fire. Am I missing something?
> >> >> >>
> >> >> >> I don't see any other bad consequences of this. The rest of
> >> >> >> rcu_exp_wait_wake can proceed when _synchronize_rcu_expedited has
> >> >> >> returned and destroyed work on stack and next period has started and
> >> >> >> ended, but it seems OK.
> >> >> >
> >> >> > I believe that this is a heygood change, but I don't see how it will
> >> >> > help in this case. BTW, may I have your Signed-off-by?
> >> >> >
> >> >> > The reason I don't believe that it will help is that the
> >> >> > rcu_exp_gp_seq_end() function is called from a workqueue handler that
> >> >> > is invoked holding ->exp_mutex, and this mutex is not released until
> >> >> > after the handler invokes rcu_seq_end() and then wakes up the task that
> >> >> > scheduled the workqueue handler. So the ordering above should not matter
> >> >> > (but I agree that your ordering is cleaner.
> >> >> >
> >> >> > That said, it looks like I am missing some memory barriers, please
> >> >> > see the following patch.
> >> >> >
> >> >> > But what architecture did you see this on?
> >> >>
> >> >>
> >> >> This is just x86.
> >> >>
> >> >> You seem to assume that wait_event() waits for the wakeup. It does not
> >> >> work this way. It can return as soon as the condition becomes true
> >> >> without ever waiting:
> >> >>
> >> >> 305 #define wait_event(wq, condition) \
> >> >> 306 do { \
> >> >> 307 might_sleep(); \
> >> >> 308 if (condition) \
> >> >> 309 break; \
> >> >> 310 __wait_event(wq, condition); \
> >> >> 311 } while (0)
> >> >
> >> > Agreed, hence my patch in the previous email. I guess I knew that, but
> >>
> >> Ah, you meant to synchronize rcu_seq_end with rcu_seq_done?
> >
> > No, there is a mutex release and acquisition that do the synchronization,
> > but only if the wakeup has appropriate barriers. The issue is that
> > part of the mutex's critical section executes in a workqueue, possibly
> > on some other CPU.
>
> What is that mutex? And what locks/unlocks provide synchronization? I
> see that one uses exp_mutex and another -- exp_wake_mutex.

Both of them.

->exp_mutex is acquired by the task requesting the grace period, and
the counter's first increment is done by that task under that mutex.
This task then schedules a workqueue, which drives forward the grace
period. Upon grace-period completion, the workqueue handler does the
second increment (the one that your patch addressed). The workqueue
handler then acquires ->exp_wake_mutex and wakes the task that holds
->exp_mutex (along with all other tasks waiting for this grace period),
and that task releases ->exp_mutex, which allows the next grace period to
start (and the first increment for that next grace period to be carried
out under that lock). The workqueue handler releases ->exp_wake_mutex
after finishing its wakeups.

Does that make sense?

Thanx, Paul