Re: rcu: WARNING in rcu_seq_end

From: Dmitry Vyukov
Date: Mon Mar 06 2017 - 05:25:25 EST


On Sun, Mar 5, 2017 at 7:47 PM, Paul E. McKenney
<paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> On Sun, Mar 05, 2017 at 11:50:39AM +0100, Dmitry Vyukov wrote:
>> On Sat, Mar 4, 2017 at 9:40 PM, Paul E. McKenney
>> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
>> > On Sat, Mar 04, 2017 at 05:01:19PM +0100, Dmitry Vyukov wrote:
>> >> Hello,
>> >>
>> >> Paul, you wanted bugs in rcu.
>> >
>> > Well, whether I want them or not, I must deal with them. ;-)
>> >
>> >> I've got this WARNING while running syzkaller fuzzer on
>> >> 86292b33d4b79ee03e2f43ea0381ef85f077c760:
>> >>
>> >> ------------[ cut here ]------------
>> >> WARNING: CPU: 0 PID: 4832 at kernel/rcu/tree.c:3533
>> >> rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533
>> >> Kernel panic - not syncing: panic_on_warn set ...
>> >> CPU: 0 PID: 4832 Comm: kworker/0:3 Not tainted 4.10.0+ #276
>> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>> >> Workqueue: events wait_rcu_exp_gp
>> >> Call Trace:
>> >> __dump_stack lib/dump_stack.c:15 [inline]
>> >> dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
>> >> panic+0x1fb/0x412 kernel/panic.c:179
>> >> __warn+0x1c4/0x1e0 kernel/panic.c:540
>> >> warn_slowpath_null+0x2c/0x40 kernel/panic.c:583
>> >> rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533
>> >> rcu_exp_gp_seq_end kernel/rcu/tree_exp.h:36 [inline]
>> >> rcu_exp_wait_wake+0x8a9/0x1330 kernel/rcu/tree_exp.h:517
>> >> rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:559 [inline]
>> >> wait_rcu_exp_gp+0x83/0xc0 kernel/rcu/tree_exp.h:570
>> >> process_one_work+0xc06/0x1c20 kernel/workqueue.c:2096
>> >> worker_thread+0x223/0x19c0 kernel/workqueue.c:2230
>> >> kthread+0x326/0x3f0 kernel/kthread.c:227
>> >> ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430
>> >> Dumping ftrace buffer:
>> >> (ftrace buffer empty)
>> >> Kernel Offset: disabled
>> >> Rebooting in 86400 seconds..
>> >>
>> >>
>> >> Not reproducible. But looking at the code, shouldn't it be:
>> >>
>> >> static void rcu_seq_end(unsigned long *sp)
>> >> {
>> >> smp_mb(); /* Ensure update-side operation before counter increment. */
>> >> + WARN_ON_ONCE(!(*sp & 0x1));
>> >> WRITE_ONCE(*sp, *sp + 1);
>> >> - WARN_ON_ONCE(*sp & 0x1);
>> >> }
>> >>
>> >> ?
>> >>
>> >> Otherwise wait_event in _synchronize_rcu_expedited can return as soon
>> >> as WRITE_ONCE(*sp, *sp + 1) finishes. As far as I understand this
>> >> consequently can allow start of next grace periods. Which in turn can
>> >> make the warning fire. Am I missing something?
>> >>
>> >> I don't see any other bad consequences of this. The rest of
>> >> rcu_exp_wait_wake can proceed when _synchronize_rcu_expedited has
>> >> returned and destroyed work on stack and next period has started and
>> >> ended, but it seems OK.
>> >
>> > I believe that this is a heygood change, but I don't see how it will
>> > help in this case. BTW, may I have your Signed-off-by?
>> >
>> > The reason I don't believe that it will help is that the
>> > rcu_exp_gp_seq_end() function is called from a workqueue handler that
>> > is invoked holding ->exp_mutex, and this mutex is not released until
>> > after the handler invokes rcu_seq_end() and then wakes up the task that
>> > scheduled the workqueue handler. So the ordering above should not matter
>> > (but I agree that your ordering is cleaner.
>> >
>> > That said, it looks like I am missing some memory barriers, please
>> > see the following patch.
>> >
>> > But what architecture did you see this on?
>>
>>
>> This is just x86.
>>
>> You seem to assume that wait_event() waits for the wakeup. It does not
>> work this way. It can return as soon as the condition becomes true
>> without ever waiting:
>>
>> 305 #define wait_event(wq, condition) \
>> 306 do { \
>> 307 might_sleep(); \
>> 308 if (condition) \
>> 309 break; \
>> 310 __wait_event(wq, condition); \
>> 311 } while (0)
>
> Agreed, hence my patch in the previous email. I guess I knew that, but

Ah, you meant to synchronize rcu_seq_end with rcu_seq_done?
I think you placed the barrier incorrectly for that. rcu_exp_wait_wake
is already too late. The write that unblocks waiter is in rcu_seq_end
so you need a release barrier _before_ that write.
Also can we please start using smp_load_acquire/smp_store_release
where they are what doctor said. They are faster, more readable,
better for race detectors _and_ would prevent you from introducing
this bug, because you would need to find the exact write that
signifies completion. I.e.:

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d80c2587bed8..aa7ba83f6a56 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3534,7 +3534,7 @@ static void rcu_seq_start(unsigned long *sp)
static void rcu_seq_end(unsigned long *sp)
{
smp_mb(); /* Ensure update-side operation before counter increment. */
- WRITE_ONCE(*sp, *sp + 1);
+ smp_store_release(sp, *sp + 1);
WARN_ON_ONCE(*sp & 0x1);
}

@@ -3554,7 +3554,7 @@ static unsigned long rcu_seq_snap(unsigned long *sp)
*/
static bool rcu_seq_done(unsigned long *sp, unsigned long s)
{
- return ULONG_CMP_GE(READ_ONCE(*sp), s);
+ return ULONG_CMP_GE(smp_load_acquire(sp), s);
}



> on the day I wrote that code, my fingers didn't. Or somew similar lame
> excuse. ;-)
>
>> Mailed a signed patch:
>> https://groups.google.com/d/msg/syzkaller/XzUXuAzKkCw/5054wU9MEAAJ
>
> This is the patch you also sent by email, that moves the WARN_ON_ONCE(),
> thank you!