Re: RCU nocb list not reclaiming causing OOM

From: David Chen
Date: Fri Jul 27 2018 - 15:07:57 EST


Hi Paul,

I'd like to opinion again on this subject.

So we are going to backport this patch:
6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")

But the other one:
8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")
It doesn't apply cleanly, and I'm not too comfortable porting it myself.

So I'm wondering if I use the following change to always wake up follower thread
regardless if the list was empty or not, just to be on the safe side. Do you think
this change is reasonable? Do you see any problem it might cause?

Thanks,
David

diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
--- linux-4.9.37.orig/kernel/rcu/tree_plugin.h 2017-07-12 06:42:41.000000000 -0700
+++ linux-4.9.37/kernel/rcu/tree_plugin.h 2018-07-27 11:57:03.582134519 -0700
@@ -2077,7 +2077,7 @@
tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
*tail = rdp->nocb_gp_head;
smp_mb__after_atomic(); /* Store *tail before wakeup. */
- if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
+ if (rdp != my_rdp) {
/*
* List was empty, wake up the follower.
* Memory barriers supplied by atomic_long_add().


From: David Chen
Sent: Friday, July 20, 2018 5:12 PM
To: paulmck@xxxxxxxxxxxxxxxxxx
Cc: linux-kernel@xxxxxxxxxxxxxxx
Subject: Re: RCU nocb list not reclaiming causing OOM
 

Hi Paul,

Ok, I'll try those patches.

Thanks,
David

From: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
Sent: Friday, July 20, 2018 4:32:12 PM
To: David Chen
Cc: linux-kernel@xxxxxxxxxxxxxxx
Subject: Re: RCU nocb list not reclaiming causing OOM
 

On Fri, Jul 20, 2018 at 11:05:52PM +0000, David Chen wrote:
> Hi Paul,
>
> We hit an RCU issue on 4.9.37 kernel. One of the nocb_follower list grows too
> large, and not getting reclaimed, causing the system to OOM.
>
> Printing the culprit rcu_sched_data:
>
>   nocb_q_count = {
>     counter = 32369635
>   },
>   nocb_follower_head = 0xffff88ae901c0a00,
>   nocb_follower_tail = 0xffff88af1538b8d8,
>   nocb_kthread = 0xffff88b06d290000,
>
> As you can see here, the nocb_follower_head is not empty, so in theory, the
> nocb_kthread shouldn't go to sleep. However, if dump the stack of the kthread:
>
> crash> bt 0xffff88b06d290000
> PID: 21     TASK: ffff88b06d290000  CPU: 3   COMMAND: "rcuos/1"
>  #0 [ffffafc9020b7dc0] __schedule at ffffffff8d8789dc
>  #1 [ffffafc9020b7e38] schedule at ffffffff8d878e76
>  #2 [ffffafc9020b7e50] rcu_nocb_kthread at ffffffff8d112337
>  #3 [ffffafc9020b7ec8] kthread at ffffffff8d0c6ce7
>  #4 [ffffafc9020b7f50] ret_from_fork at ffffffff8d87d755
>
> And if we dis the address at ffffffff8d112337:
>
> /usr/src/debug/kernel-4.9.37/linux-4.9.37-29.nutanix.07142017.el7.centos.x86_64/kernel/rcu/tree_plugin.h: 2106
> 0xffffffff8d11232d <rcu_nocb_kthread+381>:      test   %rax,%rax
> 0xffffffff8d112330 <rcu_nocb_kthread+384>:      jne    0xffffffff8d112355 <rcu_nocb_kthread+421>
> 0xffffffff8d112332 <rcu_nocb_kthread+386>:      callq  0xffffffff8d878e40 <schedule>
> 0xffffffff8d112337 <rcu_nocb_kthread+391>:      lea    -0x40(%rbp),%rsi
>
> So the kthread is blocked at swait_event_interruptible in the nocb_follower_wait.
> This contradict with the fact that nocb_follower_head was not empty. So I
> wonder if this is caused by the lack of memory barrier in the place shown below.
> If the head is set to NULL after doing xchg, it will overwrite the head set
> by leader. This caused the kthread to sleep the next iteration, and the leader
> won't wake him up as the tail doesn't point to head.
>
> Please tell me what do you think.
>
> Thanks,
> David
>
> diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
> --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h        2017-07-12 06:42:41.000000000 -0700
> +++ linux-4.9.37/kernel/rcu/tree_plugin.h     2018-07-20 15:25:57.311206343 -0700
> @@ -2149,6 +2149,7 @@
>                BUG_ON(!list);
>                trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "WokeNonEmpty");
>                WRITE_ONCE(rdp->nocb_follower_head, NULL);
> +             smp_mb();
>                tail = xchg(&rdp->nocb_follower_tail, &rdp->nocb_follower_head);

The xchg() operation implies full memory barriers both before and after,
so adding the smp_mb() before would have no effect.

But let me take a look at post-4.9 changes to this code...

I suggest trying out the following commit:

6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")

If that one doesn't help, the following might be worth trying, but probably
a lot harder to backport:

8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")

Please let me know how it goes!

                                                        Thanx, Paul

------------------------------------------------------------------------

commit 6b5fc3a1331810db407c9e0e673dc1837afdc9d0
Author: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
Date:   Fri Apr 28 20:11:09 2017 -0700

    rcu: Add memory barriers for NOCB leader wakeup
   
    Wait/wakeup operations do not guarantee ordering on their own.  Instead,
    either locking or memory barriers are required.  This commit therefore
    adds memory barriers to wake_nocb_leader() and nocb_leader_wait().
   
    Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
    Tested-by: Krister Johansen <kjlx@xxxxxxxxxxxxxxxxxx>
    Cc: <stable@xxxxxxxxxxxxxxx> # 4.6.x

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 0b1042545116..573fbe9640a0 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1810,6 +1810,7 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force)
         if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) {
                 /* Prior smp_mb__after_atomic() orders against prior enqueue. */
                 WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
+               smp_mb(); /* ->nocb_leader_sleep before swake_up(). */
                 swake_up(&rdp_leader->nocb_wq);
         }
 }
@@ -2064,6 +2065,7 @@ static void nocb_leader_wait(struct rcu_data *my_rdp)
          * nocb_gp_head, where they await a grace period.
          */
         gotcbs = false;
+       smp_mb(); /* wakeup before ->nocb_head reads. */
         for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
                 rdp->nocb_gp_head = READ_ONCE(rdp->nocb_head);
                 if (!rdp->nocb_gp_head)