Re: rcu_preempt caused oom

From: Paul E. McKenney
Date: Thu Dec 13 2018 - 21:15:41 EST


On Fri, Dec 14, 2018 at 01:30:04AM +0000, He, Bo wrote:
> as you mentioned CONFIG_FAST_NO_HZ, do you mean CONFIG_RCU_FAST_NO_HZ? I double-checked: there is no FAST_NO_HZ in .config:

Yes, you are correct: CONFIG_RCU_FAST_NO_HZ. OK, you do not have it set,
which means that several code paths can be ignored. Also, CONFIG_HZ=1000,
so the roughly 300,000-jiffy delay works out to about 300 seconds.

Thanx, Paul

> Here is the grep from .config:
> egrep "HZ|RCU" .config
> CONFIG_NO_HZ_COMMON=y
> # CONFIG_HZ_PERIODIC is not set
> CONFIG_NO_HZ_IDLE=y
> # CONFIG_NO_HZ_FULL is not set
> CONFIG_NO_HZ=y
> # RCU Subsystem
> CONFIG_PREEMPT_RCU=y
> # CONFIG_RCU_EXPERT is not set
> CONFIG_SRCU=y
> CONFIG_TREE_SRCU=y
> CONFIG_TASKS_RCU=y
> CONFIG_RCU_STALL_COMMON=y
> CONFIG_RCU_NEED_SEGCBLIST=y
> # CONFIG_HZ_100 is not set
> # CONFIG_HZ_250 is not set
> # CONFIG_HZ_300 is not set
> CONFIG_HZ_1000=y
> CONFIG_HZ=1000
> # CONFIG_MACHZ_WDT is not set
> # RCU Debugging
> CONFIG_PROVE_RCU=y
> CONFIG_RCU_PERF_TEST=m
> CONFIG_RCU_TORTURE_TEST=m
> CONFIG_RCU_CPU_STALL_TIMEOUT=7
> CONFIG_RCU_TRACE=y
> CONFIG_RCU_EQS_DEBUG=y
>
> -----Original Message-----
> From: Paul E. McKenney <paulmck@xxxxxxxxxxxxx>
> Sent: Friday, December 14, 2018 2:12 AM
> To: He, Bo <bo.he@xxxxxxxxx>
> Cc: Zhang, Jun <jun.zhang@xxxxxxxxx>; Steven Rostedt <rostedt@xxxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; josh@xxxxxxxxxxxxxxxx; mathieu.desnoyers@xxxxxxxxxxxx; jiangshanlai@xxxxxxxxx; Xiao, Jin <jin.xiao@xxxxxxxxx>; Zhang, Yanmin <yanmin.zhang@xxxxxxxxx>; Bai, Jie A <jie.a.bai@xxxxxxxxx>; Sun, Yi J <yi.j.sun@xxxxxxxxx>
> Subject: Re: rcu_preempt caused oom
>
> On Thu, Dec 13, 2018 at 03:26:08PM +0000, He, Bo wrote:
> > one of the boards reproduced the issue with show_rcu_gp_kthreads(); I have also enclosed the logs as an attachment.
> >
> > [17818.936032] rcu: rcu_preempt: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 308257 ->gp_req_activity 308256 ->gp_wake_time 308258 ->gp_wake_seq 21808189 ->gp_seq 21808192 ->gp_seq_needed 21808196 ->gp_flags 0x1
>
> This is quite helpful, thank you!
>
> The "RCU lockdep checking is enabled" says that CONFIG_PROVE_RCU=y, which is good. The "RCU_GP_WAIT_GPS(1)" means that the rcu_preempt task is waiting for a new grace-period request. The "->state: 0x402" means that it is sleeping, neither running nor in the process of waking up.
> The "delta ->gp_activity 308257 ->gp_req_activity 308256 ->gp_wake_time 308258" means that it has been more than 300,000 jiffies since the rcu_preempt task did anything or was requested to do anything.
>
> The "->gp_wake_seq 21808189 ->gp_seq 21808192" says that the last attempt to awaken the rcu_preempt task happened during the last grace period.
> The "->gp_seq_needed 21808196 ->gp_flags 0x1" nevertheless says that someone requested a new grace period. So if the rcu_preempt task were to wake up, it would process the new grace period. Note again also the ->gp_req_activity 308256, which indicates that ->gp_flags was set more than 300,000 jiffies ago, just after the last recorded activity of the rcu_preempt task.
>
> But this is exactly the situation that rcu_check_gp_start_stall() is designed to warn about (and does warn about for me when I comment out the wakeup code). So why is rcu_check_gp_start_stall() not being called? Here are a couple of possibilities:
>
> 1. Because rcu_check_gp_start_stall() is only ever invoked from
> RCU_SOFTIRQ, it is possible that softirqs are stalled for
> whatever reason.
>
> 2. Because RCU_SOFTIRQ is invoked primarily from the scheduler-clock
> interrupt handler, it is possible that the scheduler tick has
> somehow been disabled. Traces from earlier runs showed a great
> deal of RCU callbacks queued, which would have caused RCU to
> refuse to allow the scheduler tick to be disabled, even if the
> corresponding CPU was idle.
>
> 3. You have CONFIG_FAST_NO_HZ=y (which you probably do, given
> that you are building for a battery-powered device) and all of the
> CPU's callbacks are lazy. Except that your earlier traces showed
> lots of non-lazy callbacks. Besides, even if all callbacks were
> lazy, there would still be a scheduling-clock interrupt every
> six seconds, and there are quite a few six-second intervals
> in a two-minute watchdog timeout.
>
> But if we cannot find the problem quickly, I will likely ask
> you to try reproducing with CONFIG_FAST_NO_HZ=n. This could
> be thought of as bisecting the RCU code looking for the bug.
>
> The first two of these seem unlikely given that the watchdog timer was still firing. Still, I otherwise don't see how 300,000 jiffies could have elapsed with a grace period requested but not started. Could you please check?
> One way to do so would be to enable ftrace on rcu_check_callbacks(), __rcu_process_callbacks(), and rcu_check_gp_start_stall(). It might be necessary to mark rcu_check_gp_start_stall() as noinline. You might have better ways to collect this information.
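> A minimal ftrace recipe for that (assuming the usual tracefs mount point; rcu_check_gp_start_stall() will only show up in available_filter_functions if it has not been inlined, hence the noinline suggestion):

```shell
# Function-trace the RCU-softirq entry points in question (function
# names as in v4.19's kernel/rcu/tree.c).
cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo function > current_tracer
echo rcu_check_callbacks > set_ftrace_filter
echo __rcu_process_callbacks >> set_ftrace_filter
echo rcu_check_gp_start_stall >> set_ftrace_filter
echo 1 > tracing_on
# Reproduce the hang, then save the buffer:
cat trace > /tmp/rcu-ftrace.log
```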
>
> Without this information, the only workaround patch I can give you will degrade battery lifetime, which might not be what you want.
>
> You do have a lockdep complaint early at boot. Although I don't immediately see how this self-deadlock would affect RCU, please do get it fixed. Sometimes the consequences of this sort of deadlock can propagate to unexpected places.
>
> Regardless of why rcu_check_gp_start_stall() failed to complain, it looks like ->gp_flags was set after the rcu_preempt task slept for the last time, and so there should have been a wakeup the last time that ->gp_flags was set. Perhaps there is some code path that drops the wakeup.
> I did check this in current -rcu, but you are instead running v4.19, so I should also check there.
>
> The ->gp_flags has its RCU_GP_FLAG_INIT bit set in rcu_start_this_gp() and in rcu_gp_cleanup(). We can eliminate rcu_gp_cleanup() from consideration because only the rcu_preempt task will execute that code, and we know that this task was asleep at the last time this bit was set.
> Now rcu_start_this_gp() returns a flag indicating whether or not a wakeup is needed, and the caller must do the wakeup once it is safe to do so, that is, after the various rcu_node locks have been released (doing a wakeup while holding any of those locks results in deadlock).
>
> The following functions invoke rcu_start_this_gp: rcu_accelerate_cbs() and rcu_nocb_wait_gp(). We can eliminate rcu_nocb_wait_gp() because you are building with CONFIG_RCU_NOCB_CPU=n. Then rcu_accelerate_cbs() is invoked from:
>
> o rcu_accelerate_cbs_unlocked(), which does the following, thus
> properly awakening the rcu_preempt task when needed:
>
> 	needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> 	raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
> 	if (needwake)
> 		rcu_gp_kthread_wake(rsp);
>
> o rcu_advance_cbs(), which returns the value returned by
> rcu_accelerate_cbs(), thus pushing the problem off to its
> callers, which are called out below.
>
> o __note_gp_changes(), which also returns the value returned by
> rcu_accelerate_cbs(), thus pushing the problem off to its callers,
> which are called out below.
>
> o rcu_gp_cleanup(), which is only ever invoked by RCU grace-period
> kthreads such as the rcu_preempt task. Therefore, this function
> never needs to awaken the rcu_preempt task, because the fact
> that this function is executing means that this task is already
> awake. (Also, as noted above, we can eliminate this code from
> consideration because this task is known to have been sleeping
> at the last time that the RCU_GP_FLAG_INIT bit was set.)
>
> o rcu_report_qs_rdp(), which does the following, thus properly
> awakening the rcu_preempt task when needed:
>
> 	needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
>
> 	rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
> 	/* ^^^ Released rnp->lock */
> 	if (needwake)
> 		rcu_gp_kthread_wake(rsp);
>
> o rcu_prepare_for_idle(), which does the following, thus properly
> awakening the rcu_preempt task when needed:
>
> 	needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> 	raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
> 	if (needwake)
> 		rcu_gp_kthread_wake(rsp);
>
> Now for rcu_advance_cbs():
>
> o __note_gp_changes(), which also returns the value returned
> by rcu_advance_cbs(), thus pushing the problem off to its callers,
> which are called out below.
>
> o rcu_migrate_callbacks(), which does the following, thus properly
> awakening the rcu_preempt task when needed:
>
> 	needwake = rcu_advance_cbs(rsp, rnp_root, rdp) ||
> 		   rcu_advance_cbs(rsp, rnp_root, my_rdp);
> 	rcu_segcblist_merge(&my_rdp->cblist, &rdp->cblist);
> 	WARN_ON_ONCE(rcu_segcblist_empty(&my_rdp->cblist) !=
> 		     !rcu_segcblist_n_cbs(&my_rdp->cblist));
> 	raw_spin_unlock_irqrestore_rcu_node(rnp_root, flags);
> 	if (needwake)
> 		rcu_gp_kthread_wake(rsp);
>
> Now for __note_gp_changes():
>
> o note_gp_changes(), which does the following, thus properly
> awakening the rcu_preempt task when needed:
>
> 	needwake = __note_gp_changes(rsp, rnp, rdp);
> 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> 	if (needwake)
> 		rcu_gp_kthread_wake(rsp);
>
> o rcu_gp_init() which is only ever invoked by RCU grace-period
> kthreads such as the rcu_preempt task, which makes wakeups
> unnecessary, just as for rcu_gp_cleanup() above.
>
> o rcu_gp_cleanup(), ditto.
>
> So I am not seeing how I am losing a wakeup, but please do feel free to double-check my analysis. One way to do that is using event tracing.
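> For example, the relevant request and wakeup activity is visible via the rcu:rcu_grace_period and rcu:rcu_future_grace_period trace events (available with CONFIG_RCU_TRACE=y, as in the .config above; tracefs mount point may differ on your system):

```shell
# Enable RCU grace-period event tracing.
cd /sys/kernel/debug/tracing
echo 1 > events/rcu/rcu_grace_period/enable
echo 1 > events/rcu/rcu_future_grace_period/enable
echo 1 > tracing_on
# Reproduce the hang, then save the buffer and inspect the tail to see
# who last requested a grace period and whether a wakeup followed:
cat trace > /tmp/rcu-events.log
grep -E 'rcu_grace_period|rcu_future_grace_period' /tmp/rcu-events.log | tail -n 100
```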
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
> lockdep complaint:
> ------------------------------------------------------------------------
>
> [ 2.895507] ======================================================
> [ 2.895511] WARNING: possible circular locking dependency detected
> [ 2.895517] 4.19.5-quilt-2e5dc0ac-g4d59bbd0fd1a #1 Tainted: G U
> [ 2.895521] ------------------------------------------------------
> [ 2.895525] earlyEvs/1839 is trying to acquire lock:
> [ 2.895530] 00000000ff344115 (&asd->mutex){+.+.}, at: ipu_isys_subdev_get_ffmt+0x32/0x90
> [ 2.895546]
> [ 2.895546] but task is already holding lock:
> [ 2.895550] 0000000069562e72 (&mdev->graph_mutex){+.+.}, at: media_pipeline_start+0x28/0x50
> [ 2.895561]
> [ 2.895561] which lock already depends on the new lock.
> [ 2.895561]
> [ 2.895566]
> [ 2.895566] the existing dependency chain (in reverse order) is:
> [ 2.895570]
> [ 2.895570] -> #1 (&mdev->graph_mutex){+.+.}:
> [ 2.895583] __mutex_lock+0x80/0x9a0
> [ 2.895588] mutex_lock_nested+0x1b/0x20
> [ 2.895593] media_device_register_entity+0x92/0x1e0
> [ 2.895598] v4l2_device_register_subdev+0xc2/0x1b0
> [ 2.895604] ipu_isys_csi2_init+0x22c/0x520
> [ 2.895608] isys_probe+0x6cb/0xed0
> [ 2.895613] ipu_bus_probe+0xfd/0x2e0
> [ 2.895620] really_probe+0x268/0x3d0
> [ 2.895625] driver_probe_device+0x11a/0x130
> [ 2.895630] __device_attach_driver+0x86/0x100
> [ 2.895635] bus_for_each_drv+0x6e/0xb0
> [ 2.895640] __device_attach+0xdf/0x160
> [ 2.895645] device_initial_probe+0x13/0x20
> [ 2.895650] bus_probe_device+0xa6/0xc0
> [ 2.895655] deferred_probe_work_func+0x88/0xe0
> [ 2.895661] process_one_work+0x220/0x5c0
> [ 2.895665] worker_thread+0x1da/0x3b0
> [ 2.895670] kthread+0x12c/0x150
> [ 2.895675] ret_from_fork+0x3a/0x50
> [ 2.895678]
> [ 2.895678] -> #0 (&asd->mutex){+.+.}:
> [ 2.895688] lock_acquire+0x95/0x1a0
> [ 2.895693] __mutex_lock+0x80/0x9a0
> [ 2.895698] mutex_lock_nested+0x1b/0x20
> [ 2.895703] ipu_isys_subdev_get_ffmt+0x32/0x90
> [ 2.895708] ipu_isys_csi2_get_fmt+0x14/0x30
> [ 2.895713] v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
> [ 2.895718] v4l2_subdev_link_validate_one+0x67/0x120
> [ 2.895723] v4l2_subdev_link_validate+0x246/0x490
> [ 2.895728] csi2_link_validate+0xc6/0x220
> [ 2.895733] __media_pipeline_start+0x15b/0x2f0
> [ 2.895738] media_pipeline_start+0x33/0x50
> [ 2.895743] ipu_isys_video_prepare_streaming+0x1e0/0x610
> [ 2.895748] start_streaming+0x186/0x3a0
> [ 2.895753] vb2_start_streaming+0x6d/0x130
> [ 2.895758] vb2_core_streamon+0x108/0x140
> [ 2.895762] vb2_streamon+0x29/0x50
> [ 2.895767] vb2_ioctl_streamon+0x42/0x50
> [ 2.895772] v4l_streamon+0x20/0x30
> [ 2.895776] __video_do_ioctl+0x1af/0x3c0
> [ 2.895781] video_usercopy+0x27e/0x7e0
> [ 2.895785] video_ioctl2+0x15/0x20
> [ 2.895789] v4l2_ioctl+0x49/0x50
> [ 2.895794] do_video_ioctl+0x93c/0x2360
> [ 2.895799] v4l2_compat_ioctl32+0x93/0xe0
> [ 2.895806] __ia32_compat_sys_ioctl+0x73a/0x1c90
> [ 2.895813] do_fast_syscall_32+0x9a/0x2d6
> [ 2.895818] entry_SYSENTER_compat+0x6d/0x7c
> [ 2.895821]
> [ 2.895821] other info that might help us debug this:
> [ 2.895821]
> [ 2.895826] Possible unsafe locking scenario:
> [ 2.895826]
> [ 2.895830] CPU0 CPU1
> [ 2.895833] ---- ----
> [ 2.895836] lock(&mdev->graph_mutex);
> [ 2.895842] lock(&asd->mutex);
> [ 2.895847] lock(&mdev->graph_mutex);
> [ 2.895852] lock(&asd->mutex);
> [ 2.895857]
> [ 2.895857] *** DEADLOCK ***
> [ 2.895857]
> [ 2.895863] 3 locks held by earlyEvs/1839:
> [ 2.895866] #0: 00000000ed860090 (&av->mutex){+.+.}, at: __video_do_ioctl+0xbf/0x3c0
> [ 2.895876] #1: 000000000cb253e7 (&isys->stream_mutex){+.+.}, at: start_streaming+0x5c/0x3a0
> [ 2.895886] #2: 0000000069562e72 (&mdev->graph_mutex){+.+.}, at: media_pipeline_start+0x28/0x50
> [ 2.895896]
> [ 2.895896] stack backtrace:
> [ 2.895903] CPU: 0 PID: 1839 Comm: earlyEvs Tainted: G U 4.19.5-quilt-2e5dc0ac-g4d59bbd0fd1a #1
> [ 2.895907] Call Trace:
> [ 2.895915] dump_stack+0x70/0xa5
> [ 2.895921] print_circular_bug.isra.35+0x1d8/0x1e6
> [ 2.895927] __lock_acquire+0x1284/0x1340
> [ 2.895931] ? __lock_acquire+0x2b5/0x1340
> [ 2.895940] lock_acquire+0x95/0x1a0
> [ 2.895945] ? lock_acquire+0x95/0x1a0
> [ 2.895950] ? ipu_isys_subdev_get_ffmt+0x32/0x90
> [ 2.895956] ? ipu_isys_subdev_get_ffmt+0x32/0x90
> [ 2.895961] __mutex_lock+0x80/0x9a0
> [ 2.895966] ? ipu_isys_subdev_get_ffmt+0x32/0x90
> [ 2.895971] ? crlmodule_get_format+0x43/0x50
> [ 2.895979] mutex_lock_nested+0x1b/0x20
> [ 2.895984] ? mutex_lock_nested+0x1b/0x20
> [ 2.895989] ipu_isys_subdev_get_ffmt+0x32/0x90
> [ 2.895995] ipu_isys_csi2_get_fmt+0x14/0x30
> [ 2.896001] v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
> [ 2.896006] v4l2_subdev_link_validate_one+0x67/0x120
> [ 2.896011] ? crlmodule_get_format+0x2a/0x50
> [ 2.896018] ? find_held_lock+0x35/0xa0
> [ 2.896023] ? crlmodule_get_format+0x43/0x50
> [ 2.896030] v4l2_subdev_link_validate+0x246/0x490
> [ 2.896035] ? __mutex_unlock_slowpath+0x58/0x2f0
> [ 2.896042] ? mutex_unlock+0x12/0x20
> [ 2.896046] ? crlmodule_get_format+0x43/0x50
> [ 2.896052] ? v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
> [ 2.896057] ? v4l2_subdev_link_validate_one+0x67/0x120
> [ 2.896065] ? __is_insn_slot_addr+0xad/0x120
> [ 2.896070] ? kernel_text_address+0xc4/0x100
> [ 2.896078] ? v4l2_subdev_link_validate+0x246/0x490
> [ 2.896085] ? kernel_text_address+0xc4/0x100
> [ 2.896092] ? __lock_acquire+0x1106/0x1340
> [ 2.896096] ? __lock_acquire+0x1169/0x1340
> [ 2.896103] csi2_link_validate+0xc6/0x220
> [ 2.896110] ? __lock_is_held+0x5a/0xa0
> [ 2.896115] ? mark_held_locks+0x58/0x80
> [ 2.896122] ? __kmalloc+0x207/0x2e0
> [ 2.896127] ? __lock_is_held+0x5a/0xa0
> [ 2.896134] ? rcu_read_lock_sched_held+0x81/0x90
> [ 2.896139] ? __kmalloc+0x2a3/0x2e0
> [ 2.896144] ? media_pipeline_start+0x28/0x50
> [ 2.896150] ? __media_entity_enum_init+0x33/0x70
> [ 2.896155] ? csi2_has_route+0x18/0x20
> [ 2.896160] ? media_graph_walk_next.part.9+0xac/0x290
> [ 2.896166] __media_pipeline_start+0x15b/0x2f0
> [ 2.896173] ? rcu_read_lock_sched_held+0x81/0x90
> [ 2.896179] media_pipeline_start+0x33/0x50
> [ 2.896186] ipu_isys_video_prepare_streaming+0x1e0/0x610
> [ 2.896191] ? __lock_acquire+0x132e/0x1340
> [ 2.896198] ? __lock_acquire+0x2b5/0x1340
> [ 2.896204] ? lock_acquire+0x95/0x1a0
> [ 2.896209] ? start_streaming+0x5c/0x3a0
> [ 2.896215] ? start_streaming+0x5c/0x3a0
> [ 2.896221] ? __mutex_lock+0x391/0x9a0
> [ 2.896226] ? v4l_enable_media_source+0x2d/0x70
> [ 2.896233] ? find_held_lock+0x35/0xa0
> [ 2.896238] ? v4l_enable_media_source+0x57/0x70
> [ 2.896245] start_streaming+0x186/0x3a0
> [ 2.896250] ? __mutex_unlock_slowpath+0x58/0x2f0
> [ 2.896257] vb2_start_streaming+0x6d/0x130
> [ 2.896262] ? vb2_start_streaming+0x6d/0x130
> [ 2.896267] vb2_core_streamon+0x108/0x140
> [ 2.896273] vb2_streamon+0x29/0x50
> [ 2.896278] vb2_ioctl_streamon+0x42/0x50
> [ 2.896284] v4l_streamon+0x20/0x30
> [ 2.896288] __video_do_ioctl+0x1af/0x3c0
> [ 2.896296] ? __might_fault+0x85/0x90
> [ 2.896302] video_usercopy+0x27e/0x7e0
> [ 2.896307] ? copy_overflow+0x20/0x20
> [ 2.896313] ? find_held_lock+0x35/0xa0
> [ 2.896319] ? __might_fault+0x3e/0x90
> [ 2.896325] video_ioctl2+0x15/0x20
> [ 2.896330] v4l2_ioctl+0x49/0x50
> [ 2.896335] do_video_ioctl+0x93c/0x2360
> [ 2.896343] v4l2_compat_ioctl32+0x93/0xe0
> [ 2.896349] __ia32_compat_sys_ioctl+0x73a/0x1c90
> [ 2.896354] ? lockdep_hardirqs_on+0xef/0x180
> [ 2.896359] ? do_fast_syscall_32+0x3b/0x2d6
> [ 2.896364] do_fast_syscall_32+0x9a/0x2d6
> [ 2.896370] entry_SYSENTER_compat+0x6d/0x7c
> [ 2.896377] RIP: 0023:0xf7e79b79
> [ 2.896382] Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 0c 24 c3 8b 1c 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
> [ 2.896387] RSP: 002b:00000000f76816bc EFLAGS: 00000292 ORIG_RAX: 0000000000000036
> [ 2.896393] RAX: ffffffffffffffda RBX: 000000000000000e RCX: 0000000040045612
> [ 2.896396] RDX: 00000000f768172c RSI: 00000000f7d42d9c RDI: 00000000f768172c
> [ 2.896400] RBP: 00000000f7681708 R08: 0000000000000000 R09: 0000000000000000
> [ 2.896404] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [ 2.896408] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>
> ------------------------------------------------------------------------
>
> > [17818.936039] rcu: rcu_node 0:3 ->gp_seq 21808192 ->gp_seq_needed 21808196
> > [17818.936048] rcu: rcu_sched: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 101730 ->gp_req_activity 101732 ->gp_wake_time 101730 ->gp_wake_seq 1357 ->gp_seq 1360 ->gp_seq_needed 1360 ->gp_flags 0x0
> > [17818.936056] rcu: rcu_bh: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 4312486108 ->gp_req_activity 4312486108 ->gp_wake_time 4312486108 ->gp_wake_seq 0 ->gp_seq -1200 ->gp_seq_needed -1200 ->gp_flags 0x0
> >
> > -----Original Message-----
> > From: Paul E. McKenney <paulmck@xxxxxxxxxxxxx>
> > Sent: Thursday, December 13, 2018 12:40 PM
> > To: Zhang, Jun <jun.zhang@xxxxxxxxx>
> > Cc: He, Bo <bo.he@xxxxxxxxx>; Steven Rostedt <rostedt@xxxxxxxxxxx>;
> > linux-kernel@xxxxxxxxxxxxxxx; josh@xxxxxxxxxxxxxxxx;
> > mathieu.desnoyers@xxxxxxxxxxxx; jiangshanlai@xxxxxxxxx; Xiao, Jin
> > <jin.xiao@xxxxxxxxx>; Zhang, Yanmin <yanmin.zhang@xxxxxxxxx>; Bai, Jie
> > A <jie.a.bai@xxxxxxxxx>; Sun, Yi J <yi.j.sun@xxxxxxxxx>
> > Subject: Re: rcu_preempt caused oom
> >
> > On Thu, Dec 13, 2018 at 03:28:46AM +0000, Zhang, Jun wrote:
> > > Ok, we will test it, thanks!
> >
> > But please also try the sysrq-y with the earlier patch after a hang!
> >
> > Thanx, Paul
> >
> > > -----Original Message-----
> > > From: Paul E. McKenney [mailto:paulmck@xxxxxxxxxxxxx]
> > > Sent: Thursday, December 13, 2018 10:43
> > > To: Zhang, Jun <jun.zhang@xxxxxxxxx>
> > > Cc: He, Bo <bo.he@xxxxxxxxx>; Steven Rostedt <rostedt@xxxxxxxxxxx>;
> > > linux-kernel@xxxxxxxxxxxxxxx; josh@xxxxxxxxxxxxxxxx;
> > > mathieu.desnoyers@xxxxxxxxxxxx; jiangshanlai@xxxxxxxxx; Xiao, Jin
> > > <jin.xiao@xxxxxxxxx>; Zhang, Yanmin <yanmin.zhang@xxxxxxxxx>; Bai,
> > > Jie A <jie.a.bai@xxxxxxxxx>; Sun, Yi J <yi.j.sun@xxxxxxxxx>
> > > Subject: Re: rcu_preempt caused oom
> > >
> > > On Thu, Dec 13, 2018 at 02:11:35AM +0000, Zhang, Jun wrote:
> > > > Hello, Paul
> > > >
> > > > I think the next patch is better.
> > > > Because ULONG_CMP_GE could cause a double write, which carries the risk of writing back an old value.
> > > > Please help review.
> > > > I haven't tested it. If you agree, we will test it.
> > >
> > > Just to make sure that I understand, you are worried about something like the following, correct?
> > >
> > > o __note_gp_changes() compares rnp->gp_seq_needed and rdp->gp_seq_needed
> > > and finds them equal.
> > >
> > > o At just this time something like rcu_start_this_gp() assigns a new
> > > (larger) value to rdp->gp_seq_needed.
> > >
> > > o Then __note_gp_changes() overwrites rdp->gp_seq_needed with the
> > > old value.
> > >
> > > This cannot happen because __note_gp_changes() runs with interrupts disabled on the CPU corresponding to the rcu_data structure referenced by the rdp pointer. So there is no way for rcu_start_this_gp() to be invoked on the same CPU during this "if" statement.
> > >
> > > Of course, there could be bugs. For example:
> > >
> > > o __note_gp_changes() might be called on a different CPU than that
> > > corresponding to rdp. You can check this with something like:
> > >
> > > WARN_ON_ONCE(rdp->cpu != smp_processor_id());
> > >
> > > o The same things could happen with rcu_start_this_gp(), and the
> > > above WARN_ON_ONCE() would work there as well.
> > >
> > > o rcutree_prepare_cpu() is a special case, but is irrelevant unless
> > > you are doing CPU-hotplug operations. (It can run on a CPU other
> > > than rdp->cpu, but only at times when rdp->cpu is offline.)
> > >
> > > o Interrupts might not really be disabled.
> > >
> > > That said, your patch could reduce overhead slightly, given that the two values will be equal much of the time. So it might be worth testing just for that reason.
> > >
> > > So why not just test it anyway? If it makes the bug go away, I will
> > > be surprised, but it would not be the first surprise for me. ;-)
> > >
> > > Thanx, Paul
> > >
> > > > Thanks!
> > > >
> > > >
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index
> > > > 0b760c1..c00f34e 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -1849,7 +1849,7 @@ static bool __note_gp_changes(struct rcu_state *rsp, struct rcu_node *rnp,
> > > > zero_cpu_stall_ticks(rdp);
> > > > }
> > > > rdp->gp_seq = rnp->gp_seq; /* Remember new grace-period state. */
> > > > -	if (ULONG_CMP_GE(rnp->gp_seq_needed, rdp->gp_seq_needed) || rdp->gpwrap)
> > > > +	if (ULONG_CMP_LT(rdp->gp_seq_needed, rnp->gp_seq_needed) || rdp->gpwrap)
> > > >  		rdp->gp_seq_needed = rnp->gp_seq_needed;
> > > > WRITE_ONCE(rdp->gpwrap, false);
> > > > rcu_gpnum_ovf(rnp, rdp);
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Paul E. McKenney [mailto:paulmck@xxxxxxxxxxxxx]
> > > > Sent: Thursday, December 13, 2018 08:12
> > > > To: He, Bo <bo.he@xxxxxxxxx>
> > > > Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>;
> > > > linux-kernel@xxxxxxxxxxxxxxx; josh@xxxxxxxxxxxxxxxx;
> > > > mathieu.desnoyers@xxxxxxxxxxxx; jiangshanlai@xxxxxxxxx; Zhang, Jun
> > > > <jun.zhang@xxxxxxxxx>; Xiao, Jin <jin.xiao@xxxxxxxxx>; Zhang,
> > > > Yanmin <yanmin.zhang@xxxxxxxxx>; Bai, Jie A <jie.a.bai@xxxxxxxxx>;
> > > > Sun, Yi J <yi.j.sun@xxxxxxxxx>
> > > > Subject: Re: rcu_preempt caused oom
> > > >
> > > > On Wed, Dec 12, 2018 at 11:13:22PM +0000, He, Bo wrote:
> > > > > I don't see the rcutree.sysrq_rcu parameter in the v4.19 kernel; I also checked the latest kernel and the latest tag v4.20-rc6, and do not see sysrq_rcu.
> > > > > Please correct me if I have something wrong.
> > > >
> > > > That would be because I sent you the wrong patch, apologies! :-/
> > > >
> > > > Please instead see the one below, which does add sysrq_rcu.
> > > >
> > > > Thanx, Paul
> > > >
> > > > > -----Original Message-----
> > > > > From: Paul E. McKenney <paulmck@xxxxxxxxxxxxx>
> > > > > Sent: Thursday, December 13, 2018 5:03 AM
> > > > > To: He, Bo <bo.he@xxxxxxxxx>
> > > > > Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>;
> > > > > linux-kernel@xxxxxxxxxxxxxxx; josh@xxxxxxxxxxxxxxxx;
> > > > > mathieu.desnoyers@xxxxxxxxxxxx; jiangshanlai@xxxxxxxxx; Zhang,
> > > > > Jun <jun.zhang@xxxxxxxxx>; Xiao, Jin <jin.xiao@xxxxxxxxx>;
> > > > > Zhang, Yanmin <yanmin.zhang@xxxxxxxxx>; Bai, Jie A
> > > > > <jie.a.bai@xxxxxxxxx>
> > > > > Subject: Re: rcu_preempt caused oom
> > > > >
> > > > > On Wed, Dec 12, 2018 at 07:42:24AM -0800, Paul E. McKenney wrote:
> > > > > > On Wed, Dec 12, 2018 at 01:21:33PM +0000, He, Bo wrote:
> > > > > > > we reproduced it on two boards, but I still do not see the show_rcu_gp_kthreads() dump logs; it seems the patch can't catch the scenario.
> > > > > > > I double-confirmed that CONFIG_PROVE_RCU=y is enabled in the config, as extracted from /proc/config.gz.
> > > > > >
> > > > > > Strange.
> > > > > >
> > > > > > Are the systems responsive to sysrq keys once failure occurs?
> > > > > > If so, I will provide you a sysrq-R or some such to dump out the RCU state.
> > > > >
> > > > > Or, as it turns out, sysrq-y if booting with rcutree.sysrq_rcu=1 using the patch below. Only lightly tested.
> > > >
> > > > ------------------------------------------------------------------------
> > > >
> > > > commit 04b6245c8458e8725f4169e62912c1fadfdf8141
> > > > Author: Paul E. McKenney <paulmck@xxxxxxxxxxxxx>
> > > > Date: Wed Dec 12 16:10:09 2018 -0800
> > > >
> > > > rcu: Add sysrq rcu_node-dump capability
> > > >
> > > > Backported from v4.21/v5.0
> > > >
> > > > Life is hard if RCU manages to get stuck without triggering RCU CPU
> > > > stall warnings or triggering the rcu_check_gp_start_stall() checks
> > > > for failing to start a grace period. This commit therefore adds a
> > > > boot-time-selectable sysrq key (commandeering "y") that allows manually
> > > > dumping Tree RCU state. The new rcutree.sysrq_rcu kernel boot parameter
> > > > must be set for this sysrq to be available.
> > > >
> > > > Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxx>
> > > >
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index
> > > > 0b760c1369f7..e9392a9d6291 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -61,6 +61,7 @@
> > > > #include <linux/trace_events.h>
> > > > #include <linux/suspend.h>
> > > > #include <linux/ftrace.h>
> > > > +#include <linux/sysrq.h>
> > > >
> > > > #include "tree.h"
> > > > #include "rcu.h"
> > > > @@ -128,6 +129,9 @@ int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
> > > >  int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> > > >  /* panic() on RCU Stall sysctl. */
> > > >  int sysctl_panic_on_rcu_stall __read_mostly;
> > > > +/* Commandeer a sysrq key to dump RCU's tree. */
> > > > +static bool sysrq_rcu;
> > > > +module_param(sysrq_rcu, bool, 0444);
> > > >
> > > >  /*
> > > >   * The rcu_scheduler_active variable is initialized to the value
> > > > @@ -662,6 +666,27 @@ void show_rcu_gp_kthreads(void)
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(show_rcu_gp_kthreads);
> > > >
> > > > +/* Dump grace-period-request information due to commandeered sysrq. */
> > > > +static void sysrq_show_rcu(int key)
> > > > +{
> > > > +	show_rcu_gp_kthreads();
> > > > +}
> > > > +
> > > > +static struct sysrq_key_op sysrq_rcudump_op = {
> > > > +	.handler = sysrq_show_rcu,
> > > > +	.help_msg = "show-rcu(y)",
> > > > +	.action_msg = "Show RCU tree",
> > > > +	.enable_mask = SYSRQ_ENABLE_DUMP,
> > > > +};
> > > > +
> > > > +static int __init rcu_sysrq_init(void)
> > > > +{
> > > > +	if (sysrq_rcu)
> > > > +		return register_sysrq_key('y', &sysrq_rcudump_op);
> > > > +	return 0;
> > > > +}
> > > > +early_initcall(rcu_sysrq_init);
> > > > +
> > > >  /*
> > > >   * Send along grace-period-related data for rcutorture diagnostics.
> > > >   */
> > > >
> > >
> >
>
>