Re: [tip: core/rcu] rcu: Enable tick for nohz_full CPUs slow to provide expedited QS

From: Borislav Petkov
Date: Sat Jan 25 2020 - 08:15:12 EST


Hi Paul,

On Sat, Jan 25, 2020 at 10:42:56AM -0000, tip-bot2 for Paul E. McKenney wrote:
> The following commit has been merged into the core/rcu branch of tip:
>
> Commit-ID: df1e849ae4559544ff00ff5052eefe2479750539
> Gitweb: https://git.kernel.org/tip/df1e849ae4559544ff00ff5052eefe2479750539
> Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
> AuthorDate: Wed, 27 Nov 2019 16:36:45 -08:00
> Committer: Paul E. McKenney <paulmck@xxxxxxxxxx>
> CommitterDate: Mon, 09 Dec 2019 12:32:59 -08:00
>
> rcu: Enable tick for nohz_full CPUs slow to provide expedited QS
>
> An expedited grace period can be stalled by a nohz_full CPU looping
> in kernel context. This possibility is currently handled by some
> carefully crafted checks in rcu_read_unlock_special() that enlist help
> from ksoftirqd when permitted by the scheduler. However, it is exactly
> these checks that require the scheduler avoid holding any of its rq or
> pi locks across rcu_read_unlock() without also having held them across
> the entire RCU read-side critical section.
>
> It would therefore be very nice if expedited grace periods could
> handle nohz_full CPUs looping in kernel context without such checks.
> This commit therefore adds code to the expedited grace period's wait
> and cleanup code that forces the scheduler-clock interrupt on for CPUs
> that fail to quickly supply a quiescent state. "Quickly" is currently
> a hard-coded single-jiffy delay.
>
> Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
> ---
> include/linux/tick.h | 5 +++-
> kernel/rcu/tree.h | 1 +-
> kernel/rcu/tree_exp.h | 52 ++++++++++++++++++++++++++++++++++++------
> 3 files changed, 50 insertions(+), 8 deletions(-)

...

> @@ -460,22 +489,31 @@ static void synchronize_rcu_expedited_wait(void)
> unsigned long jiffies_start;
> unsigned long mask;
> int ndetected;
> + struct rcu_data *rdp;
> struct rcu_node *rnp;
> struct rcu_node *rnp_root = rcu_get_root();
> - int ret;
>
> trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("startwait"));
> jiffies_stall = rcu_jiffies_till_stall_check();
> jiffies_start = jiffies;
> + if (IS_ENABLED(CONFIG_NO_HZ_FULL)) {
> + if (synchronize_rcu_expedited_wait_once(1))
> + return;
> + rcu_for_each_leaf_node(rnp) {
> + for_each_leaf_node_cpu_mask(rnp, cpu, rnp->expmask) {
> + rdp = per_cpu_ptr(&rcu_data, cpu);
> + if (rdp->rcu_forced_tick_exp)
> + continue;
> + rdp->rcu_forced_tick_exp = true;
> + tick_dep_set_cpu(cpu, TICK_DEP_BIT_RCU_EXP);
> + }
> + }
> + WARN_ON_ONCE(1);

I'm hitting this on a big AMD box, CONFIG_NO_HZ_FULL=y.

Kernel is Linus master + tip/master from today.

...
[ 23.094781] BTRFS info (device sdb2): disk space caching is enabled
[ 23.094784] BTRFS info (device sdb2): has skinny extents
[ 23.139134] BTRFS info (device sdb2): enabling ssd optimizations
[ 23.395434] ------------[ cut here ]------------
[ 23.402616] WARNING: CPU: 6 PID: 1921 at kernel/rcu/tree_exp.h:511 rcu_exp_wait_wake+0x11f/0x740
[ 23.402616] Modules linked in: btrfs(E) libcrc32c(E) xor(E) hid_generic(E) usbhid(E) raid6_pq(E) sd_mod(E) crc32c_intel(E) ast(E) i2c_algo_bit(E) drm_vram_helper(E) drm_ttm_helper(E) ttm(E) ahci(E) drm_kms_helper(E) syscopyarea(E) libahci(E) sysfillrect(E) xhci_pci(E) sysimgblt(E) fb_sys_fops(E) xhci_hcd(E) libata(E) drm(E) usbcore(E) wmi(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) efivarfs(E)
[ 23.402630] CPU: 6 PID: 1921 Comm: kworker/6:2 Tainted: G E 5.5.0-rc7+ #1
[ 23.402632] Workqueue: rcu_gp wait_rcu_exp_gp
[ 23.402634] RIP: 0010:rcu_exp_wait_wake+0x11f/0x740
[ 23.402636] Code: 3b 45 6c 7e be 48 63 05 8b 30 50 01 49 81 c5 40 02 00 00 48 8d 04 c0 48 c1 e0 06 48 05 80 b2 67 b5 49 39 c5 0f 82 72 ff ff ff <0f> 0b 49 c7 c7 00 fd 02 00 bb 01 00 00 00 4c 89 e7 e8 ab e7 ff ff
[ 23.402637] RSP: 0018:ffffbde38a8f3e50 EFLAGS: 00010246
[ 23.402637] RAX: ffffffffb567d8c0 RBX: ffffffffb567d6c0 RCX: 0000000000000040
[ 23.402638] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffb567d6c0
[ 23.402638] RBP: 0000000000000000 R08: 0000000000000002 R09: 0000000000000001
[ 23.402639] R10: ffffbde3802cfdc0 R11: 0000000000000271 R12: 0000000000003a98
[ 23.402640] R13: ffffffffb567d8c0 R14: 00000000000000f0 R15: 000000000002fd00
[ 23.402640] FS: 0000000000000000(0000) GS:ffffa016cd380000(0000) knlGS:0000000000000000
[ 23.402641] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 23.402641] CR2: 00007fc9957c5180 CR3: 0000800bb3e0a000 CR4: 0000000000340ee0
[ 23.402642] Call Trace:
[ 23.402647] ? sync_rcu_exp_select_cpus+0x219/0x3e0
[ 23.402650] process_one_work+0x20b/0x400
[ 23.402652] worker_thread+0x2d/0x3f0
[ 23.402653] ? process_one_work+0x400/0x400
[ 23.402655] kthread+0x10d/0x130
[ 23.402656] ? kthread_park+0x90/0x90
[ 23.402660] ret_from_fork+0x27/0x50
[ 23.402663] ---[ end trace 85fb288edc35e984 ]---
[ 24.322452] BTRFS info (device sdb2): disk space caching is enabled

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette