Re: [PATCH v2] sched: Have do_idle() call __schedule() without enabling preemption

From: Paul E. McKenney
Date: Wed Apr 12 2017 - 19:49:11 EST


On Wed, Apr 12, 2017 at 02:27:44PM -0400, Steven Rostedt wrote:
> From: Steven Rostedt (VMware) <rostedt@xxxxxxxxxxx>
>
> I finally got around to creating trampolines for dynamically allocated
> ftrace_ops with using synchronize_rcu_tasks(). For users of the ftrace
> function hook callbacks, like perf, that allocate the ftrace_ops
> descriptor via kmalloc() and friends, ftrace was not able to optimize
> the functions being traced to use a trampoline because they would also
> need to be allocated dynamically. The problem is that they cannot be
> freed when CONFIG_PREEMPT is set, as there's no way to tell if a task
> was preempted on the trampoline. That was before Paul McKenney
> implemented synchronize_rcu_tasks() that would make sure all tasks
> (except idle) have scheduled out or have entered user space.
>
> While testing this, I triggered this bug:
>
> BUG: unable to handle kernel paging request at ffffffffa0230077
> IP: 0xffffffffa0230077
> PGD 2414067
> PUD 2415063
> PMD c463c067
> PTE 0
>
> Oops: 0010 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
> Modules linked in: ip6table_filter ip6_tables snd_hda_codec_hdmi
> snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel
> snd_hda_codec snd_hwdep snd_hda_core x86_pkg_temp_thermal kvm_intel
> i915 snd_seq kvm snd_seq_device snd_pcm i2c_algo_bit snd_timer
> drm_kms_helper irqbypass syscopyarea sysfillrect sysimgblt fb_sys_fops
> drm i2c_i801 snd soundcore wmi i2c_core video e1000e ptp pps_core CPU:
> 2 PID: 0 Comm: swapper/2 Not tainted 4.11.0-rc3-test+ #356 Hardware
> name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05
> 05/07/2012 task: ffff8800cebbd0c0 task.stack: ffff8800cebd8000 RIP:
> 0010:0xffffffffa0230077 RSP: 0018:ffff8800cebdfd80 EFLAGS: 00010286
> RAX: 0000000000000000 RBX: ffff8800cebbd0c0 RCX: ffffffff8121391e RDX:
> dffffc0000000000 RSI: ffffffff81ce7c78 RDI: ffff8800cebbf298 RBP:
> ffff8800cebdfe28 R08: 0000000000000003 R09: 0000000000000000 R10:
> 0000000000000000 R11: 0000000000000000 R12: ffff8800cebbd0c0 R13:
> ffffffff828fbe90 R14: ffff8800d392c480 R15: ffffffff826cc380 FS:
> 0000000000000000(0000) GS:ffff8800d3900000(0000)
> knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0:
> 0000000080050033 CR2: ffffffffa0230077 CR3: 0000000002413000 CR4:
> 00000000001406e0 Call Trace: ? debug_smp_processor_id+0x17/0x20 ?
> sched_ttwu_pending+0x79/0x190 ? schedule+0x5/0xe0 ?
> trace_hardirqs_on_caller+0x182/0x280 schedule+0x5/0xe0
> schedule_preempt_disabled+0x18/0x30 ? schedule+0x5/0xe0
> ? schedule_preempt_disabled+0x18/0x30
> do_idle+0x172/0x220
>
> What happened was that the idle task was preempted on the trampoline.
> As synchronize_rcu_tasks() ignores the idle thread, there's nothing
> that lets ftrace know that the idle task was preempted on a trampoline.
>
> The idle task shouldn't need to ever enable preemption. The idle task
> is simply a loop that calls schedule or places the cpu into idle mode.
> In fact, having preemption enabled is inefficient, because it can
> happen when idle is just about to call schedule anyway, which would
> cause schedule to be called twice. Once for when the interrupt came in
> and was returning back to normal context, and then again in the normal
> path that the idle loop is running in, which would be pointless, as it
> had already scheduled.
>
> Adding a new function local to kernel/sched/ that allows idle to call
> the scheduler without enabling preemption, fixes the
> synchronize_rcu_tasks) issue, as well as removes the pointless spurious
> scheduled caused by interrupts happening in the brief window where
> preemption is enabled just before it calls schedule.
>
> Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>
> Signed-off-by: Steven Rostedt (VMware) <rostedt@xxxxxxxxxxx>

I don't understand why schedule_idle() needs a loop given that
schedule_preempt_disabled() doesn't have one, but other than
that, for whatever it is worth, it looks good to me.

Thanx, Paul

> ---
> Changes from v1:
>
> o Added Thomas to Cc as he wrote the original generic idle code
>
> o Added comment to why schedule_idle() exits
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3b31fc0..0788286 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3502,6 +3502,23 @@ asmlinkage __visible void __sched schedule(void)
> }
> EXPORT_SYMBOL(schedule);
>
> +/*
> + * synchronize_rcu_tasks() makes sure that no task is stuck in preempted
> + * state (have scheduled out non-voluntarily) by making sure that all
> + * tasks have either left the run queue or have gone into user space.
> + * As idle tasks do not do either, they must not ever be preempted
> + * (schedule out non-voluntarily).
> + *
> + * schedule_idle() is similar to schedule_preempt_disable() except
> + * that it never enables preemption.
> + */
> +void __sched schedule_idle(void)
> +{
> + do {
> + __schedule(false);
> + } while (need_resched());
> +}
> +
> #ifdef CONFIG_CONTEXT_TRACKING
> asmlinkage __visible void __sched schedule_user(void)
> {
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index ac6d517..229c17e 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -264,7 +264,7 @@ static void do_idle(void)
> smp_mb__after_atomic();
>
> sched_ttwu_pending();
> - schedule_preempt_disabled();
> + schedule_idle();
> }
>
> bool cpu_in_idle(unsigned long pc)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 5cbf922..c5ee02b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1465,6 +1465,8 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
> }
> #endif
>
> +extern void schedule_idle(void);
> +
> extern void sysrq_sched_debug_show(void);
> extern void sched_init_granularity(void);
> extern void update_max_interval(void);
>