[RFC][PATCH] sched: Have do_idle() call __schedule() without enabling preemption

From: Steven Rostedt
Date: Tue Apr 11 2017 - 22:44:46 EST


Peter,

I hope Thomas didn't lend you his frozen shark thrower. ;-)

Let me explain the issue we have. A while back, I asked Paul if he
could add a special RCU case, where the quiescent states are a task
going into user land (it's fine while it's there), sleeping (off the
run queue), or voluntarily scheduling. He implemented this for me:
after synchronize_rcu_tasks() returns, every task has either been seen
in a quiescent state (in user space or not on a run queue) or has done
a voluntary schedule (a call to __schedule(false)).
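
To put it concretely, here's a minimal sketch of the pattern that
guarantee enables (the handler pointer and the function names below are
hypothetical, just to illustrate the ordering):

static void (*active_handler)(void);

static void remove_handler(void)
{
	WRITE_ONCE(active_handler, NULL);	/* stop new callers */
	synchronize_rcu_tasks();		/* wait until every task has been
						 * in user space, off a run queue,
						 * or through a voluntary schedule */
	/* it is now safe to free whatever code the old handler lived in */
}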

The one issue is that this ignores the idle task, as the idle task is
never in user space and is always on the run queue. The only thing we
could hope for is a voluntary call to schedule() (that is,
__schedule(false)). Thus synchronize_rcu_tasks() ignores it.

The reason I need this is so that ftrace hooks that are created
dynamically (like what perf does when it uses function tracing) can use
the optimized dynamic trampolines. What that means is: if there's a
single callback registered for a function hook, a trampoline is created
dynamically and the hook (fentry) calls that dynamic trampoline
directly. The dynamic trampoline calls the function callback directly,
without caring about any other callbacks that may be registered to
other functions.
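
As a reference point, registering such a hook looks roughly like this
(the names below are made up; whether ftrace backs the ops with a
dynamic trampoline is an internal decision, not something the caller
requests):

#include <linux/ftrace.h>

/* called at the fentry of every function this ops traces */
static void my_func_tracer(unsigned long ip, unsigned long parent_ip,
			   struct ftrace_ops *ops, struct pt_regs *regs)
{
	/* ... record the hit ... */
}

static struct ftrace_ops my_ops = {
	.func	= my_func_tracer,
};

static int __init my_tracer_init(void)
{
	/* If my_ops ends up as the only callback on the functions it
	 * traces, ftrace can point fentry at a dynamically allocated
	 * trampoline that calls my_func_tracer() directly. */
	return register_ftrace_function(&my_ops);
}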

If perf is doing function tracing and no other code is doing any
function tracing, then all the functions it traces will call this
dynamically allocated trampoline, which calls the perf function tracer
directly. Otherwise they call the default trampoline, which uses a loop
function to iterate over all function callbacks registered with ftrace.

The issue occurs when perf is done tracing. Now we need to free this
dynamically created trampoline. But in order to do that, we need to
make sure no task is still executing on it. One thing ftrace does,
after detaching the hook from the functions and before freeing the
trampoline, is to schedule on every CPU, which makes sure everything is
out of a preempt-disabled section. This is stronger than
synchronize_sched(), because ftrace hooks are placed in locations where
RCU is not active (like going into idle), so synchronize_sched() is not
good enough. But even scheduling on all CPUs is not good enough when
the kernel itself is preemptible. In that case we use
synchronize_rcu_tasks(), which covers the cases where a task has been
preempted. Except for one!
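
A sketch of that teardown ordering (this is not the actual ftrace code;
sync_work() and the trampoline bookkeeping are hypothetical stand-ins):

#include <linux/ftrace.h>
#include <linux/workqueue.h>

static void sync_work(struct work_struct *work)
{
	/* Running here means this CPU has scheduled since the hook was
	 * detached, so it is no longer in a preempt-disabled region that
	 * could still be executing on the trampoline. */
}

static void release_dynamic_trampoline(struct ftrace_ops *ops, void *tramp)
{
	unregister_ftrace_function(ops);	/* detach the hook from fentry */
	schedule_on_each_cpu(sync_work);	/* flush preempt-disabled regions */
	synchronize_rcu_tasks();		/* wait out preempted tasks too */
	/* only now is it safe to free 'tramp' */
}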

This brings us back to synchronize_rcu_tasks() ignoring the idle case.
Here is what I trigger:

BUG: unable to handle kernel paging request at ffffffffa0230077
IP: 0xffffffffa0230077
PGD 2414067
PUD 2415063
PMD c463c067
PTE 0

Oops: 0010 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
Modules linked in: ip6table_filter ip6_tables snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core x86_pkg_temp_thermal kvm_intel i915 snd_seq kvm snd_seq_device snd_pcm i2c_algo_bit snd_timer drm_kms_helper irqbypass syscopyarea sysfillrect sysimgblt fb_sys_fops drm i2c_i801 snd soundcore wmi i2c_core video e1000e ptp pps_core
CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.11.0-rc3-test+ #356
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
task: ffff8800cebbd0c0 task.stack: ffff8800cebd8000
RIP: 0010:0xffffffffa0230077
RSP: 0018:ffff8800cebdfd80 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff8800cebbd0c0 RCX: ffffffff8121391e
RDX: dffffc0000000000 RSI: ffffffff81ce7c78 RDI: ffff8800cebbf298
RBP: ffff8800cebdfe28 R08: 0000000000000003 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800cebbd0c0
R13: ffffffff828fbe90 R14: ffff8800d392c480 R15: ffffffff826cc380
FS: 0000000000000000(0000) GS:ffff8800d3900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffa0230077 CR3: 0000000002413000 CR4: 00000000001406e0
Call Trace:
? debug_smp_processor_id+0x17/0x20
? sched_ttwu_pending+0x79/0x190
? schedule+0x5/0xe0
? trace_hardirqs_on_caller+0x182/0x280
schedule+0x5/0xe0
schedule_preempt_disabled+0x18/0x30
? schedule+0x5/0xe0
? schedule_preempt_disabled+0x18/0x30
do_idle+0x172/0x220

That RIP of ffffffffa0230077 is on the trampoline. The return address
is schedule+0x5, which happens to be the return address of the fentry
call at the start of schedule(). Notice the Comm: swapper/2 (the idle
task).

I can trigger this pretty consistently with a few runs of my test case.
Looking at the code, I don't see any reason for idle to ever enable
preemption. It currently calls schedule_preempt_disabled(), which
enables preemption, calls schedule(), and then disables preemption
again.
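
For reference, schedule_preempt_disabled() at this point in the tree is
roughly:

void __sched schedule_preempt_disabled(void)
{
	sched_preempt_enable_no_resched();
	schedule();
	preempt_disable();
}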

schedule() calls sched_submit_work(), which returns immediately because
of its check for the task being in the running state, and idle is
always running. Then schedule() loops: disable preemption, call
__schedule(false), enable preemption again, and check need_resched().
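
That is, schedule() is roughly:

asmlinkage __visible void __sched schedule(void)
{
	struct task_struct *tsk = current;

	sched_submit_work(tsk);		/* returns right away for idle */
	do {
		preempt_disable();
		__schedule(false);
		sched_preempt_enable_no_resched();
	} while (need_resched());
}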

The above bug happens in schedule_preempt_disabled(): preemption is
enabled and schedule() is called, which is traced by ftrace. An
interrupt then comes in while the idle task is on the ftrace
trampoline, and when the interrupt returns, the idle task is scheduled
out via the interrupt preemption path and not by the call to schedule()
itself. That is inefficient anyway, as idle is going to call schedule()
again regardless.

The trampoline is then freed (because none of the synchronizations were
able to detect that idle was preempted on a trampoline), and when idle
gets to run again, it crashes: it's executing code that no longer
exists.

The solution I'm proposing here is to create a schedule_idle()
function, local to kernel/sched/, that do_idle() calls instead. It does
basically the same thing that schedule() does, except that it never
enables preemption: it simply calls __schedule(false) in a loop that
checks need_resched(). Not only does this solve the bug I'm triggering
when trying to free dynamically created ftrace trampolines, it also
makes the idle code a bit more efficient by avoiding these spurious
schedule calls when an interrupt occurs in the small window where
preemption is enabled.

Signed-off-by: Steven Rostedt (VMware) <rostedt@xxxxxxxxxxx>
---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3b31fc0..cb9ceda 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3502,6 +3502,13 @@ asmlinkage __visible void __sched schedule(void)
}
EXPORT_SYMBOL(schedule);

+void __sched schedule_idle(void)
+{
+	do {
+		__schedule(false);
+	} while (need_resched());
+}
+
#ifdef CONFIG_CONTEXT_TRACKING
asmlinkage __visible void __sched schedule_user(void)
{
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index ac6d517..229c17e 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -264,7 +264,7 @@ static void do_idle(void)
smp_mb__after_atomic();

sched_ttwu_pending();
-	schedule_preempt_disabled();
+	schedule_idle();
}

bool cpu_in_idle(unsigned long pc)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5cbf922..c5ee02b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1465,6 +1465,8 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
}
#endif

+extern void schedule_idle(void);
+
extern void sysrq_sched_debug_show(void);
extern void sched_init_granularity(void);
extern void update_max_interval(void);