[RFC PATCH 14/16] irq: Add support for core-wide protection of IRQ and softirq

From: Vineeth Remanan Pillai
Date: Tue Jun 30 2020 - 17:34:02 EST


From: "Joel Fernandes (Google)" <joel@xxxxxxxxxxxxxxxxx>

With current core scheduling patchset, non-threaded IRQ and softirq
victims can leak data from its hyperthread to a sibling hyperthread
running an attacker.

For MDS, it is possible for the IRQ and softirq handlers to leak data to
either host or guest attackers. For L1TF, it is possible to leak to
guest attackers. There is no possible mitigation involving flushing of
buffers to avoid this since the execution of attacker and victims happen
concurrently on 2 or more HTs.

The solution in this patch is to monitor the outer-most core-wide
irq_enter() and irq_exit() executed by any sibling. In between these
two, we mark the core to be in a special core-wide IRQ state.

In the IRQ entry, if we detect that the sibling is running untrusted
code, we send a reschedule IPI so that the sibling transitions through
the sibling's irq_exit() to do any waiting there, till the IRQ being
protected finishes.

We also monitor the per-CPU outer-most irq_exit(). If during the per-cpu
outer-most irq_exit(), the core is still in the special core-wide IRQ
state, we perform a busy-wait till the core exits this state. This
combination of per-cpu and core-wide IRQ states helps to handle any
combination of irq_entry()s and irq_exit()s happening on all of the
siblings of the core in any order.

Lastly, we also check in the schedule loop if we are about to schedule
an untrusted process while the core is in such a state. This is possible
if a trusted thread enters the scheduler by way of yielding CPU. This
would involve no transitions through the irq_exit() point to do any
waiting, so we have to explicitly do the waiting there.

Every attempt is made to prevent a busy-wait unnecessarily, and in
testing on real-world ChromeOS usecases, it has not shown a performance
drop. In ChromeOS, with this and the rest of the core scheduling
patchset, we see around a 300% improvement in key press latencies into
Google docs when Camera streaming is running simulatenously (90th
percentile latency of ~150ms drops to ~50ms).

This fetaure is controlled by the build time config option
CONFIG_SCHED_CORE_IRQ_PAUSE and is enabled by default. There is also a
kernel boot parameter 'sched_core_irq_pause' to enable/disable the
feature at boot time. Default is enabled at boot time.

Cc: Julien Desfossez <jdesfossez@xxxxxxxxxxxxxxxx>
Cc: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
Cc: Aaron Lu <aaron.lwe@xxxxxxxxx>
Cc: Aubrey Li <aubrey.li@xxxxxxxxxxxxxxx>
Cc: Tim Chen <tim.c.chen@xxxxxxxxx>
Cc: Paul E. McKenney <paulmck@xxxxxxxxxx>
Co-developed-by: Vineeth Pillai <vpillai@xxxxxxxxxxxxxxxx>
Signed-off-by: Vineeth Pillai <vpillai@xxxxxxxxxxxxxxxx>
Signed-off-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
---
.../admin-guide/kernel-parameters.txt | 9 +
include/linux/sched.h | 5 +
kernel/Kconfig.preempt | 13 ++
kernel/sched/core.c | 161 ++++++++++++++++++
kernel/sched/sched.h | 7 +
kernel/softirq.c | 46 +++++
6 files changed, 241 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 5e2ce88d6eda..d44d7a997610 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4445,6 +4445,15 @@

sbni= [NET] Granch SBNI12 leased line adapter

+ sched_core_irq_pause=
+ [SCHED_CORE, SCHED_CORE_IRQ_PAUSE] Pause SMT siblings
+ of a core if atleast one of the siblings of the core
+ is running nmi/irq/softirq. This is to guarentee that
+ kernel data is not leaked to tasks which are not trusted
+ by the kernel.
+ This feature is valid only when Core scheduling is
+ enabled(SCHED_CORE).
+
sched_debug [KNL] Enables verbose scheduler debug messages.

schedstats= [KNL,X86] Enable or disable scheduled statistics.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4f9edf013df3..097746a9f260 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2025,4 +2025,9 @@ int sched_trace_rq_cpu(struct rq *rq);

const struct cpumask *sched_trace_rd_span(struct root_domain *rd);

+#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
+void sched_core_irq_enter(void);
+void sched_core_irq_exit(void);
+#endif
+
#endif
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 4488fbf4d3a8..59094a66a987 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -86,3 +86,16 @@ config SCHED_CORE
default y
depends on SCHED_SMT

+config SCHED_CORE_IRQ_PAUSE
+ bool "Pause siblings on entering irq/softirq during core-scheduling"
+ default y
+ depends on SCHED_CORE
+ help
+ This option enables pausing all SMT siblings of a core when atleast
+ one of the siblings in the core is in nmi/irq/softirq. This is to
+ enforce security such that information from kernel is not leaked to
+ non-trusted tasks running on siblings. This option is valid only if
+ Core Scheduling(CONFIG_SCHED_CORE) is enabled.
+
+ If in doubt, select 'Y' when CONFIG_SCHED_CORE=y
+
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ede86fb37b4e..2ec56970d6bb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4252,6 +4252,155 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
return a->core_cookie == b->core_cookie;
}

+#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
+/*
+ * Helper function to pause the caller's hyperthread until the core exits the
+ * core-wide IRQ state. Obviously the CPU calling this function should not be
+ * responsible for the core being in the core-wide IRQ state otherwise it will
+ * deadlock. This function should be called from irq_exit() and from schedule().
+ * It is upto the callers to decide if calling here is necessary.
+ */
+static inline void sched_core_sibling_irq_pause(struct rq *rq)
+{
+ /*
+ * Wait till the core of this HT is not in a core-wide IRQ state.
+ *
+ * Pair with smp_store_release() in sched_core_irq_exit().
+ */
+ while (smp_load_acquire(&rq->core->core_irq_nest) > 0)
+ cpu_relax();
+}
+
+/*
+ * Enter the core-wide IRQ state. Sibling will be paused if it is running
+ * 'untrusted' code, until sched_core_irq_exit() is called. Every attempt to
+ * avoid sending useless IPIs is made. Must be called only from hard IRQ
+ * context.
+ */
+void sched_core_irq_enter(void)
+{
+ int i, cpu = smp_processor_id();
+ struct rq *rq = cpu_rq(cpu);
+ const struct cpumask *smt_mask;
+
+ if (!sched_core_enabled(rq))
+ return;
+
+ /* Count irq_enter() calls received without irq_exit() on this CPU. */
+ rq->core_this_irq_nest++;
+
+ /* If not outermost irq_enter(), do nothing. */
+ if (WARN_ON_ONCE(rq->core->core_this_irq_nest == UINT_MAX) ||
+ rq->core_this_irq_nest != 1)
+ return;
+
+ raw_spin_lock(rq_lockp(rq));
+ smt_mask = cpu_smt_mask(cpu);
+
+ /* Contribute this CPU's irq_enter() to core-wide irq_enter() count. */
+ WRITE_ONCE(rq->core->core_irq_nest, rq->core->core_irq_nest + 1);
+ if (WARN_ON_ONCE(rq->core->core_irq_nest == UINT_MAX))
+ goto unlock;
+
+ if (rq->core_pause_pending) {
+ /*
+ * Do nothing more since we are in a 'reschedule IPI' sent from
+ * another sibling. That sibling would have sent IPIs to all of
+ * the HTs.
+ */
+ goto unlock;
+ }
+
+ /*
+ * If we are not the first ones on the core to enter core-wide IRQ
+ * state, do nothing.
+ */
+ if (rq->core->core_irq_nest > 1)
+ goto unlock;
+
+ /* Do nothing more if the core is not tagged. */
+ if (!rq->core->core_cookie)
+ goto unlock;
+
+ for_each_cpu(i, smt_mask) {
+ struct rq *srq = cpu_rq(i);
+
+ if (i == cpu || cpu_is_offline(i))
+ continue;
+
+ if (!srq->curr->mm || is_idle_task(srq->curr))
+ continue;
+
+ /* Skip if HT is not running a tagged task. */
+ if (!srq->curr->core_cookie && !srq->core_pick)
+ continue;
+
+ /* IPI only if previous IPI was not pending. */
+ if (!srq->core_pause_pending) {
+ srq->core_pause_pending = 1;
+ smp_send_reschedule(i);
+ }
+ }
+unlock:
+ raw_spin_unlock(rq_lockp(rq));
+}
+
+/*
+ * Process any work need for either exiting the core-wide IRQ state, or for
+ * waiting on this hyperthread if the core is still in this state.
+ */
+void sched_core_irq_exit(void)
+{
+ int cpu = smp_processor_id();
+ struct rq *rq = cpu_rq(cpu);
+ bool wait_here = false;
+ unsigned int nest;
+
+ /* Do nothing if core-sched disabled. */
+ if (!sched_core_enabled(rq))
+ return;
+
+ rq->core_this_irq_nest--;
+
+ /* If not outermost on this CPU, do nothing. */
+ if (WARN_ON_ONCE(rq->core_this_irq_nest == UINT_MAX) ||
+ rq->core_this_irq_nest > 0)
+ return;
+
+ raw_spin_lock(rq_lockp(rq));
+ /*
+ * Core-wide nesting counter can never be 0 because we are
+ * still in it on this CPU.
+ */
+ nest = rq->core->core_irq_nest;
+ WARN_ON_ONCE(!nest);
+
+ /*
+ * If we still have other CPUs in IRQs, we have to wait for them.
+ * Either here, or in the scheduler.
+ */
+ if (rq->core->core_cookie && nest > 1) {
+ /*
+ * If we are entering the scheduler anyway, we can just wait
+ * there for ->core_irq_nest to reach 0. If not, just wait here.
+ */
+ if (!tif_need_resched()) {
+ wait_here = true;
+ }
+ }
+
+ if (rq->core_pause_pending)
+ rq->core_pause_pending = 0;
+
+ /* Pair with smp_load_acquire() in sched_core_sibling_irq_pause(). */
+ smp_store_release(&rq->core->core_irq_nest, nest - 1);
+ raw_spin_unlock(rq_lockp(rq));
+
+ if (wait_here)
+ sched_core_sibling_irq_pause(rq);
+}
+#endif /* CONFIG_SCHED_CORE_IRQ_PAUSE */
+
// XXX fairness/fwd progress conditions
/*
* Returns
@@ -4732,6 +4881,18 @@ static void __sched notrace __schedule(bool preempt)
rq_unlock_irq(rq, &rf);
}

+#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
+ /*
+ * If a CPU that was running a trusted task entered the scheduler, and
+ * the next task is untrusted, then check if waiting for core-wide IRQ
+ * state to cease is needed since we would not have been able to get
+ * the services of irq_exit() to do that waiting.
+ */
+ if (sched_core_enabled(rq) &&
+ !is_idle_task(next) && next->mm && next->core_cookie)
+ sched_core_sibling_irq_pause(rq);
+#endif
+
balance_callback(rq);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a5450886c4e4..6445943d3215 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1041,11 +1041,18 @@ struct rq {
unsigned int core_sched_seq;
struct rb_root core_tree;
unsigned char core_forceidle;
+ unsigned char core_pause_pending;
+#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
+ unsigned int core_this_irq_nest;
+#endif

/* shared state */
unsigned int core_task_seq;
unsigned int core_pick_seq;
unsigned long core_cookie;
+#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
+ unsigned int core_irq_nest;
+#endif
#endif
};

diff --git a/kernel/softirq.c b/kernel/softirq.c
index a47c6dd57452..0745f1c6b352 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -246,6 +246,24 @@ static inline bool lockdep_softirq_start(void) { return false; }
static inline void lockdep_softirq_end(bool in_hardirq) { }
#endif

+#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
+DEFINE_STATIC_KEY_TRUE(sched_core_irq_pause);
+static int __init set_sched_core_irq_pause(char *str)
+{
+ unsigned long val = 0;
+ if (!str)
+ return 0;
+
+ val = simple_strtoul(str, &str, 0);
+
+ if (val == 0)
+ static_branch_disable(&sched_core_irq_pause);
+
+ return 1;
+}
+__setup("sched_core_irq_pause=", set_sched_core_irq_pause);
+#endif
+
asmlinkage __visible void __softirq_entry __do_softirq(void)
{
unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
@@ -273,6 +291,16 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
/* Reset the pending bitmask before enabling irqs */
set_softirq_pending(0);

+#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
+ /*
+ * Core scheduling mitigations require entry into softirq to send stall
+ * IPIs to sibling hyperthreads if needed (ex, sibling is running
+ * untrusted task). If we are here from irq_exit(), no IPIs are sent.
+ */
+ if (static_branch_likely(&sched_core_irq_pause))
+ sched_core_irq_enter();
+#endif
+
local_irq_enable();

h = softirq_vec;
@@ -305,6 +333,12 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
rcu_softirq_qs();
local_irq_disable();

+#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
+ /* Inform the scheduler about exit from softirq. */
+ if (static_branch_likely(&sched_core_irq_pause))
+ sched_core_irq_exit();
+#endif
+
pending = local_softirq_pending();
if (pending) {
if (time_before(jiffies, end) && !need_resched() &&
@@ -345,6 +379,12 @@ asmlinkage __visible void do_softirq(void)
void irq_enter(void)
{
rcu_irq_enter();
+
+#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
+ if (static_branch_likely(&sched_core_irq_pause))
+ sched_core_irq_enter();
+#endif
+
if (is_idle_task(current) && !in_interrupt()) {
/*
* Prevent raise_softirq from needlessly waking up ksoftirqd
@@ -413,6 +453,12 @@ void irq_exit(void)
invoke_softirq();

tick_irq_exit();
+
+#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
+ if (static_branch_likely(&sched_core_irq_pause))
+ sched_core_irq_exit();
+#endif
+
rcu_irq_exit();
/* must be last! */
lockdep_hardirq_exit();
--
2.17.1