Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality

From: Chris Metcalf
Date: Mon Sep 28 2015 - 17:57:48 EST


On 09/28/2015 04:59 PM, Andy Lutomirski wrote:
On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@xxxxxxxxxx> wrote:
In prepare_exit_to_usermode(), we would like to call
task_isolation_enter() on every return to userspace, and like
other work items, we would like to recheck for more work after
calling it, since it will enable interrupts internally.

However, if task_isolation_enter() is the only work item,
and it has already been called once, we don't want to continue
calling it in a loop. We don't have a dedicated TIF flag for
task isolation, and it wouldn't make sense to have one, since
we'd want to set it before starting exit every time, and then
clear it the first time around the loop.

Instead, we change the loop structure somewhat, so that we
have a more inclusive set of flags that are tested for on the
first entry to the function (including TIF_NOHZ), and if any
of those flags are set, we enter the loop. And, we do the
task_isolation() test unconditionally at the bottom of the loop,
but then when making the decision to loop back, we just use the
set of flags that doesn't include TIF_NOHZ. That way we only
loop if there is other work to do, but then if that work
is done, we again unconditionally call task_isolation_enter().

In syscall_trace_enter_phase1(), we try to add the necessary
support for strict-mode detection of syscalls in an optimized
way, by letting the code remain unchanged if we are not using
TASK_ISOLATION, but otherwise calling enter_from_user_mode()
under the first time we see _TIF_NOHZ, and then waiting until
after we do the secure computing work to actually clear the bit
from the "work" variable and call task_isolation_syscall().

Signed-off-by: Chris Metcalf <cmetcalf@xxxxxxxxxx>
---
arch/x86/entry/common.c | 47 ++++++++++++++++++++++++++++++++++++-----------
1 file changed, 36 insertions(+), 11 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 80dcc9261ca3..0f74389c6f3b 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -21,6 +21,7 @@
#include <linux/context_tracking.h>
#include <linux/user-return-notifier.h>
#include <linux/uprobes.h>
+#include <linux/isolation.h>

#include <asm/desc.h>
#include <asm/traps.h>
@@ -81,7 +82,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
*/
if (work & _TIF_NOHZ) {
enter_from_user_mode();
- work &= ~_TIF_NOHZ;
+ if (!IS_ENABLED(CONFIG_TASK_ISOLATION))
+ work &= ~_TIF_NOHZ;
}
#endif

@@ -131,6 +133,13 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
}
#endif

+ /* Now check task isolation, if needed. */
+ if (IS_ENABLED(CONFIG_TASK_ISOLATION) && (work & _TIF_NOHZ)) {
+ work &= ~_TIF_NOHZ;
+ if (task_isolation_strict())
+ task_isolation_syscall(regs->orig_ax);
+ }
+
This is IMO rather nasty. Can you try to find a way to do this
without making the control flow depend on config options?

Well, I suppose this is the best argument for testing for task
isolation before seccomp :-)

Honestly, if not, it's tricky to see how to do better; I did spend
some time looking at it. One possibility is to just unconditionally
clear _TIF_NOHZ before testing "work == 0", so that we can
test (work & TIF_NOHZ) once early and once after seccomp.
This presumably costs a cycle in the no-nohz-full case.

So maybe just do it before seccomp...

What guarantees that TIF_NOHZ is an acceptable thing to check?

Well, TIF_NOHZ is set on all tasks whenever we are running with
nohz_full enabled anywhere, so testing it lets us do stuff on
the fastpath without slowing down the fastpath much.
See context_tracking_cpu_set().

/* Do our best to finish without phase 2. */
if (work == 0)
return ret; /* seccomp and/or nohz only (ret == 0 here) */
@@ -217,10 +226,26 @@ static struct thread_info *pt_regs_to_thread_info(struct pt_regs *regs)
/* Called with IRQs disabled. */
__visible void prepare_exit_to_usermode(struct pt_regs *regs)
{
+ u32 cached_flags;
+
if (WARN_ON(!irqs_disabled()))
local_irq_disable();

/*
+ * We may want to enter the loop here unconditionally to make
+ * sure to do some work at least once. Test here for all
+ * possible conditions that might make us enter the loop,
+ * and return immediately if none of them are set.
+ */
+ cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
+ if (!(cached_flags & (TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
+ _TIF_UPROBE | _TIF_NEED_RESCHED |
+ _TIF_USER_RETURN_NOTIFY | _TIF_NOHZ))) {
+ user_enter();
+ return;
+ }
+
Too complicated and too error prone.

In any event, I don't think that the property you actually want is for
the loop to be entered once. I think the property you want is that
we're isolated by the time we're finished. Why not just check that
directly in the loop condition?

So something like this (roughly):

if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
_TIF_UPROBE | _TIF_NEED_RESCHED |
_TIF_USER_RETURN_NOTIFY)) &&
+ task_isolation_done())
break;

i.e. just add the one extra call? That could work, I suppose.
In the body we would then keep the proposed logic that unconditionally
calls task_isolation_enter().

+ /*
* In order to return to user mode, we need to have IRQs off with
* none of _TIF_SIGPENDING, _TIF_NOTIFY_RESUME, _TIF_USER_RETURN_NOTIFY,
* _TIF_UPROBE, or _TIF_NEED_RESCHED set. Several of these flags
@@ -228,15 +253,7 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs)
* so we need to loop. Disabling preemption wouldn't help: doing the
* work to clear some of the flags can sleep.
*/
- while (true) {
- u32 cached_flags =
- READ_ONCE(pt_regs_to_thread_info(regs)->flags);
-
- if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
- _TIF_UPROBE | _TIF_NEED_RESCHED |
- _TIF_USER_RETURN_NOTIFY)))
- break;
-
+ do {
/* We have work to do. */
local_irq_enable();

@@ -258,9 +275,17 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs)
if (cached_flags & _TIF_USER_RETURN_NOTIFY)
fire_user_return_notifiers();

+ if (task_isolation_enabled())
+ task_isolation_enter();
+
Does anything here guarantee forward progress or at least give
reasonable confidence that we'll make forward progress?

A given task can get stuck in the kernel if it has a lengthy far-future
alarm() type situation, or if there are multiple task-isolated tasks
scheduled onto the same core, but that only affects those tasks;
other tasks on the same core, and the system as a whole, are OK.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/