Re: [PATCH v10 09/12] arch/x86: enable task isolation functionality

From: Chris Metcalf
Date: Mon Mar 07 2016 - 15:51:55 EST


On 03/03/2016 06:46 PM, Andy Lutomirski wrote:
On Thu, Mar 3, 2016 at 11:52 AM, Chris Metcalf <cmetcalf@xxxxxxxxxxxx> wrote:
On 03/02/2016 07:36 PM, Andy Lutomirski wrote:
On Mar 2, 2016 12:10 PM, "Chris Metcalf" <cmetcalf@xxxxxxxxxx> wrote:
In prepare_exit_to_usermode(), call task_isolation_ready()
when we are checking the thread-info flags, and after we've handled
the other work, call task_isolation_enter() unconditionally.

In syscall_trace_enter_phase1(), we add the necessary support for
strict-mode detection of syscalls.
[...]
@@ -91,6 +92,10 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
          */
         if (work & _TIF_NOHZ) {
                 enter_from_user_mode();
+                if (task_isolation_check_syscall(regs->orig_ax)) {
+                        regs->orig_ax = -1;
+                        return 0;
+                }

This needs a comment indicating the intended semantics.
And I've still heard no explanation of why this part can't use seccomp.

Here's an excerpt from my earlier reply to you from:

https://lkml.kernel.org/r/55AE9EAC.4010202@xxxxxxxxxx

Admittedly this patch series has been moving very slowly through
review, so it's not surprising we have to revisit some things!

On 07/21/2015 03:34 PM, Chris Metcalf wrote:
On 07/13/2015 05:47 PM, Andy Lutomirski wrote:
If a user wants a syscall to kill them, use
seccomp. The kernel isn't at fault if the user does a syscall when it
didn't want to enter the kernel.

Interesting! I didn't realize how close SECCOMP_SET_MODE_STRICT
was to what I wanted here. One concern is that there doesn't seem
to be a way to "escape" from seccomp strict mode, i.e. you can't
call seccomp() again to turn it off - which makes sense for seccomp
since it's a security issue, but not so much sense with cpu_isolated.

So, do you think there's a good role for the seccomp() API to play
in achieving this goal? It's certainly not a question of "the kernel at
fault" but rather "asking the kernel to help catch user mistakes"
(typically third-party libraries in our customers' experience). You
could imagine a SECCOMP_SET_MODE_ISOLATED or something.

Alternatively, we could stick with the API proposed in my patch
series, or something similar, and just try to piggy-back on the seccomp
internals to make it happen. It would require Kconfig to ensure
that SECCOMP was enabled though, which obviously isn't currently
required to do cpu isolation.

On looking at this again just now, one thing that strikes me is that
it may not be necessary to forbid the syscall like seccomp does.
It may be sufficient just to trigger the task isolation strict signal
and then allow the syscall to complete. After all, we don't "fail"
any of the other things that upset strict mode, like page faults; we
let them complete, but add a signal. So for consistency, I think it
may in fact make sense to simply trigger the signal but let the
syscall do its thing. After all, perhaps the signal is handled
and logged and we don't mind having the application continue; the
signal handler can certainly choose to fail hard, or in the usual
case of no signal handler, that kills the task just fine too.
Allowing the syscall to complete is really kind of incidental.

No, don't do that. First, if you have a signal pending, a lot of
syscalls will abort with -EINTR. Second, if you fire a signal on
entry via sigreturn, you're not going to like the results.

OK, you've convinced me to stick with the previous model of just
forbidding the syscall in this case.
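
For reference, the "forbid" model boils down to something like the
userspace toy below. This is only a sketch of the control flow in the
hunk quoted above, not kernel code: struct toy_regs, toy_check_syscall()
and toy_syscall_entry() are made-up names, the "every syscall is a
violation in strict mode" policy is an assumption, and the signal is
modelled as a plain flag.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the one pt_regs field the hunk touches. */
struct toy_regs {
	long orig_ax;	/* syscall number as seen on entry */
};

/* Assumed policy: in strict mode, any syscall counts as a violation. */
static int toy_check_syscall(int strict_mode, long nr)
{
	(void)nr;	/* a real policy might exempt some syscall numbers */
	return strict_mode != 0;
}

/*
 * Model of the entry path. Returns 1 if the syscall would be dispatched,
 * 0 if it was suppressed. On a violation the signal is flagged and the
 * syscall number is overwritten with -1 so that nothing gets dispatched.
 */
static int toy_syscall_entry(struct toy_regs *regs, int strict_mode,
			     int *signal_flagged)
{
	assert(regs != NULL && signal_flagged != NULL);

	if (toy_check_syscall(strict_mode, regs->orig_ax)) {
		*signal_flagged = 1;	/* stands in for the strict-mode signal */
		regs->orig_ax = -1;	/* forbid: no valid syscall number left */
		return 0;
	}
	return 1;			/* not isolated: run the syscall as usual */
}
```

With strict_mode set, a call leaves orig_ax at -1 and the flag raised;
with it clear, the registers are untouched and the syscall proceeds.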

Let task isolation users who want to detect when they screw up and do
a syscall do it with seccomp.

Can you give me more details on what you're imagining here? Remember
that a key use case is that these applications can remove the syscall
prohibition voluntarily; it's only there to prevent unintended uses
(by third-party libraries or just straight-up programming bugs).
As far as I can tell, seccomp does not allow you to go from "less
permissive" to "more permissive" settings at all, which means that as
it exists, it's not a good solution for this use case.

Or were you thinking about a new seccomp API that allows this?

Or were you thinking that I could just use seccomp internals, i.e.
allow the prctl() to set a special SECCOMP_MODE_TASK_ISOLATION
and handle it appropriately in seccomp_phase1(), maybe? But, not
touch the actual seccomp() API?

I'm happy to spec something out, but I'd definitely benefit from some
sense from you as to what you think is the better approach.
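
To make the "no way back" point about seccomp concrete, here is a small
userspace sketch, assuming Linux with glibc. A forked child enters
SECCOMP_MODE_STRICT through prctl() and then tries to switch seccomp off
again; the parent reports how the child ended. The function name
strict_mode_is_one_way() is made up for this example, and the return
codes are my own convention.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/seccomp.h>

/*
 * Returns 1 if the child was killed with SIGKILL when it tried to leave
 * strict mode, 0 if strict mode could not be entered in this environment,
 * and -1 for any other outcome.
 */
static int strict_mode_is_one_way(void)
{
	pid_t pid = fork();
	int status = 0;

	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: enter strict mode; bail out if it is unavailable. */
		if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0) != 0)
			_exit(1);

		/* Try to relax the policy; prctl is not on the strict-mode list. */
		syscall(SYS_prctl, PR_SET_SECCOMP, SECCOMP_MODE_DISABLED, 0, 0, 0);

		/*
		 * Only reached if the child survived. Use the raw exit syscall,
		 * since glibc's _exit() issues exit_group, which strict mode
		 * does not allow.
		 */
		syscall(SYS_exit, 2);
	}

	/* Parent: classify how the child terminated. */
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	assert(WIFEXITED(status) || WIFSIGNALED(status));

	if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
		return 1;
	if (WIFEXITED(status) && WEXITSTATUS(status) == 1)
		return 0;
	return -1;
}
```

On a kernel with seccomp enabled I would expect this to return 1, which
is exactly the property that makes sense for security but gets in the
way of a voluntary, reversible task-isolation prohibition.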

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com