Re: Using ftrace/perf as a basis for generic seccomp

From: Eric Paris
Date: Fri Feb 04 2011 - 11:30:28 EST


On Fri, 2011-02-04 at 15:31 +0100, Peter Zijlstra wrote:
> On Thu, 2011-02-03 at 20:50 -0500, Eric Paris wrote:
> > I'm going to try to work on it over
> > the next week or two.
>
> What is your use-case? Going by: http://lwn.net/Articles/332990/ syscall
> based stuff (seccomp) is broken by design.

My personal goal is very different than an LSM. My goal is to reduce
attack surface. I'm not trying to implement an LSM. LSM hooks are
(intentionally) placed in the kernel after object resolution is
complete. In an LSM we don't check an 'open' type operation until after
the pathname has been converted to an inode. We don't check some
'sendto' operations until after the data has been placed into an skb and
is about to be queued to a socket. There is a LOT of code between
syscall_entry() and any given LSM hook.

An obvious vulnerability that I'm sure all the people involved here know
would be the original perf syscall bounds checking vulnerability. If
I'm dealing with an application that I know will never use perf I'd like
a way to be able to completely disable the perf syscall and greatly
reduce the kernel attack surface. It would be almost impossible for an
LSM to hook between the syscall_enter() and the location of that
vulnerability in the perf syscall. In my particular case I'm thinking
about qemu, which never needs to call perf. I want a way to disable all
of the code after syscall_enter() for huge swaths of the kernel.

What we have today, called "seccomp", is a one way toggle,
prctl(PR_SET_SECCOMP, 1), which reduces the available syscalls to
read, write, exit, and sigreturn. Any other syscall results in the
process being immediately killed. It's a great idea to reduce the attack
surface of the kernel but it is too inflexible to be useful. I wonder
if anyone is using it.
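
To make the existing behaviour concrete, here is a minimal sketch of the
one way toggle. The helper name strict_seccomp_kills_getpid() is
hypothetical and only for illustration. SECCOMP_MODE_STRICT is the name
that current <linux/seccomp.h> headers give to the literal 1 used above;
with headers that lack it, the literal should work the same way. A child
process is used so that the caller itself is not locked down.

```c
#define _GNU_SOURCE
#include <linux/seccomp.h>   /* SECCOMP_MODE_STRICT (same value as the literal 1) */
#include <signal.h>          /* SIGKILL */
#include <sys/prctl.h>       /* prctl, PR_SET_SECCOMP */
#include <sys/syscall.h>     /* SYS_exit, SYS_getpid */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from any existing code base.
 * Forks a child that enters mode 1 seccomp and then issues getpid().
 * Returns  1 if the child was killed with SIGKILL,
 *          0 if the child survived the forbidden syscall,
 *         -1 if fork()/waitpid() failed or mode 1 could not be enabled.
 */
int strict_seccomp_kills_getpid(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        static const char msg[] = "child: in seccomp mode 1\n";
        ssize_t n;

        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(2);            /* mode 1 not available in this environment */

        /* write() is still on the allowed list. */
        n = write(STDERR_FILENO, msg, sizeof(msg) - 1);
        (void)n;

        /* getpid() is not on the list, so the kernel should SIGKILL us here. */
        syscall(SYS_getpid);

        /*
         * Only reached if the kernel did not kill the child. Use the raw
         * exit syscall, since glibc's _exit() would issue exit_group().
         */
        syscall(SYS_exit, 0);
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
        return 1;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 2)
        return -1;
    return 0;
}
```

Calling strict_seccomp_kills_getpid() from an otherwise ordinary program
should return 1 on a kernel built with seccomp support. The fact that
even a harmless getpid() is fatal shows how little a mode 1 process can
do.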

Qemu on my box in just a couple of seconds of strace was found to use
futex, ioctl, read, rt_sigaction, select, timer_gettime, timer_settime,
and write. I'm sure that other well defined processes have similarly
short lists of required syscalls. I think a more flexible seccomp which
lets one remove syscalls from the allowed set (but never add them back)
can GREATLY reduce the kernel attack surface from malicious processes.
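
The "remove but never add back" rule can be sketched in userspace C as a
bitmap whose only mutators clear bits. This is an illustration of the
semantics, not a proposed kernel interface. Every name below
(struct syscall_set, the syscall_set_*() helpers, SKETCH_MAX_SYSCALLS)
is hypothetical, and the bound of 512 assumes an architecture with small
syscall numbers, such as x86_64.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>     /* SYS_* numbers */

#define SKETCH_MAX_SYSCALLS  512   /* assumed upper bound, illustration only */
#define SKETCH_WORD_BITS     (8 * sizeof(unsigned long))
#define SKETCH_WORDS \
    ((SKETCH_MAX_SYSCALLS + SKETCH_WORD_BITS - 1) / SKETCH_WORD_BITS)

struct syscall_set {
    unsigned long bits[SKETCH_WORDS];
};

/* Initial state: everything allowed, as for an unconfined process. */
void syscall_set_allow_all(struct syscall_set *s)
{
    memset(s->bits, 0xff, sizeof(s->bits));
}

bool syscall_set_allowed(const struct syscall_set *s, unsigned int nr)
{
    if (nr >= SKETCH_MAX_SYSCALLS)
        return false;
    return (s->bits[nr / SKETCH_WORD_BITS] >> (nr % SKETCH_WORD_BITS)) & 1UL;
}

/* Drop one syscall. No helper ever sets a bit in the caller's set. */
void syscall_set_drop(struct syscall_set *s, unsigned int nr)
{
    if (nr < SKETCH_MAX_SYSCALLS)
        s->bits[nr / SKETCH_WORD_BITS] &= ~(1UL << (nr % SKETCH_WORD_BITS));
}

/*
 * Keep only the syscalls listed in keep[]. Done as an intersection, so a
 * syscall that was dropped earlier stays dropped even if it is listed.
 */
void syscall_set_restrict_to(struct syscall_set *s,
                             const unsigned int *keep, size_t n)
{
    struct syscall_set mask;
    size_t i;

    memset(mask.bits, 0, sizeof(mask.bits));
    for (i = 0; i < n; i++)
        if (keep[i] < SKETCH_MAX_SYSCALLS)
            mask.bits[keep[i] / SKETCH_WORD_BITS] |=
                1UL << (keep[i] % SKETCH_WORD_BITS);
    for (i = 0; i < SKETCH_WORDS; i++)
        s->bits[i] &= mask.bits[i];
}

/* Narrow a set to the syscalls seen in the qemu strace run above. */
void syscall_set_restrict_to_qemu_sample(struct syscall_set *s)
{
    static const unsigned int seen[] = {
        SYS_futex, SYS_ioctl, SYS_read, SYS_rt_sigaction,
#ifdef SYS_select
        SYS_select,              /* not defined on every architecture */
#endif
        SYS_timer_gettime, SYS_timer_settime, SYS_write,
    };

    syscall_set_restrict_to(s, seen, sizeof(seen) / sizeof(seen[0]));
}
```

Starting from syscall_set_allow_all() and then calling
syscall_set_restrict_to_qemu_sample() leaves read and write allowed while
SYS_perf_event_open is rejected. A later syscall_set_restrict_to() that
lists a dropped syscall cannot bring it back, which is the property the
paragraph above asks for.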

This is not a sandbox. This is not an LSM replacement. This is a per
syscall cutoff. It can be used to help build a stronger sandbox. I'll
likely see if this can't be used by the SELinux sandbox which already
uses the LSM hooks to control information flow and mediate access. But
SELinux does not control the sheer amount of the kernel code that can be
executed. I believe we can build a stronger sandbox using a flexible
seccomp as one of the tools. All we have to do is find one
vulnerability in the code between the syscall entry and an LSM hook
which would deny the operation to see the value in a per syscall
control mechanism.

Whether to do it in the seccomp code, where it's all of a syscall or
none, or to make use of the filter infrastructure to allow even more
fine grained control over the syscall, is an open question. I'm leaning
more towards just doing it in seccomp. We can't ever build a full and
complete strong sandbox using the filter code. James' assertions about
copy_from_user() are obviously correct. A private chat with PeterZ on
IRC indicated that he also was not interested in seeing this creep into
the tracing code. Do we have a user that can articulate a need for
greater flexibility in their use of such a hardening tool?

I think given all these things I'm going to go back to looking at the
flexible seccomp for now. And maybe we should work towards using the
tracing filter code in the future if someone can articulate a real use
case.....

-Eric
