Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

From: Ingo Molnar
Date: Thu May 26 2011 - 04:25:45 EST



* James Morris <jmorris@xxxxxxxxx> wrote:

> On Wed, 25 May 2011, Linus Torvalds wrote:
>
> > And per-system-call permissions are very dubious. What system
> > calls don't you want to succeed? That ioctl? You just made it
> > impossible to do a modern graphical application. Yet the kind of
> > thing where we would _want_ to help users is in making it easier
> > to sandbox something like the adobe flash player. But without
> > accelerated direct rendering, that's not going to fly, is it?
>
> Going back to the initial idea proposed by Will, where seccomp is
> simply extended to filter all syscalls, there is potential benefit
> in being able to limit the attack surface of the syscall API.

If controlling the system call boundary is found to be useful then
the next logical step is to realize that limiting it to
*only* the syscall boundary is shortsighted.

Also, here's a short reminder of the complexity evolution of this
patch-set, which i've followed since it was first posted in 2009:

bitmask (2009): 6 files changed, 194 insertions(+), 22 deletions(-)
filter engine (2010): 18 files changed, 1100 insertions(+), 21 deletions(-)
event filters (2011): 5 files changed, 82 insertions(+), 16 deletions(-)

Interestingly, the events version is *by far* the most flexible one
in both the short and the long run, and it is also the smallest patch
...

It's a perfect fit and that's not really surprising: system call
boundary hardening is about filtering various key parameters - while
event tracing is about filtering various key parameters as well.

But it goes further than that: SELinux security policies are in
essence primitive event filters as well, on an abstract level - see
below for more details.

And yes, the primitive, coarse, per syscall allow/disallow bitmask v1
version would not be too painful to the core kernel in terms of code
impact and interaction with other code (it does not interact at all)
- but it would still be sadly shortsighted to not explore the event
filters angle, now that we have actual working code.

It would not improve the LSM situation one tiny bit either - the
bitmask design would guarantee that the seccomp approach can never
seriously replace the sucky LSM concepts we have in the kernel today.
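
To make the difference in granularity concrete, here is a minimal sketch
that contrasts a per-syscall bitmask with per-syscall predicates that can
also look at arguments. It is not Will's code: the syscall numbers, the
argument names and the ioctl command values are illustrative assumptions.

```python
# Hypothetical syscall numbers and ioctl commands, for illustration only.
SYS_READ, SYS_WRITE, SYS_OPEN, SYS_IOCTL = 0, 1, 2, 16
ALLOWED_CMD, OTHER_CMD = 0x5401, 0x4B47


def make_bitmask(allowed):
    """Build an integer with one bit set per allowed syscall number."""
    mask = 0
    for nr in allowed:
        mask |= 1 << nr
    return mask


def bitmask_allows(mask, nr):
    """Coarse check: only the syscall number is consulted."""
    return bool(mask & (1 << nr))


def filter_allows(filters, nr, args):
    """Finer check: a per-syscall predicate may inspect the arguments.

    A syscall without a predicate is denied.
    """
    pred = filters.get(nr)
    return pred is not None and bool(pred(args))


mask = make_bitmask([SYS_READ, SYS_WRITE, SYS_IOCTL])

filters = {
    SYS_READ: lambda args: True,
    SYS_WRITE: lambda args: args["fd"] in (1, 2),
    SYS_IOCTL: lambda args: args["cmd"] == ALLOWED_CMD,
}
```

With the bitmask, ioctl is either fully open or fully closed. With
predicates, `filter_allows(filters, SYS_IOCTL, {"cmd": OTHER_CMD})` is
denied while the one allowed command still goes through.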

> This is not security mediation in terms of interaction between
> things (e.g. "allow A to read B"). It's a _hardening_ feature
> which prevents a process from being able to invoke potentially
> hundreds of syscalls it has no need for. It would allow us to
> usefully restrict some well-established attack modes, e.g.
> triggering bugs in kernel code via unneeded syscalls.

If you think about it then you'll see that this artificial
distinction between 'mediation' and 'hardening' is nonsense!

If we add the appropriate file label field to VFS tracing events
(which would be useful for many instrumentation reasons as well) then
the event filtering variant of Will's patch:

_will be able to do object level security mediation too_

What is at the core of every access control concept, be that DAC,
MAC, RBAC or ACL? Flexible task specific set of access vectors to
file and other labeled objects, which cannot be circumvented by that
task.

How can we implement a user-space file object manager via Will's
event filters approach? It's actually pretty easy:

- a simple object manager wants to know 'who' does something, 'what'
it is trying to access, and then wants to generate an allow/deny
action as a function of the (who,what) parameters:

- The 'who' is a given as the event filters are per
task, so different tasks with different roles can have
different event filters. This is the equivalent of the current
task's security context. [ Event filters installed by the
parent cannot be removed by child tasks (they cannot even read
them - it's transparent). ]

- The most fine-grained representation of 'what' are inode
numbers. Because we do not want to generate rules for every
single object we want to group objects and want to define
access rules on different groups. This can be done by defining
an event that emits file labels.

So a simple object manager would simply use file label event
attributes and would define simple rules like:

"(label & tmp_t) || (label & user_home_t)"

to allow access to /tmp and /home files. Filters allow us to define
arbitrary access vectors to objects in essence. The above filters get
passed to the kernel as an ASCII string by the object manager task,
where the filter engine parses it safely and creates atomic
predicates out of it, which can then be executed at the source of any
event.
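
As a rough user-space illustration of that flow (ASCII string in, parsed
once, predicates out), here is a minimal sketch. It is not the kernel's
filter engine: the grammar is a small subset (`field & label`, `&&`, `||`,
parentheses), and the label bit values and all function names are
assumptions made for this example.

```python
import re

# Hypothetical label bits; the message does not specify a label encoding.
LABELS = {"tmp_t": 0x1, "user_home_t": 0x2, "etc_t": 0x4}

_TOKEN = re.compile(r"\s*(\|\||&&|&|\(|\)|[A-Za-z_]\w*)")
_IDENT = re.compile(r"[A-Za-z_]\w*")


def tokenize(text):
    """Split a filter string into tokens, rejecting anything unexpected."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"bad filter near {text[pos:]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def compile_filter(text):
    """Parse the filter once and return a predicate over event dicts."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise ValueError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def atom():
        # atom := '(' or_expr ')' | FIELD '&' LABEL
        if peek() == "(":
            take("(")
            pred = or_expr()
            take(")")
            return pred
        field = take()
        if not _IDENT.fullmatch(field):
            raise ValueError(f"expected a field name, got {field!r}")
        take("&")
        name = take()
        if name not in LABELS:
            raise ValueError(f"unknown label {name!r}")
        bit = LABELS[name]
        return lambda event: bool(event[field] & bit)

    def and_expr():
        pred = atom()
        while peek() == "&&":
            take("&&")
            left, right = pred, atom()
            pred = lambda event, l=left, r=right: l(event) and r(event)
        return pred

    def or_expr():
        pred = and_expr()
        while peek() == "||":
            take("||")
            left, right = pred, and_expr()
            pred = lambda event, l=left, r=right: l(event) or r(event)
        return pred

    pred = or_expr()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return pred


allow = compile_filter("(label & tmp_t) || (label & user_home_t)")
```

Here `allow({"label": LABELS["tmp_t"]})` permits the access, while an
event carrying only the `etc_t` bit is refused. Malformed input raises
`ValueError` at compile time, before any event is evaluated.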

[ We could even implement a transparent AVC-cache equivalent for
filters, should the complexity and popularity of them increase:
ASCII filters lend themselves very well to hash based caches. ]
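
One way such a cache could look, sketched in user-space terms: decisions
are memoized under the key (filter string, label), so a repeated check
becomes a hash lookup. The class name, the capacity handling and the
stand-in evaluator below are all assumptions for illustration; nothing
like this exists in the patch.

```python
class DecisionCache:
    """Memoize allow/deny decisions keyed by (filter text, label)."""

    def __init__(self, evaluate, capacity=1024):
        self._evaluate = evaluate   # callable(filter_text, label) -> bool
        self._capacity = capacity
        self._entries = {}
        self.hits = 0
        self.misses = 0

    def check(self, filter_text, label):
        key = (filter_text, label)
        if key in self._entries:
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        decision = self._evaluate(filter_text, label)
        if len(self._entries) >= self._capacity:
            # Evict the oldest entry; dicts preserve insertion order.
            self._entries.pop(next(iter(self._entries)))
        self._entries[key] = decision
        return decision


def stand_in_evaluate(filter_text, label):
    """Placeholder for a real filter engine: allows labels with bit 0 or 1."""
    return bool(label & 0x3)


cache = DecisionCache(stand_in_evaluate, capacity=2)
```

The first `cache.check(...)` for a given pair is a miss and runs the
evaluator; the same pair afterwards is answered from the table.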

Similarly, support for other types of object tagging, network labels,
etc. can be added as well with little pain: they can be added without
any change to the basic ABI! Using event filters here makes it a
very extensible security concept.

It is capable of implementing the functional equivalent of most MAC,
DAC, RBAC and other access control concepts, purely in user-space -
in addition to 'hardening' (which btw. is really access control too,
in disguise).

Obviously it is all layered: it is only allowed to control access to
objects that all the other security concepts allow it to access - i.e.
this is an unprivileged LSM, a per application security layer if you
will, that can further refine security policies.
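
The layering can be modelled in a few lines: every filter a task carries
must agree before an access is allowed, children inherit the parent's
filters, and there is deliberately no way to drop one. This is a toy
model under those assumptions; the `Task` class and its method names are
hypothetical, and string labels stand in for whatever the kernel would
emit.

```python
class Task:
    """Toy model of a task carrying a stack of event filters."""

    def __init__(self, filters=()):
        self._filters = tuple(filters)

    def fork(self):
        # The child starts with every filter the parent has.
        return Task(self._filters)

    def add_filter(self, pred):
        # Filters can only be added; there is intentionally no remove.
        self._filters += (pred,)

    def allows(self, event):
        # Every layer must allow the event, so a new layer can only narrow.
        return all(pred(event) for pred in self._filters)


# The first layer stands in for whatever the system-wide policy permits.
parent = Task([lambda e: e["label"] in {"tmp_t", "user_home_t", "etc_t"}])
parent.add_filter(lambda e: e["label"] in {"tmp_t", "user_home_t"})

child = parent.fork()
child.add_filter(lambda e: e["label"] == "tmp_t")
```

The child ends up strictly narrower than the parent, and even an
always-true filter added later cannot widen what the earlier layers
already refuse.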

In terms of security models this event filters approach is
unconditional goodness in my opinion.

> This is orthogonal to access control schemes (such as SELinux),
> which are about mediating security-relevant interactions between
> objects.

It's only 'orthogonal' because IMO you make two fundamental mistakes:

1) You arbitrarily limit SELinux to object based security measures
alone.

Which is not even true btw: SELinux certainly has some hooks it
uses for pragmatic non-object hardening: for example all the
places where we fall back to capabilities are places where
there's a method-based restriction, not an object-based restriction.

The KDSKBENT ioctl check for example in
security/selinux/hooks.c::selinux_file_ioctl(), or
selinux_vm_enough_memory(), or the CAP_MAC_ADMIN exception in
selinux_inode_getsecurity() all violate 'pure' DAC concepts but
obviously for pragmatic reasons SELinux is doing these ...

mmap_min_addr is a borderline method restriction feature as well:
it does not really control access to the underlying object (RAM),
but controls one (of many) access methods to it by controlling
virtual memory ...

So SELinux, in a rather hypocritical fashion is already involved
in hardening and in filtering, because obviously any practical
and pragmatic security system *has to*.

2) You arbitrarily limit Will's patch to *not* be able to
implement object based security mechanisms. Why?

Syscall hardening and object based access rules are *deeply*
connected, conceptually they are subsets of one and the same thing: a
good, organic security model controlling different hierarchies of
physical and derived (virtual) resources, which allows flexible
control of both objects *and* methods.

The 'methods' (the syscalls and other functionality) are *also* a
derived resource so it's entirely legitimate to control access to
them. Yes, because they are methods you can also try to use them to
restrict access to underlying objects - this is what AppArmor is
about mostly, and yes i agree that in the general case it's not a
particularly robust method.

And yes, i fully submit that object access control has theoretical
advantages and it should often be the primary measure that gives a
robust, often provable backbone to a secure system.

But you'd be out of your mind to not recognize:

- The utility of controlling access methods (as resources) as well,
both to reduce the attack surface in the implementation of those
methods, and to allow the easy summary control of objects where
there's only a low number of (and often only a single!) access
method.

- The utility of unprivileged security frameworks.

- The utility of stackable security features. (defense in depth,
anyone?)

Will's astonishingly small patch:

event filters (2011): 5 files changed, 82 insertions(+), 16 deletions(-)

gives us *all three* of those, while also allowing user-space
implemented MAC, DAC, RBAC as well.

> One area of possible use is KVM/Qemu, where processes now contain
> entire operating systems, and the attack surface between them is
> now much broader e.g. a local unprivileged vulnerability is now
> effectively a 'remote' full system compromise.

Note that the main reason why Qemu needs access method hardening is
because it has a dominantly state machine based design which does not
lend itself very well to an object manager security design.

Note that tools/kvm/ would probably like to implement its own object
manager model as well in addition to access method restrictions: by
being virtual hardware it deals with many resources and object
hierarchies that are simply not known to the host OS's LSM.

Unlike Qemu, tools/kvm/ has a design that is very fit for MAC
concepts: it uses separate helper threads for separate resources
(this could in many cases even be changed to be separate processes
which only share access to the guest RAM image) - while Qemu is in
most parts a state machine, so in tools/kvm/ we can realistically
have a good object manager and keep an exploit in a networking
interface driver from being able to access disk driver state.

(I've Cc:-ed Pekka for tools/kvm/.)

> There has been some discussion of this within the KVM project.
> Using the existing seccomp facility is problematic in that it
> requires significant reworking of Qemu to a privsep model, which
> would also then incur a likely unacceptable context switching
> overhead. The generalized seccomp filter as proposed by Will would
> provide a significant reduction in exposed syscalls and thus
> guest->host attack surface.

... and the event filter based method would *also* allow MAC to be
defined over physical resources, such as virtual network interfaces,
virtual disk devices, etc.

You are seriously limiting the capabilities of this feature for no
good reason i can recognize.

Thanks,

Ingo