Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is andhow it works.

From: Will Drewry
Date: Wed May 04 2011 - 05:29:27 EST


On Wed, May 4, 2011 at 2:15 AM, Will Drewry <wad@xxxxxxxxxxxx> wrote:
> On Mon, May 2, 2011 at 6:47 PM, Frederic Weisbecker <fweisbec@xxxxxxxxx> wrote:
>> 2011/5/3 Frederic Weisbecker <fweisbec@xxxxxxxxx>:
>>> On Fri, Apr 29, 2011 at 11:13:44AM -0500, Will Drewry wrote:
>>>> That said, I have a general interface question :)  Right now I have:
>>>> prctl(PR_SET_SECCOMP, 2, SECCOMP_FILTER_ADD, syscall_nr, filter_string);
>>>> prctl(PR_SET_SECCOMP, 2, SECCOMP_FILTER_DROP, syscall_nr,
>>>> filter_string_or_NULL);
>>>> prctl(PR_SET_SECCOMP, 2, SECCOMP_FILTER_APPLY, apply_flags);
>>>>   (I will change this to default to apply_on_exec and let FILTER_APPLY
>>>> make it apply _now_ exclusively. :)
>>>>
>>>> This can easily be mapped to:
>>>> prctl(PR_SET_SECCOMP
>>>>        PR_SET_SECOMP_FILTER_ADD
>>>>        PR_SET_SECOMP_FILTER_DROP
>>>>        PR_SET_SECOMP_FILTER_APPLY
>>>> if that'd be preferable (to keep it all in the prctl.h world).
>>>>
>>>> Following along the suggestion of reducing custom parsing, it seemed
>>>> to make a lot of sense to make add and drop actions very explicit.
>>>> There is no guesswork so a system call filtered process will only be
>>>> able to perform DROP operations (if prctl is allowed) to reduce the
>>>> allowed system calls.  This also allows more fine grained flexibility
>>>> in addition to the in-kernel complexity reduction.  E.g.,
>>>> Process starts with
>>>>   __NR_read, "fd == 1"
>>>>   __NR_read, "fd == 2"
>>>> later it can call:
>>>>   prctl(PR_SET_SECCOMP, 2, SECCOMP_FILTER_DROP, __NR_read, "fd == 2");
>>>> to drop one of the filters without disabling "fd == 1" reading.  (Or
>>>> it could pass in NULL to drop all filters).
>>>
>>> Hm, but then you don't let the childs be able to restrict further
>>> what you allowed before.
>>>
>>> Say I have foo(int a, int b), and I apply these filters:
>>>
>>>        __NR_foo, "a == 1";
>>>        __NR_foo, "a == 2";
>>>
>>> This is basically "a == 1 || a == 2".
>>>
>>> Now I apply the filters and I fork. How can the child
>>> (or current task after the filter is applied) restrict
>>> further by only allowing "b == 2", such that with the
>>> inherited parent filters we have:
>>>
>>>        "(a == 1 || a == 2) && b == 2"
>>>
>>> So what you propose seems to me too limited. I'd rather have this:
>>>
>>> SECCOMP_FILTER_SET = remove previous filter entirely and set a new one
>>> SECCOMP_FILTER_GET = get the string of the current filter
>>>
>>> The rule would be that you can only set a filter that is intersected
>>> with the one that was previously applied.
>>>
>>> It means that if you set filter A and you apply it. If you want to set
>>> filter B thereafter, it must be:
>>>
>>>        A && B
>>>
>>> OTOH, as long as you haven't applied A, you can override it as you wish.
>>> Like you can have "A || B" instead. Or you can remove it with "1". Of course
>>> if a previous filter was applied before A, then your new filter must be
>>> concatenated: "previous && (A || B)".
>>>
>>> Right? And note in this scheme you can reproduce your DROP trick. If
>>> "A || B" is the current filter applied, then you can restrict B by
>>> doing: "(A || B) && A".
>>>
>>> So the role of SECCOMP_FILTER_GET is to get the string that matches
>>> the current applied filter.
>>>
>>> The effect of this is infinite of course. If you apply A, then apply
>>> B then you need A && B. If later you want to apply C, then you need
>>> A && B && C, etc...
>>>
>>> Does that look sane?
>>>
>>
>> Even better: applying a filter would always automatically be an
>> intersection of the previous one.
>>
>> If you do:
>>
>> SECCOMP_FILTER_SET, __NR_foo, "a == 1 || a == 2"
>> SECCOMP_FILTER_APPLY
>> SECCOMP_FILTER_SET, __NR_foo, "b == 2"
>> SECCOMP_FILTER_APPLY
>> SECCOMP_FILTER_SET, __NR_foo, "c == 3"
>> SECCOMP_FILTER_APPLY
>>
>> The end result is:
>>
>> "(a == 1 || a == 2) && b == 2  && c == 3"
>>
>> So that we don't push the burden in the kernel to compare the applied
>> expression with a new one that may or may not be embraced by parenthesis
>> and other trickies like that. We simply append to the working one.
>>
>> Ah and OTOH this:
>>
>> SECCOMP_FILTER_SET, __NR_foo, "a == 1 || a == 2"
>> SECCOMP_FILTER_APPLY
>> SECCOMP_FILTER_SET, __NR_foo, "b == 2"
>> SECCOMP_FILTER_SET, __NR_foo, "c == 3"
>>
>> has the following end result:
>>
>> "(a == 1 || a == 2) && c == 3"
>>
>> As long as you don't apply the filter, the temporary part is
>> overriden, but still we keep
>> the applied part.
>>
>> Still sane? (or completely nuts?)
>
> Okay - so I *think* I'm following.  I really like the use of SET and
> GET to allow for further constraint based on additional argument
> restrictions instead of purely reducing the filters available.  The
> only part I'm stumbling on is using APPLY on a per-filter basis.  In
> my current implementation, I consider APPLY to be the global enable
> bit.  Whatever filters are set become set in stone and only &&s are
> handled.  I'm not sure I understand why it would make sense to do a
> per-syscall-filter apply call.  It's certainly doable, but it will
> mean that we may be logically storing something like:
>
> __NR_foo: (a == 1 || a == 2), applied
> __NR_foo: b == 2, not applied
> __NR_foo: c == 3, not applied
>
> after
>
> SECCOMP_FILTER_SET, __NR_foo, "a == 1 || a == 2"
> SECCOMP_FILTER_APPLY
> SECCOMP_FILTER_SET, __NR_foo, "b == 2"
> SECCOMP_FILTER_SET, __NR_foo, "c == 3"
>
> In that case, would a call to sys_foo even be tested against the
> non-applied constraints of b==2 or c==3? Or would the call to set "c
> == 3" replace the "b == 2" entry.  I'm not sure I see that the benefit
> exceeds the ambiguity that might introduce.  However, if the default
> behavior it to always extend with &&, then a consumer of the interface
> could just do:
>
> SECCOMP_FILTER_SET, __NR_prctl, "option == 2"
> SECCOMP_FILTER_SET, __NR_foo, "a == 1 || a == 2"
> SECCOMP_FILTER_SET, __NR_foo, "b == 2"
> SECCOMP_FILTER_APPLY
>
> This would yield a final filter for foo of "(a == 1 || a == 2) && b ==
> 2".  The call to APPLY would initiate the enforcement of the syscall
> filtering and enforce that no new filters may be added for syscalls
> that aren't already constrained.  So you could still call
>
> SECCOMP_FILTER_SET, __NR_foo, "0"
>
> which is logically "((a == 1 || a == 2) && b == 2) && 0" and would be
> interpreted as just a DROP. But you could not do,
>
> SECCOMP_FILTER_SET, __NR_foo, "0"
> SECCOMP_FILTER_SET, __NR_foo, "1"
> or
> SECCOMP_FILTER_SET, __NR_read, "1"
>
> (since the implicit filter for all syscalls after an APPLY call should
> be "0" and additions would just be "0 && whatever").
>
> Am I missing something?  If that makes sense, then we may even be able
> to reduce the extra directives by one and get a resulting interface
> that looks something like:
>
> /* Appends (&&) a new filter string for a syscall to the current
> filter value. "0" clears the filter. */
> prctl(PR_SET_SECCOMP_FILTER, syscall_nr, filter_string);
> /* Returns the current explicit filter string for a syscall */
> prctl(PR_GET_SECCOMP_FILTER, syscall_nr, buf, buflen);
> /* Transition to a secure computing mode:
>  * 1 - enables traditional seccomp behavior
>  * 2 - enables seccomp filter enforcement and changes the implicit
> filter for syscalls from "1" to "0"
>  */
> prctl(PR_SET_SECCOMP, mode);
>
> I'll set aside the v2 of the patch that uses ADD, DROP, and APPLY and
> work up a version with this interface.  Do you (or anyone else :) feel
> strongly about per-syscall APPLY?  I like the above version, but I'm
> certainly willing to explore the other path.  As is, I'll go through
> my usecases (and tests once I have a new cut of the patch) and see how
> it feels.  At first blush, this appears more succinct _and_ more
> expressive than the prior versions!
>
> ~~
>
> As to the use of apply_on_exec, even if you whitelist: mmap, fstat64,
> brk, uname, open, read, close, set_thread_area, mprotect, munmap, and
> access _just_ to allow a process to be exec'd, it is still a
> significant reduction in kernel attack surface.  Pairing that with a
> LSM, delayed chroot, etc could fill in the gaps with respect to a
> greater sandboxing solution.  I'd certainly take that tradeoff for
> running binaries that I don't control the source for :)  That said,
> LD_PRELOAD, ptrace injection, and other tricks could allow for the
> injection of a very targeted filterset, but I don't think that
> invalidates the on-exec case given the brittleness relying exclusively
> on such tactics.  Does that seem reasonable?

With a little more thinking :), I don't think there's an obvious hook
for "on-exec" bit in the interface proposed above. I think that might
be a worthwhile tradeoff given how much cleaner it is. It'd still be
possible to write a launcher to provide a reduced kernel surface, it'd
just also need a filter for the exec syscall.

cheers!
will
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/