Re: [PATCH v5 2/3] seccomp_filters: system call filtering using BPF

From: Will Drewry
Date: Wed Feb 01 2012 - 04:02:45 EST

On Tue, Jan 31, 2012 at 5:36 PM, Indan Zupancic <indan@xxxxxx> wrote:
> On Tue, January 31, 2012 12:04, Will Drewry wrote:
>> On Mon, Jan 30, 2012 at 7:42 PM, Indan Zupancic <indan@xxxxxx> wrote:
>>>> I vote for:
>>>> 3. Add tracehook support to all archs.
>> I don't see these #3 as mutually exclusive :)
> They are if you really add tracehook support to all archs. ;-)
>> tracehook requires:
>> - task_pt_regs()          in asm/processor.h or asm/ptrace.h
>> - arch_has_single_step()  if there is hardware single-step support
>> - arch_has_block_step()   if there is hardware block-step support
>> - asm/syscall.h           supplying asm-generic/syscall.h interface
>> - linux/regset.h          user_regset interfaces
>> - CORE_DUMP_USE_REGSET    #define'd in linux/elf.h
>> -TIF_SYSCALL_TRACE       calls tracehook_report_syscall_{entry,exit}
>> - TIF_NOTIFY_RESUME       calls tracehook_notify_resume()
>> - signal delivery         calls tracehook_signal_handler()
> Okay, that's a bit fuzzier than I expected. I suppose the archs implement
> some of that in another way currently?
>>>> Maybe not all archs, but at least some more. That way, every time someone
>>>> adds something tracehook specific, more archs support it.
>> Well the other arch I want this on specifically for my purposes is
>> arm, but someone recently posted a partial asm/syscall.h for arm, but
>> I didn't see that one go anywhere just yet.  (I know syscall_get_nr
>> can be tricky on arm because it doesn't (didn't) have a way of
>> tracking in-syscall state.)
>> ref:
> That totally ignores OABI, it should depend on CONFIG_AEABI and
>>>> syscall.h has no TRACEHOOK defines or anything though.
>> Nope - it is just part of what is expected.
>>>> Only syscall_rollback() looks tricky. I have no clue what the difference
>>>> between syscall_get_error() and syscall_get_return_value() is. But you
>>>> only need to add syscall_get_nr() and syscall_[gs]et_arguments(), which
>>>> should be possible for all archs.
>> It seems even syscall_get_nr can have some wrinkles :)
>> ref:
> That's 2009! I wonder why no progress happened since then.
> At least for SECCOMP you get the syscall nr directly as a parameter,
> so you at least don't actually need syscall_get_nr(). Same for ptrace.
> That doesn't help for /proc/$PID/syscall or coredumps though.
> One solution I can think of for the OABI compat problem is to let the
> OABI entry path store the syscall nr in thread_info->syscall and set
> an OABI task flag somewhere. Then syscall_get_nr() can check that to
> decide between r7 or the stored value. This would help /proc and doesn't
> cause any overhead for non-compat syscalls.
> Another ugly way of doing it would be to store the trap instruction for
> OABI calls, and when user space does a PTRACE_PEEK_TEXT on that it returns
> the saved instruction instead of rereading the actual memory. This avoids
> races and lets current user space work without modifications. Problem is
> that user space can't easily check if it's running on a fixed kernel.
> All in all I propose to only support this stuff for kernels not having OABI
> support, as that keeps things nice and simple, which was probably the point
> of moving to a new ARM ABI anyway. Tough, but that's what you get when you
> change ABIs.
>>>> How many archs don't support tracehook?
>> 14 out of 26.  However, 5 of those still have asm/syscall.h
> That's a lot of tricky work.
>>>> Well, the thing is, this recursion is controlled by user space depending
>>>> on how many filters they have installed. What is preventing them to force
>>>> you out of stack?
>> Hrm true.  The easiest option is to just convert it to iterative by
>> not using kref_t, but I'll look more closely.
> Probably.
>>>> So perhaps add at least some arbitrary filter limit to avoid this?
>> Definitely possible -- probably as a sysctl.  I'm not quite sure what
>> number makes sense yet, but I'll look at breaking the recursion first.
>>  Thanks!
> Please no sysctl. The point of arbitrary limits is that they are
> arbitrarily high, but low enough to avoid any problems. A limit
> of 1K filters should be plenty for user space, but still put a
> limit on the total (memory) overhead of filters.

I'd love to avoid a limit at all, but picking something in a sane
range makes perfect sense.

>>>>> I'll clarify a bit.  My original ptrace integration worked such that a
>>>>> tracer may only intercept syscall failures if it attached prior to the
>>>>> failing filter being installed.  I did it this way to avoid using
>>>>> ptrace to escape system call filtering.  However, since I don't have
>>>>> that as part of the patch series, it doesn't make sense to keep it. (I
>>>>> tracked a tracer pid_struct in the filters.)  If it needs to come back
>>>>> with later patchsets, then it can be tackled then!
>>>> The problem of that is that filters can be shared between processes with
>>>> different ptracers. And you have all the hassle of keeping it up to date.
>>>> I think seccomp should always come first and just trust ptrace. This
>>>> because it can deny all ptrace() calls for filtered tasks, so the only
>>>> untrusted tasks doing ptrace() are outside of seccomp's filtering control.
>>>> And those could do the same system calls themselves.
>>>> The case where there is one task being filtered and allowed to do ptrace,
>>>> but not some other syscall, ptracing another filtered task which isn't
>>>> allowed to do ptrace, but allowed to do that other syscall, is quite far
>>>> fetched I think. If you really want to handle this, then you could run
>>>> the ptracer's filters against the tracee's post-ptrace syscall state.
>>>> This is best done in the ptracer's context, just before continuing the
>>>> system call. (You really want Oleg's SIKILL immediate patch then.)
>>>> What about:
>>>> 1) Seccomp filters can deny a syscall by killing the task.
>>>> 2) Seccomp can deny a syscall and let it return an error value.
>>>>   I know you're not fond of this one. It's just a performance
>>>>   optimisation as sometimes a lot of denied but harmless syscalls
>>>>   are called by glibc all the time, like getxattr(). Hardcoding
>>>>   the kill policy seems wrong when it can be avoided. If this is
>>>>   too hard then skip it, but if it's easy to add then please do.
>>>>   I'll take a look at this with your next patch version.
>> It's easy on x86 harder on other arches.  I would suggest saving
>> changing the __secure_computing signature until after the core
>> functionality lands, but that's just me.
> The only problem of that is that you will have two versions of seccomp
> filtering in the wild: One that does support this, and one that doesn't.

Hrm. It'd be the addition of a return value which is essentially an
expansion of the possible filters. It wouldn't make the existing
filters any less valid -- it's just that they wouldn't return errors.

However, I realize there's a desire to have all the pieces in place
upfront. I'll see how much work this really pans out to being. On x86,
it's easy. Some other arches, a little less so, but probably not too

> It seems cleaner to consolidate the seccomp entry path with the ptrace path,
> and do the seccomp check first in there. This saves a few instructions on
> some archs for the seccomp check too and would simplify the entry code of
> most archs.
> Ptrace is expected to change the syscall nr already, so that can be used
> to set it to -1 or something. A different return value than -ENOSYS can
> be set in the syscall exit path, if required. All this can be shared with

That's roughly how it is on x86 now except seccomp is after all the
slow-path copy stuff. It'd be cool to bump it up in front of that
work then pass through its return value. I'll poke around at this and
look at the use of __secure_computing on the other arches to make sure
I understand the impact. Maybe it is easier than I first thought it'd

>>>> 3) Seccomp can allow a syscall to proceed normally.
>>>> 4) Seccomp can set a hint to skip ptrace syscall events for this syscall.
>>>>   A filter sets this by returning a specific value.
>>>> 5) Ptrace always gets a syscall event when it asked for it.
>>>> 6) Ptrace can set an option to honour seccomp's hint and to not get all
>>>>   syscall events.
>>>> This way all seccomp needs to do is to set some flags which ptrace can check.
>> I like the use of flags/options to trigger ptrace handling.  If I were
>> to stack rank these for pursuit after the core functionality lands,
>> it'd be to add #6 (and its deps) then #2.  With #6, #2 can be
>> simulated (by having a supervisor that changes the syscall number to
>> -1), but that is much less ideal than just returning SECCOMP_ERROR
>> instead of SECCOMP_ALLOW/DENY and letting an error code get bubbled
>> up.
> But it would require two filter versions and hence make it more difficult
> to generally support BPF syscall filtering.

True. I'm not quite sure that it makes sense to have the BPF program
decide what the tracer sees or doesn't see. I can see it as a nice
optimization for a sandbox implementation, but I could see something
similar being done purely by letting a tracer catch disallowed
syscalls (possibly via it registering an option indicating it wants
seccomp_events). It wouldn't be quite as flexible, but it would avoid
the simple filter becoming a more complex piece of logic and if a
returned error were allowed (that didn't trigger ptrace), you'd not
see a ridiculous amount of overhead from pointless syscalls. I'm not
sure though.

I'll poke at the ptrace bits too and see if one approach seems to fit
better than another.

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at