Re: [PATCH 2/2] sys, seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC

From: Andy Lutomirski
Date: Tue Jan 14 2014 - 15:24:45 EST

Next message: Andy Lutomirski: "Re: [PATCH v5 13/14] locks: skip deadlock detection on FL_FILE_PVT locks"
Previous message: Peter Zijlstra: "Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()"
In reply to: Will Drewry: "Re: [PATCH 2/2] sys, seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC"
Next in thread: Will Drewry: "Re: [PATCH 2/2] sys, seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, Jan 14, 2014 at 10:59 AM, Will Drewry <wad@xxxxxxxxxxxx> wrote:
> On Mon, Jan 13, 2014 at 5:36 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> On 01/13/2014 12:30 PM, Will Drewry wrote:
>>> Applying restrictive seccomp filter programs to large or diverse
>>> codebases often requires handling threads which may be started early in
>>> the process lifetime (e.g., by code that is linked in). While it is
>>> possible to apply permissive programs prior to process start up, it is
>>> difficult to further restrict the kernel ABI to those threads after that
>>> point.
>>>
>>> This change adds a new seccomp "extension" for synchronizing thread
>>> group seccomp filters and a prctl() for accessing that functionality.
>>> The need for the added prctl() is due to the lack of reserved arguments
>>> in PR_SET_SECCOMP.
>>>
>>> When prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT_TSYNC, 0, 0) is called, it
>>> will attempt to synchronize all threads in current's threadgroup to its
>>> seccomp filter program. This is possible iff all threads are using a
>>> filter that is an ancestor to the filter current is attempting to
>>> synchronize to. NULL filters (where the task is running as
>>> SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be
>>> transitioned into SECCOMP_MODE_FILTER. On success, 0 is returned. On
>>> failure, the pid of one of the failing threads will be returned.
>>>
>>> Suggested-by: Julien Tinnes <jln@xxxxxxxxxxxx>
>>> Signed-off-by: Will Drewry <wad@xxxxxxxxxxxx>
>>> ---
>>> include/linux/seccomp.h | 7 +++
>>> include/uapi/linux/prctl.h | 6 ++
>>> include/uapi/linux/seccomp.h | 6 ++
>>> kernel/seccomp.c | 128 ++++++++++++++++++++++++++++++++++++++++++
>>> kernel/sys.c | 3 +
>>> 5 files changed, 150 insertions(+)
>>>
>>> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>>> index 85c0895..3163db6 100644
>>> --- a/include/linux/seccomp.h
>>> +++ b/include/linux/seccomp.h
>>> @@ -77,6 +77,8 @@ static inline int seccomp_mode(struct seccomp *s)
>>> extern void put_seccomp_filter(struct task_struct *tsk);
>>> extern void get_seccomp_filter(struct task_struct *tsk);
>>> extern u32 seccomp_bpf_load(int off);
>>> +extern long prctl_seccomp_ext(unsigned long, unsigned long,
>>> + unsigned long, unsigned long);
>>> #else /* CONFIG_SECCOMP_FILTER */
>>> static inline void put_seccomp_filter(struct task_struct *tsk)
>>> {
>>> @@ -86,5 +88,10 @@ static inline void get_seccomp_filter(struct task_struct *tsk)
>>> {
>>> return;
>>> }
>>> +static inline long prctl_seccomp_ext(unsigned long arg2, unsigned long arg3,
>>> + unsigned long arg4, unsigned long arg5)
>>> +{
>>> + return -EINVAL;
>>> +}
>>> #endif /* CONFIG_SECCOMP_FILTER */
>>> #endif /* _LINUX_SECCOMP_H */
>>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>>> index 289760f..5dcd5d3 100644
>>> --- a/include/uapi/linux/prctl.h
>>> +++ b/include/uapi/linux/prctl.h
>>> @@ -149,4 +149,10 @@
>>>
>>> #define PR_GET_TID_ADDRESS 40
>>>
>>> +/*
>>> + * Access seccomp extensions
>>> + * See Documentation/prctl/seccomp_filter.txt for more details.
>>> + */
>>> +#define PR_SECCOMP_EXT 41
>>> +
>>> #endif /* _LINUX_PRCTL_H */
>>> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
>>> index ac2dc9f..49b5279 100644
>>> --- a/include/uapi/linux/seccomp.h
>>> +++ b/include/uapi/linux/seccomp.h
>>> @@ -10,6 +10,12 @@
>>> #define SECCOMP_MODE_STRICT 1 /* uses hard-coded filter. */
>>> #define SECCOMP_MODE_FILTER 2 /* uses user-supplied filter. */
>>>
>>> +/* Valid extension types as arg2 for prctl(PR_SECCOMP_EXT) */
>>> +#define SECCOMP_EXT_ACT 1
>>> +
>>> +/* Valid extension actions as arg3 to prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT) */
>>> +#define SECCOMP_EXT_ACT_TSYNC 1 /* attempt to synchronize thread filters */
>>> +
>>> /*
>>> * All BPF programs must return a 32-bit value.
>>> * The bottom 16-bits are for optional return data.
>>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>>> index 71512e4..8a0de7b 100644
>>> --- a/kernel/seccomp.c
>>> +++ b/kernel/seccomp.c
>>> @@ -24,6 +24,7 @@
>>> #ifdef CONFIG_SECCOMP_FILTER
>>> #include <asm/syscall.h>
>>> #include <linux/filter.h>
>>> +#include <linux/pid.h>
>>> #include <linux/ptrace.h>
>>> #include <linux/security.h>
>>> #include <linux/slab.h>
>>> @@ -220,6 +221,108 @@ static u32 seccomp_run_filters(int syscall)
>>> return ret;
>>> }
>>>
>>> +/* Returns 1 if the candidate is an ancestor. */
>>> +static int is_ancestor(struct seccomp_filter *candidate,
>>> + struct seccomp_filter *child)
>>> +{
>>> + /* NULL is the root ancestor. */
>>> + if (candidate == NULL)
>>> + return 1;
>>> + for (; child; child = child->prev)
>>> + if (child == candidate)
>>> + return 1;
>>> + return 0;
>>> +}
>>> +
>>> +/**
>>> + * seccomp_sync_threads: sets all threads to use current's filter
>>> + *
>>> + * Returns 0 on success or the pid of a thread which was either not
>>> + * in the correct seccomp mode or it did not have an ancestral
>>> + * seccomp filter. current must be in seccomp.mode=2 already.
>>> + */
>>> +static pid_t seccomp_sync_threads(void)
>>> +{
>>> + struct task_struct *thread, *caller;
>>> + pid_t failed = 0;
>>> + thread = caller = current;
>>> +
>>> + read_lock(&tasklist_lock);
>>> + if (thread_group_empty(caller))
>>> + goto done;
>>> + while_each_thread(caller, thread) {
>>> + task_lock(thread);
>>> + /*
>>> + * All threads must not be in SECCOMP_MODE_STRICT to
>>> + * be eligible for synchronization.
>>> + */
>>> + if ((thread->seccomp.mode == SECCOMP_MODE_DISABLED ||
>>> + thread->seccomp.mode == SECCOMP_MODE_FILTER) &&
>>> + is_ancestor(thread->seccomp.filter,
>>> + caller->seccomp.filter)) {
>>> + /* Get a task reference for the new leaf node. */
>>> + get_seccomp_filter(caller);
>>> + /*
>>> + * Drop the task reference to the shared ancestor since
>>> + * current's path will hold a reference. (This also
>>> + * allows a put before the assignment.)
>>> + */
>>> + put_seccomp_filter(thread);
>>> + thread->seccomp.filter = caller->seccomp.filter;
>>> + /* Opt the other thread into seccomp if needed.
>>> + * As threads are considered to be trust-realm
>>> + * equivalent (see ptrace_may_access), it is safe to
>>> + * allow one thread to transition the other.
>>> + */
>>> + if (thread->seccomp.mode == SECCOMP_MODE_DISABLED) {
>>> + thread->seccomp.mode = SECCOMP_MODE_FILTER;
>>> + /*
>>> + * Don't let an unprivileged task work around
>>> + * the no_new_privs restriction by creating
>>> + * a thread that sets it up, enters seccomp,
>>> + * then dies.
>>> + */
>>> + if (caller->no_new_privs)
>>> + thread->no_new_privs = 1;
>>> + set_tsk_thread_flag(thread, TIF_SECCOMP);
>>
>> no_new_privs is a bitfield, and some of the other bits in there look
>> like things that might not want to be read and written back from another
>> thread.
>
> Ah :/ Good catch!
>
>> Would it be too annoying to require that the other threads already have
>> no_new_privs set?
>
> Hrm, it's pretty painful in the edge cases where you don't control the
> process initialization which might setup threads you need to ensnare.
>
> Would it be crazy to do something like below in sched.h?
> - unsigned no_new_privs:1;
> + unsigned no_new_privs;

set_bit, etc. would also work. (Although there isn't a 32-bit set_bit
AFAIK, or at least there isn't one that works on 64-bit BE archs.)

Also, is 'unsigned' actually safe for this purpose, on all supported
archs/compilers? I'm pretty sure it's okay by C++11 rules, but those
don't apply here. Maybe some day the kernel will move to C11 and life
will be good.

>
> It feels like a big hammer though, but it also seems weird to wrap those
> bitfields with task_lock. Any suggestions are welcome! I'll think about
> this a bit more and see if there is a good way to do this transition
> safely and cheaply.

Hmm. I bet you could move no_new_privs somewhere else in task_lock
where there's a bit free. It could also go in 'struct creds', but I
think that's even worse from your perspective.

Here's another dumb idea: Add an accessor task_no_new_privs(struct
task_struct *) and move no_new_privs into struct seccomp (i.e. make it
a bit in the seccomp mode). It kind of sucks on !CONFIG_SECCOMP, but
it's free if CONFIG_SECCOMP.

P.S. Have you seen the Linux Capsicum port? It fiddles with seccomp
mode, too, and I suspect it needs a fair amount of work, but I really
like the idea.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Andy Lutomirski: "Re: [PATCH v5 13/14] locks: skip deadlock detection on FL_FILE_PVT locks"
Previous message: Peter Zijlstra: "Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()"
In reply to: Will Drewry: "Re: [PATCH 2/2] sys, seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC"
Next in thread: Will Drewry: "Re: [PATCH 2/2] sys, seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]