Re: Using ftrace/perf as a basis for generic seccomp

From: Frederic Weisbecker
Date: Thu Feb 03 2011 - 14:18:57 EST


On Thu, Feb 03, 2011 at 08:06:45PM +0100, Frederic Weisbecker wrote:
> On Wed, Feb 02, 2011 at 11:45:22AM -0500, Eric Paris wrote:
> > On Wed, 2011-02-02 at 13:26 +0100, Ingo Molnar wrote:
> > > * Masami Hiramatsu <masami.hiramatsu.pt@xxxxxxxxxxx> wrote:
> > >
> > > > Hi Eric,
> > > >
> > > > (2011/02/01 23:58), Eric Paris wrote:
> > > > > On Wed, Jan 12, 2011 at 4:28 PM, Eric Paris <eparis@xxxxxxxxxx> wrote:
> > > > >> Some time ago Adam posted a patch to allow for a generic seccomp
> > > > >> implementation (unlike the current seccomp where your choice is all
> > > > >> syscalls or only read, write, sigreturn, and exit) which got little
> > > > >> traction and it was suggested he instead do the same thing somehow using
> > > > >> the tracing code:
> > > > >> http://thread.gmane.org/gmane.linux.kernel/833556
> > > >
> > > > Hm, interesting idea :)
> > > > But why would you like to use tracing code? just for hooking?
> > >
> > > What I suggested before was to reuse the scripting engine and the tracepoints.
> > >
> > > I.e. the "seccomp restrictions" can be implemented via a filter expression - and the
> > > scripting engine could be generalized so that such 'sandboxing' code can make use of
> > > it.
> > >
> > > For example, if you want to restrict a process to only allow open() syscalls to fd 4
> > > (a very restrictive sandbox), it could be done via this filter expression:
> > >
> > > 'fd == 4'
> > >
> > > etc. Note that obviously the scripting engine needs to be abstracted out somewhat -
> > > but this is the basic idea, to reuse the callbacks and reuse the scripting engine
> > > for runtime filtering of syscall parameters.
> >
> > Any pointers on what is involved in this abstraction? I can work out
> > the details, but I don't know the big picture well enough to even start
> > to move forwards.....
>
> In the big picture, the filtering code is very tight to the tracing code.
> Creation, initialization, removal of filters is all made on top of the
> trace events structures (struct ftrace_event_call) because we apply and
> interpret filters on the fields of trace events, which are what we save
> in a trace.
>
> Example:
>
> If you look at the sched switch trace events, we have several fields
> like prev_comm and next_comm. These are defined in the TRACE_EVENT()
> macros calls. So when we apply a filter like "prev_comm == firefox-bin",
> we enter the filtering code with the trace_event structure for sched
> switch events and iterate through its fields to find one called
> prev_comm and then we work on top of that.
> I think you won't work with trace events, so you need to make the
> filtering code more tracing-agnostic.
>
> But I think it's quite workable and shouldn't be too hard to split that
> into a filtering backend. Many parts are already pretty standalone.
>
> Also I suspect the tracepoints are not what you need. Or may be
> they are. But as Masami said, the syscall tracepoint is called late.
> It's workable though. The other problem is that preemption is disabled
> when tracepoints are called, which is probably not what you want.
> One day I think we'll need to unify the tracepoints and notifier
> code but until then, better keep tracepoints for tracing.
>
> Now once you have the filtering code more generic, you still
> need an arch backend to map register contents and layout into syscall
> arguments name and type. On top of which you can finally use the filtering
> code. For that you can use, again, some code we use for tracing, which
> are syscalls metadata: informations generated on build time
> that have syscalls fields and type.
> And that also needs to be split up, but it's more trivial
> than the filtering part.
>
> Note for now, filtering + syscalls metadata only works on top
> of raw arguments value. Syscalls metadata don't know much
> about type semantics and won't help you to dereference
> syscall argument pointers. Only raw syscall parameter values.
> Similarly, the filtering code can't evaluate pointer dereferencing
> expression evaluation, only direct values comprehension.

Actually we have string comparison supported by the filtering code.
Still we need safe accessors (copy_from_user()) from filtering code
to use that safely on syscall parameters.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/