Re: [PATCH 4/8] seccomp: Implement constant action bitmaps

From: Jann Horn
Date: Tue Jun 16 2020 - 08:15:25 EST


On Tue, Jun 16, 2020 at 9:49 AM Kees Cook <keescook@xxxxxxxxxxxx> wrote:
> One of the most common pain points with seccomp filters has been dealing
> with the overhead of processing the filters, especially for "always allow"
> or "always reject" cases. While BPF is extremely fast[1], it will always
> have overhead associated with it. Additionally, due to seccomp's design,
> filters are layered, which means processing time goes up as the number
> of filters attached goes up.
>
> In the past, efforts have been focused on making filter execution complete
> in a shorter amount of time. For example, filters were rewritten from
> using linear if/then/else syscall search to using balanced binary trees,
> or moving tests for syscalls common to the process's workload to the
> front of the filter. However, there are limits to this, especially when
> some processes are dealing with tens of filters[2], or when some
> architectures have a less efficient BPF engine[3].
>
> The most common use of seccomp, constructing syscall block/allow-lists,
> where syscalls that are always allowed or always rejected (without regard
> to any arguments), also tends to produce the most pathological runtime
> problems, in that a large number of syscall checks in the filter need
> to be performed to come to a determination.
>
> In order to optimize these cases from O(n) to O(1), seccomp can
> use bitmaps to immediately determine the desired action. A critical
> observation in the prior paragraph bears repeating: the common case for
> syscall tests do not check arguments. For any given filter, there is a
> constant mapping from the combination of architecture and syscall to the
> seccomp action result. (For kernels/architectures without CONFIG_COMPAT,
> there is a single architecture.). As such, it is possible to construct
> a mapping of arch/syscall to action, which can be updated as new filters
> are attached to a process.
>
> In order to build this mapping at filter attach time, each filter is
> executed for every syscall (under each possible architecture), and
> checked for any accesses of struct seccomp_data that are not the "arch"
> nor "nr" (syscall) members. If only "arch" and "nr" are examined, then
> there is a constant mapping for that syscall, and bitmaps can be updated
> accordingly. If any accesses happen outside of those struct members,
> seccomp must not bypass filter execution for that syscall, since program
> state will be used to determine filter action result.
>
> During syscall action probing, in order to determine whether other members
> of struct seccomp_data are being accessed during a filter execution,
> the struct is placed across a page boundary with the "arch" and "nr"
> members in the first page, and everything else in the second page. The
> "page accessed" flag is cleared in the second page's PTE, and the filter
> is run. If the "page accessed" flag appears as set after running the
> filter, we can determine that the filter looked beyond the "arch" and
> "nr" members, and exclude that syscall from the constant action bitmaps.
>
> For architectures to support this optimization, they must declare
> their architectures for seccomp to see (via SECCOMP_ARCH and
> SECCOMP_ARCH_COMPAT macros), and provide a way to perform efficient
> CPU-local kernel TLB flushes (via local_flush_tlb_kernel_range()),
> and then set HAVE_ARCH_SECCOMP_BITMAP in their Kconfig.

Wouldn't it be simpler to use a function that can run a subset of
seccomp cBPF and bails out on anything that indicates that a syscall's
handling is complex or on instructions it doesn't understand? For
syscalls that have a fixed policy, a typical seccomp filter doesn't
even use any of the BPF_ALU ops, the scratch space, or the X register;
it just uses something like the following set of operations, which is
easy to emulate without much code:

BPF_LD | BPF_W | BPF_ABS
BPF_JMP | BPF_JEQ | BPF_K
BPF_JMP | BPF_JGE | BPF_K
BPF_JMP | BPF_JGT | BPF_K
BPF_JMP | BPF_JA
BPF_RET | BPF_K

Something like (completely untested):

/*
* Try to statically determine whether @filter will always return a fixed result
* when run for syscall @nr under architecture @arch.
* Returns true if the result could be determined; if so, the result will be
* stored in @action.
*/
static bool seccomp_check_syscall(struct sock_filter *filter, unsigned int arch,
unsigned int nr, unsigned int *action)
{
int pc;
unsigned int reg_value = 0;

for (pc = 0; 1; pc++) {
struct sock_filter *insn = &filter[pc];
u16 code = insn->code;
u32 k = insn->k;

switch (code) {
case BPF_LD | BPF_W | BPF_ABS:
if (k == offsetof(struct seccomp_data, nr)) {
reg_value = nr;
} else if (k == offsetof(struct seccomp_data, arch)) {
reg_value = arch;
} else {
return false; /* can't optimize (non-constant value load) */
}
break;
case BPF_RET | BPF_K:
*action = insn->k;
return true; /* success: reached return with constant values only */
case BPF_JMP | BPF_JA:
pc += insn->k;
break;
case BPF_JMP | BPF_JEQ | BPF_K:
case BPF_JMP | BPF_JGE | BPF_K:
case BPF_JMP | BPF_JGT | BPF_K:
default:
if (BPF_CLASS(code) == BPF_JMP && BPF_SRC(code) == BPF_K) {
u16 op = BPF_OP(code);
bool op_res;

switch (op) {
case BPF_JEQ:
op_res = reg_value == k;
break;
case BPF_JGE:
op_res = reg_value >= k;
break;
case BPF_JGT:
op_res = reg_value > k;
break;
default:
return false; /* can't optimize (unknown insn) */
}

pc += op_res ? insn->jt : insn->jf;
break;
}
return false; /* can't optimize (unknown insn) */
}
}
}

That way, you won't need any of this complicated architecture-specific stuff.