Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas

From: Andy Lutomirski
Date: Sun May 31 2020 - 21:51:48 EST

> On May 31, 2020, at 4:50 PM, Brendan Shanks <bshanks@xxxxxxxxxxxxxxx> wrote:
>
>> On May 31, 2020, at 11:57 AM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>>
>> Using SECCOMP_RET_USER_NOTIF is likely to be considerably more
>> expensive than my scheme. On a non-PTI system, my approach will add a
>> few tens of ns to each syscall. On a PTI system, it will be worse.
>> But using any kind of notifier for all syscalls will cause a context
>> switch to a different user program for each syscall, and that will be
>> much slower.
>
> There's also no way (at least to my understanding) to modify register state from SECCOMP_RET_USER_NOTIF, which is how the existing -staging SIGSYS handler works:
>
> <https://github.com/wine-staging/wine-staging/blob/master/patches/ntdll-Syscall_Emulation/0001-ntdll-Support-x86_64-syscall-emulation.patch#L62>
>
>> I think that the implementation may well want to live in seccomp, but
>> doing this as a seccomp filter isn't quite right. It's not a security
>> thing -- it's an emulation thing. Seccomp is all about making
>> inescapable sandboxes, but that's not what you're doing at all, and
>> the fact that seccomp filters are preserved across execve() sounds
>> like it'll be annoying for you.
>
> Definitely. Regardless of what approach is taken, we don't want it to persist across execve.
>
>> What if there was a special filter type that ran a BPF program on each
>> syscall, and the program was allowed to access user memory to make its
>> decisions, e.g. to look at some list of memory addresses. But this
>> would explicitly *not* be a security feature -- execve() would remove
>> the filter, and the filter's outcome would be one of redirecting
>> execution or allowing the syscall. If the "allow" outcome occurs,
>> then regular seccomp filters run. Obviously the exact semantics here
>> would need some care.
>
> Although if that's running a BPF filter on every syscall, wouldn't it also incur the ~10% overhead that Paul and Gabriel have seen with existing seccomp?
>
>

Unlikely. Some benchmarking is needed, but the seccomp ptrace overhead is likely *huge* compared to the overhead of just a filter.
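For anyone who wants to measure the filter-only cost, here is a rough sketch of a microbenchmark. It is an illustration under assumptions, not a careful benchmark: the helper names bench_syscall_ns() and install_allow_all_filter() are made up for this example, getppid() stands in for an "empty" syscall, and the one-instruction allow-all filter is the simplest possible program rather than anything Wine would actually install. Call bench_syscall_ns() once before and once after install_allow_all_filter() and compare the two averages.

```c
#define _GNU_SOURCE
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: average wall-clock cost of one getppid()
 * syscall, in nanoseconds, over the given number of iterations. */
static double bench_syscall_ns(long iterations)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        syscall(SYS_getppid);   /* raw syscall, so libc cannot cache it */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9 +
                        (end.tv_nsec - start.tv_nsec);
    return elapsed_ns / iterations;
}

/* Hypothetical helper: install a one-instruction seccomp filter that
 * allows every syscall. Returns 0 on success, -1 on failure. The filter
 * cannot be removed again for the lifetime of the process. */
static int install_allow_all_filter(void)
{
    struct sock_filter insns[] = {
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(insns) / sizeof(insns[0]),
        .filter = insns,
    };

    /* Needed so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

The difference between the two averages approximates the per-syscall cost of running a trivial filter. It says nothing about the ptrace or user-notification paths, which would need a separate tracer or supervisor process to measure.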

As wild-guess numbers on made-up modern hardware, cache hot:

Empty syscall: 50ns, or 300ns with PTI

Empty syscall accepted by simple seccomp filter: 10ns more than an empty syscall without seccomp

Seccomp ptrace round trip: 6 us. Worse with PTI.

Seccomp user notif round trip: 4 us

Syscall hypothetically redirected back to same process: about the same as an empty filtered accepted syscall, plus however long it takes to run the handler. Add 900ns if using SIGSYS instead of plain redirection. Add an extra 500ns on current kernels because signal delivery sucks, but I can fix this.

Take these numbers with a huge grain of salt. But the point is that the BPF part is the least of your worries.
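To put the guesses above side by side, here is a small sketch that expresses each path as a multiple of the 50ns non-PTI empty syscall. The inputs are the same wild guesses, not measurements. Two things are assumptions of this sketch rather than anything stated above: the round-trip figures are treated as cost added on top of an empty syscall, and the SIGSYS row is the sum 10 + 900 + 500 ns with the handler's own run time left out. The names cost_guess, overhead_ratio() and print_overheads() are made up for the example.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record: one syscall path and its guessed added cost. */
struct cost_guess {
    const char *path;
    double added_ns;   /* guessed cost on top of an empty syscall */
};

/* Added cost expressed as a multiple of the empty-syscall baseline. */
static double overhead_ratio(double added_ns, double baseline_ns)
{
    return added_ns / baseline_ns;
}

/* Print each guess next to its ratio against the non-PTI baseline. */
static void print_overheads(void)
{
    const double baseline_ns = 50.0;   /* empty syscall, no PTI */
    const struct cost_guess guesses[] = {
        { "simple seccomp filter",           10.0 },
        { "SIGSYS redirect, current kernel", 10.0 + 900.0 + 500.0 },
        { "seccomp user notif round trip",   4000.0 },
        { "seccomp ptrace round trip",       6000.0 },
    };

    for (size_t i = 0; i < sizeof(guesses) / sizeof(guesses[0]); i++)
        printf("%-34s +%6.0f ns  (%.1fx baseline)\n",
               guesses[i].path, guesses[i].added_ns,
               overhead_ratio(guesses[i].added_ns, baseline_ns));
}
```

With these inputs the filter adds a fifth of a baseline syscall, while the user-notification and ptrace round trips add 80 and 120 baseline syscalls respectively, which is the gap the numbers above are meant to show.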