Re: [RFC 0/4] RFC: Add Checmate, BPF-driven minor LSM

From: Kees Cook
Date: Mon Aug 08 2016 - 20:22:45 EST


On Mon, Aug 8, 2016 at 5:00 PM, Sargun Dhillon <sargun@xxxxxxxxx> wrote:
> On Mon, Aug 08, 2016 at 04:44:02PM -0700, Kees Cook wrote:
>> On Thu, Aug 4, 2016 at 12:11 AM, Sargun Dhillon <sargun@xxxxxxxxx> wrote:
>> > I distributed this patchset to linux-security-module@xxxxxxxxxxxxxxx earlier,
>> > but based on the fact that the archive is down, and this is a fairly
>> > broad-sweeping proposal, I figured I'd grow the audience a little bit. Sorry
>> > if you received this multiple times.
>> >
>> > I've begun building out the skeleton of a Linux Security Module, and I'd like to
>> > get feedback on it. It's a skeleton, and I've only populated a few hooks, so I'm
>> > mostly looking for input on the general proposal, interest, and design. It's a
>> > minor LSM. My particular use case is one in which containers are being
>> > dynamically deployed to machines by internal developers in a different group.
>> > The point of Checmate is to act as an extensible bed for _safe_, complex
>> > security policies. It's nice to enable dynamic security policies that can be
>> > defined in C, and change as necessary, without ever having to patch, or rebuild
>> > the kernel.
>> >
>> > For many of these containers, the security policies can be fairly nuanced. One
>> > particular one to take into account is network security. Oftentimes,
>> > administrators want to prevent ingress, and egress connectivity except from a
>> > few select IPs. Egress filtering can be managed using net_cls, but without
>> > modifying running software, it's non-trivial to attach a filter to all sockets
>> > being created within a container. The inet_conn_request, socket_recvmsg,
>> > socket_sock_rcv_skb hooks make this trivial to implement.
>> >
>> > Other times, containers need to be throttled in places where there's not really
>> > a good place to impose that policy for software which isn't built in-house. If
>> > one wants to limit file creations/sec, or reject I/O under certain
>> > characteristics, there's not a great place to do it now. This gives engineers a
>> > mechanism to write those policies.
>> >
>> > This same flexibility can be used to take existing programs and enable safe BPF
>> > helpers to modify memory to allow rules to pass. One example that I prototyped
>> > was Docker's port mapping, which has an overhead (DNAT), and there's some loss
>> > of fidelity in the BSD Socket API to identify what's going on. Instead, we can
>> > just rewrite the port in a bind, based upon some data in a BPF map, and a cgroup
>> > match.
>> >
>> > I can actually see other minor security modules being implemented in Checmate,
>> > for example, Yama, or the recently proposed Hardchroot could be reimplemented in
>> > BPF. Potentially, they could even be API compatible.
>> >
>> > Although, at first, much of this sounds like seccomp, it's quite different. For
>> > one, what we can do in the security hooks is more complex (access to kernel
>> > pointers). The other side of this is we can have effects on a system-wide,
>> > or cgroup level. This also circumvents the need for CRIU-friendly policies.
>> >
>> > Lastly, the flexibility of this mechanism allows for prevention of security
>> > vulnerabilities which are often complex in nature and require the interaction
>> > of multiple hooks (CVE-2014-9717 is a good example), and although ksplice,
>> > and livepatch exist, they're not always easy to use, as compared to loading
>> > a single bpf program across all kernels.
>> >
>> > The user-facing API is exposed via prctl as it's meant to be very simple (at
>> > least the kernel components). It only has three operations. For a given security
>> > hook, you can attach a BPF program to it, which will add it to the set of
>> > programs that are executed when the hook is hit. You can reset a hook,
>> > which removes all programs associated with a given hook, and you can set a
>> > deny_reset flag on a hook to prevent anyone from resetting it. It's likely that
>> > an individual would want to set this in any production use case.
>>
>> One fairly serious problem that seccomp had to overcome was dealing
>> with exec+setuid in the face of an attacker. The main example is "what
>> if we refuse to allow a program to drop privileges via a filter rule?"
>> For seccomp, no-new-privs was introduced for non-root users of
>> seccomp. Programmatic syscall (or LSM) filters need to deal with this,
>> and it's a bit ungainly. :)
>>
> Couldn't someone do the same with SELinux, or Apparmor?

The "big" LSMs aren't defined programmatically by non-root users, so
there is no risk of elevating privileges (they are already root).
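
For reference, a minimal userspace sketch of the no-new-privs step that
seccomp requires from non-root callers. The prctl options are the real
ones; the two helper names are illustrative only.

```c
#include <sys/prctl.h>

/*
 * Set no_new_privs for the calling thread.
 * Returns 0 on success, -1 on error. Once set, the bit cannot be
 * cleared, and it is inherited across fork and execve, so a later
 * exec of a setuid binary will not gain privileges.
 */
static int set_no_new_privs(void)
{
    return prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
}

/* Returns 1 if no_new_privs is set, 0 if it is not, -1 on error. */
static int get_no_new_privs(void)
{
    return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
}
```

An unprivileged process would call set_no_new_privs() before installing
its filter. Any programmatic LSM filter open to non-root users would need
an equivalent gate.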

>> Also, if you have a prctl API that already has 3 operations, you might
>> want to use a new syscall anyway. :)
>>
> Looking at other LSMs, they appear to expose their API via a virtual filesystem,
> or prctl. I followed the model of YAMA. I think there may be two more operations
> (detach program, and mark a hook as append-only / read-only / disabled). It
> seems like overkill to implement my own syscall.
>
>> > On the BPF side of it, all that's involved in the work in progress is to
>> > move some of the tracing helpers into the shared helpers. For example,
>> > it's very valuable to have access to current when enforcing a hook.
>> > BPF programs also have access to maps, which somewhat works around
>> > the need for security blobs in some cases.
>>
>> Just from a compatibility perspective, doesn't this end up exposing
>> kernel structures to userspace? What happens when the structures
>> change?
>>
> I wouldn't consider BPF userspace. Although it executes in the kernel, I
> wouldn't really consider it kernel space either as it's restricted to safe
> operations.
>
> As far as addressing this issue -- A significant part of the LSM hooks API is
> tied to the syscall, giving stability to those datastructures.

Just for the sake of clarity: they're tied to internal callers,
usually near syscall entry points; LSMs can't filter syscalls.

> If you look at
> the API itself a significant part of it has been untouched for 3+ years, and
> it's been even longer since there has been an API breaking change. On the other
> hand, the developer has the ability to perform arbitrary reads of kernel space
> using bpf_probe_read.

What's hilarious is that the syscall API is unchanged, but the LSM API
keeps shifting around a little at a time. So, same issues as with
kprobes, etc, as you mention.

FWIW, I'd much rather have an LSM that reacts to seccomp filters and
maps syscall arguments to in-kernel data structures that can be
examined during an LSM hook. Then we'd have both a stable API and a
programmatic filtering of data structures.

> This is addressed in the 4th patch, which requires that the BPF program be compiled
> against the current kernel version. The userspace policy orchestration code
> should recompile the BPF program on the fly matching the current kernel's
> datastructures. There's a certain level of rope here given to the operator,
> and it's expected that they use it carefully. Similarly, folks could load
> kprobes, kmods, and other programs that have the same issues.

Right, perhaps I misunderstood the privilege level you were targeting.
:) Did you intend for unprivileged users to use this, or just the
init-ns root user?
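
On the recompile-per-kernel point, here is a rough userspace sketch of
the kind of version comparison an orchestrator might run before loading
a program. It assumes the check is based on the KERNEL_VERSION(a, b, c)
encoding, as with the kern_version check for kprobe BPF programs. I have
not confirmed that the 4th patch works this way, and the function names
are hypothetical.

```c
#include <stdio.h>

/* Same encoding as the kernel's KERNEL_VERSION(a, b, c) macro. */
static unsigned int version_code(unsigned int a, unsigned int b,
                                 unsigned int c)
{
    return (a << 16) + (b << 8) + c;
}

/*
 * Parse a release string such as "4.7.0-rc1" (the format uname(2)
 * reports) into a version code. A missing patch level is treated as 0.
 * Returns 0 if the string cannot be parsed.
 */
static unsigned int release_to_code(const char *release)
{
    unsigned int a = 0, b = 0, c = 0;

    if (sscanf(release, "%u.%u.%u", &a, &b, &c) < 2)
        return 0;
    return version_code(a, b, c);
}

/*
 * Hypothetical pre-load check: was the program built for the kernel
 * that is currently running? Returns 1 on a match, 0 otherwise.
 */
static int built_for_kernel(unsigned int prog_code,
                            const char *running_release)
{
    return prog_code != 0 &&
           prog_code == release_to_code(running_release);
}
```

When the check fails, the orchestrator would rebuild the program against
the running kernel's headers and try again. Note that a matching version
code only says the headers matched. It does not make the structure
layouts a stable ABI.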

>
>> And from a security perspective, programmatic examination of kernel
>> structures means you can trivially leak kernel memory locations and
>> contents. Resisting these sorts of leaks needs to be addressed too.
>>
> I'm unsure that unintentional exfiltration of kernel memory locations is
> possible. You may be able to do it via a BPF map or similar (logging). What
> kinds of attacks are you thinking about specifically?

Well, I was looking at the example you sent, and it seemed like it had
raw access to kernel pointers, which means it could be programmed to
leak the values.

>> This looks like a subset of kprobes but available to non-root users,
>> which looks rather scary to me at first glance. :)
> You need CAP_SYS_ADMIN to touch this. These folks are the same ones that control
> SELinux, and Apparmor.

Ah-ha, missed that. Still, we want to keep a bright line between uid-0
and ring-0, and to make sure this is just init-ns CAP_SYS_ADMIN.
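
To make the capability half of that concrete, a small userspace sketch
that tests the CAP_SYS_ADMIN bit in an effective-capability mask (bit 21
in <linux/capability.h>) and reads the mask from the CapEff line of
/proc/self/status. The helper names are illustrative. The sketch does
not tell you which user namespace the process is in. On the kernel side,
capable() checks against the initial user namespace, while ns_capable()
checks against whichever namespace it is given.

```c
#include <stdint.h>
#include <stdio.h>

#define CAP_SYS_ADMIN_BIT 21 /* value of CAP_SYS_ADMIN */

/* Returns 1 if the mask carries CAP_SYS_ADMIN, 0 otherwise. */
static int mask_has_sys_admin(uint64_t cap_eff)
{
    return (int)((cap_eff >> CAP_SYS_ADMIN_BIT) & 1);
}

/*
 * Read the effective capability mask of the calling process from
 * /proc/self/status. Returns 0 on success, -1 on failure.
 */
static int read_cap_eff(uint64_t *out)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    int ret = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        unsigned long long v;

        /* The line looks like "CapEff:\t0000003fffffffff". */
        if (sscanf(line, "CapEff: %llx", &v) == 1) {
            *out = (uint64_t)v;
            ret = 0;
            break;
        }
    }
    fclose(f);
    return ret;
}
```

A root user inside a container's user namespace can pass this bit test
and still fail a capable(CAP_SYS_ADMIN) check in the kernel, which is
the distinction being drawn above.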

-Kees

>
>>
>> -Kees
>>
>> >
>> > I would love to know what y'all think.
>> >
>> > Sargun Dhillon (4):
>> > bpf: move tracing helpers to shared helpers
>> > bpf, security: Add Checmate
>> > security/checmate: Add Checmate sample
>> > bpf: Restrict Checmate bpf programs to current kernel ABI
>> >
>> > include/linux/bpf.h | 2 +
>> > include/linux/checmate.h | 38 +++++
>> > include/uapi/linux/Kbuild | 1 +
>> > include/uapi/linux/bpf.h | 1 +
>> > include/uapi/linux/checmate.h | 65 +++++++++
>> > include/uapi/linux/prctl.h | 3 +
>> > kernel/bpf/helpers.c | 34 +++++
>> > kernel/bpf/syscall.c | 2 +-
>> > kernel/trace/bpf_trace.c | 33 -----
>> > samples/bpf/Makefile | 4 +
>> > samples/bpf/bpf_load.c | 11 +-
>> > samples/bpf/checmate1_kern.c | 28 ++++
>> > samples/bpf/checmate1_user.c | 54 +++++++
>> > security/Kconfig | 1 +
>> > security/Makefile | 2 +
>> > security/checmate/Kconfig | 6 +
>> > security/checmate/Makefile | 3 +
>> > security/checmate/checmate_bpf.c | 67 +++++++++
>> > security/checmate/checmate_lsm.c | 304 +++++++++++++++++++++++++++++++++++++++
>> > 19 files changed, 622 insertions(+), 37 deletions(-)
>> > create mode 100644 include/linux/checmate.h
>> > create mode 100644 include/uapi/linux/checmate.h
>> > create mode 100644 samples/bpf/checmate1_kern.c
>> > create mode 100644 samples/bpf/checmate1_user.c
>> > create mode 100644 security/checmate/Kconfig
>> > create mode 100644 security/checmate/Makefile
>> > create mode 100644 security/checmate/checmate_bpf.c
>> > create mode 100644 security/checmate/checmate_lsm.c
>> >
>> > --
>> > 2.7.4
>> >
>>
>>
>>
>> --
>> Kees Cook
>> Nexus Security



--
Kees Cook
Nexus Security