[PATCH bpf] bpf, audit, lockdown: Fix bogus SELinux lockdown permission checks

From: Daniel Borkmann
Date: Fri May 28 2021 - 05:16:31 EST


Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
added an implementation of the locked_down LSM hook to SELinux, with the aim
to restrict which domains are allowed to perform operations that would breach
lockdown. This is indirectly also getting audit subsystem involved to report
events. The latter is problematic, as reported by Ondrej and Serhei, since it
can bring down the whole system via audit:

i) The audit events that are triggered due to calls to security_locked_down()
can OOM kill a machine, see below details [0].

ii) It seems to be causing a deadlock via slow_avc_audit() -> audit_log_end()
when presumingly trying to wake up kauditd [1].

Fix both at the same time by taking a completely different approach, that is,
move the check into the program verification phase where we actually retrieve
the func proto. This also reliably gets the task (current) that is trying to
install the tracing program, e.g. bpftrace/bcc/perf/systemtap/etc, and it also
fixes the OOM since we're moving this out of the BPF helpers which can be called
millions of times per second.

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1955585, Jakub Hrozek says:

I starting seeing this with F-34. When I run a container that is traced with
BPF to record the syscalls it is doing, auditd is flooded with messages like:

type=AVC msg=audit(1619784520.593:282387): avc: denied { confidentiality }
for pid=476 comm="auditd" lockdown_reason="use of bpf to read kernel RAM"
scontext=system_u:system_r:auditd_t:s0 tcontext=system_u:system_r:auditd_t:s0
tclass=lockdown permissive=0

This seems to be leading to auditd running out of space in the backlog buffer
and eventually OOMs the machine.

[...]
auditd running at 99% CPU presumably processing all the messages, eventually I get:
Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152579 > audit_backlog_limit=64
Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152626 > audit_backlog_limit=64
Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152694 > audit_backlog_limit=64
Apr 30 12:20:42 fedora kernel: audit: audit_lost=6878426 audit_rate_limit=0 audit_backlog_limit=64
Apr 30 12:20:45 fedora kernel: oci-seccomp-bpf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000
Apr 30 12:20:45 fedora kernel: CPU: 0 PID: 13284 Comm: oci-seccomp-bpf Not tainted 5.11.12-300.fc34.x86_64 #1
Apr 30 12:20:45 fedora kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
[...]

[1] https://lore.kernel.org/linux-audit/CANYvDQN7H5tVp47fbYcRasv4XF07eUbsDwT_eDCHXJUj43J7jQ@xxxxxxxxxxxxxx/,
Serhei Makarov says:

Upstream kernel 5.11.0-rc7 and later was found to deadlock during a
bpf_probe_read_compat() call within a sched_switch tracepoint. The problem
is reproducible with the reg_alloc3 testcase from SystemTap's BPF backend
testsuite on x86_64 as well as the runqlat,runqslower tools from bcc on
ppc64le. Example stack trace:

[...]
[ 730.868702] stack backtrace:
[ 730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted, 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1
[ 730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
[ 730.873278] Call Trace:
[ 730.873770] dump_stack+0x7f/0xa1
[ 730.874433] check_noncircular+0xdf/0x100
[ 730.875232] __lock_acquire+0x1202/0x1e10
[ 730.876031] ? __lock_acquire+0xfc0/0x1e10
[ 730.876844] lock_acquire+0xc2/0x3a0
[ 730.877551] ? __wake_up_common_lock+0x52/0x90
[ 730.878434] ? lock_acquire+0xc2/0x3a0
[ 730.879186] ? lock_is_held_type+0xa7/0x120
[ 730.880044] ? skb_queue_tail+0x1b/0x50
[ 730.880800] _raw_spin_lock_irqsave+0x4d/0x90
[ 730.881656] ? __wake_up_common_lock+0x52/0x90
[ 730.882532] __wake_up_common_lock+0x52/0x90
[ 730.883375] audit_log_end+0x5b/0x100
[ 730.884104] slow_avc_audit+0x69/0x90
[ 730.884836] avc_has_perm+0x8b/0xb0
[ 730.885532] selinux_lockdown+0xa5/0xd0
[ 730.886297] security_locked_down+0x20/0x40
[ 730.887133] bpf_probe_read_compat+0x66/0xd0
[ 730.887983] bpf_prog_250599c5469ac7b5+0x10f/0x820
[ 730.888917] trace_call_bpf+0xe9/0x240
[ 730.889672] perf_trace_run_bpf_submit+0x4d/0xc0
[ 730.890579] perf_trace_sched_switch+0x142/0x180
[ 730.891485] ? __schedule+0x6d8/0xb20
[ 730.892209] __schedule+0x6d8/0xb20
[ 730.892899] schedule+0x5b/0xc0
[ 730.893522] exit_to_user_mode_prepare+0x11d/0x240
[ 730.894457] syscall_exit_to_user_mode+0x27/0x70
[ 730.895361] entry_SYSCALL_64_after_hwframe+0x44/0xae
[...]

Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
Reported-by: Ondrej Mosnacek <omosnace@xxxxxxxxxx>
Reported-by: Jakub Hrozek <jhrozek@xxxxxxxxxx>
Reported-by: Serhei Makarov <smakarov@xxxxxxxxxx>
Signed-off-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
Cc: Stephen Smalley <sds@xxxxxxxxxxxxx>
Cc: Jerome Marchand <jmarchan@xxxxxxxxxx>
Cc: Frank Eigler <fche@xxxxxxxxxx>
Cc: Jiri Olsa <jolsa@xxxxxxxxxx>
Cc: Paul Moore <paul@xxxxxxxxxxxxxx>
---
kernel/bpf/helpers.c | 6 ++++--
kernel/trace/bpf_trace.c | 36 +++++++++++++-----------------------
2 files changed, 17 insertions(+), 25 deletions(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 73443498d88f..6f6e090c5310 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1069,11 +1069,13 @@ bpf_base_func_proto(enum bpf_func_id func_id)
case BPF_FUNC_probe_read_user:
return &bpf_probe_read_user_proto;
case BPF_FUNC_probe_read_kernel:
- return &bpf_probe_read_kernel_proto;
+ return security_locked_down(LOCKDOWN_BPF_READ) < 0 ?
+ NULL : &bpf_probe_read_kernel_proto;
case BPF_FUNC_probe_read_user_str:
return &bpf_probe_read_user_str_proto;
case BPF_FUNC_probe_read_kernel_str:
- return &bpf_probe_read_kernel_str_proto;
+ return security_locked_down(LOCKDOWN_BPF_READ) < 0 ?
+ NULL : &bpf_probe_read_kernel_str_proto;
case BPF_FUNC_snprintf_btf:
return &bpf_snprintf_btf_proto;
case BPF_FUNC_snprintf:
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d2d7cf6cfe83..3df43d89d642 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -215,16 +215,10 @@ const struct bpf_func_proto bpf_probe_read_user_str_proto = {
static __always_inline int
bpf_probe_read_kernel_common(void *dst, u32 size, const void *unsafe_ptr)
{
- int ret = security_locked_down(LOCKDOWN_BPF_READ);
+ int ret = copy_from_kernel_nofault(dst, unsafe_ptr, size);

if (unlikely(ret < 0))
- goto fail;
- ret = copy_from_kernel_nofault(dst, unsafe_ptr, size);
- if (unlikely(ret < 0))
- goto fail;
- return ret;
-fail:
- memset(dst, 0, size);
+ memset(dst, 0, size);
return ret;
}

@@ -246,11 +240,6 @@ const struct bpf_func_proto bpf_probe_read_kernel_proto = {
static __always_inline int
bpf_probe_read_kernel_str_common(void *dst, u32 size, const void *unsafe_ptr)
{
- int ret = security_locked_down(LOCKDOWN_BPF_READ);
-
- if (unlikely(ret < 0))
- goto fail;
-
/*
* The strncpy_from_kernel_nofault() call will likely not fill the
* entire buffer, but that's okay in this circumstance as we're probing
@@ -260,13 +249,10 @@ bpf_probe_read_kernel_str_common(void *dst, u32 size, const void *unsafe_ptr)
* code altogether don't copy garbage; otherwise length of string
* is returned that can be used for bpf_perf_event_output() et al.
*/
- ret = strncpy_from_kernel_nofault(dst, unsafe_ptr, size);
- if (unlikely(ret < 0))
- goto fail;
+ int ret = strncpy_from_kernel_nofault(dst, unsafe_ptr, size);

- return ret;
-fail:
- memset(dst, 0, size);
+ if (unlikely(ret < 0))
+ memset(dst, 0, size);
return ret;
}

@@ -1011,16 +997,20 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
case BPF_FUNC_probe_read_user:
return &bpf_probe_read_user_proto;
case BPF_FUNC_probe_read_kernel:
- return &bpf_probe_read_kernel_proto;
+ return security_locked_down(LOCKDOWN_BPF_READ) < 0 ?
+ NULL : &bpf_probe_read_kernel_proto;
case BPF_FUNC_probe_read_user_str:
return &bpf_probe_read_user_str_proto;
case BPF_FUNC_probe_read_kernel_str:
- return &bpf_probe_read_kernel_str_proto;
+ return security_locked_down(LOCKDOWN_BPF_READ) < 0 ?
+ NULL : &bpf_probe_read_kernel_str_proto;
#ifdef CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
case BPF_FUNC_probe_read:
- return &bpf_probe_read_compat_proto;
+ return security_locked_down(LOCKDOWN_BPF_READ) < 0 ?
+ NULL : &bpf_probe_read_compat_proto;
case BPF_FUNC_probe_read_str:
- return &bpf_probe_read_compat_str_proto;
+ return security_locked_down(LOCKDOWN_BPF_READ) < 0 ?
+ NULL : &bpf_probe_read_compat_str_proto;
#endif
#ifdef CONFIG_CGROUPS
case BPF_FUNC_get_current_cgroup_id:
--
2.27.0


--------------629630DF919F6F51E0188473--