Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)

From: Topi Miettinen
Date: Fri Apr 15 2022 - 16:01:10 EST


On 14.4.2022 21.52, Kees Cook wrote:
On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote:
The background to this is that systemd has a configuration option called
MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
is to prevent a user task from inadvertently creating an executable
mapping that is (or was) writeable. Since such a BPF filter is stateless,
it cannot detect mappings that were previously writeable but
subsequently changed to read-only. Therefore the filter simply rejects
any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
support (Branch Target Identification), the dynamic loader cannot change
an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
For libraries, it can resort to unmapping and re-mapping but for the
main executable it does not have a file descriptor. The original bug
report is in the Red Hat bugzilla [2]; the subsequent glibc workaround
for libraries is in [3].

Right, so, the systemd filter is a big hammer solution for the kernel
not having a very easy way to provide W^X mapping protections to
userspace. There's stuff in SELinux, and there have been several
attempts[1] at other LSMs to do it too, but nothing stuck.

Given the filter, and the implementation of how to enable BTI, I see two
solutions:

- provide a way to do W^X so systemd can implement the feature differently
- provide a way to turn on BTI separate from mprotect to bypass the filter

I would agree, the latter seems like the greater hack, so I welcome
this RFC, though I think it might need to explore a bit of the feature
space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise
it risks being too narrowly implemented. For example, playing well with
JITs should be part of the design, and will likely need some kind of
ELF flags and/or "sealing" mode, and to handle the vma alias case as
Jann Horn pointed out[2].

Another interesting case, described by Ulrich Drepper in 2006, is to use
a temporary file and map it twice, once with PROT_WRITE and once with
PROT_EXEC [1]. This isn't possible if the mount flags of the file systems
also follow the W^X principle. System services (unlike user apps)
typically use neither /tmp nor /dev/shm (mounted with "rw,exec"). With
systemd, a simple file system W^X policy can be implemented for a
service, for example with NoExecPaths=/ ExecPaths=/usr
ReadOnlyPaths=/usr.

In-kernel MDWE could probably look beyond file descriptors and check
whether the mount flags of the file system containing the file being
mmap()ed agree with W^X.

The use cases for system services and user apps may differ: system
services are often compatible with maximum hardening, while user apps
may need various compatibility solutions if they use JIT, trampolines or
FFI, and may also need access to W+X file systems.

-Topi

[1] https://akkadia.org/drepper/selinux-mem.html
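
As a rough illustration of the double-mapping case described above (not
taken from Drepper's article or from the patch set), the sketch below
creates a temporary file and tries to map it once writable and once
executable. The function name try_double_mapping() and the file name
template are made up for this example. On a file system mounted noexec,
the second mmap() would be expected to fail.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a temporary file in `dir` and map it twice,
 * once with PROT_WRITE and once with PROT_EXEC.
 * Returns 0 if both mappings succeed, otherwise the errno of the step
 * that failed.
 */
int try_double_mapping(const char *dir)
{
    char path[4096];
    int n = snprintf(path, sizeof(path), "%s/wx-demo-XXXXXX", dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        return ENAMETOOLONG;

    int fd = mkstemp(path);
    if (fd < 0)
        return errno;
    unlink(path);                 /* the open descriptor keeps the file alive */

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *w = MAP_FAILED, *x = MAP_FAILED;
    int err = 0;

    if (ftruncate(fd, (off_t)page) != 0) {
        err = errno;
    } else {
        /* First view: writable, never executable. */
        w = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (w == MAP_FAILED) {
            err = errno;
        } else {
            /* Second view of the same pages: executable, never writable. */
            x = mmap(NULL, page, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);
            if (x == MAP_FAILED)
                err = errno;
        }
    }

    if (x != MAP_FAILED)
        munmap(x, page);
    if (w != MAP_FAILED)
        munmap(w, page);
    close(fd);
    return err;
}
```

Calling try_double_mapping("/tmp") on a typical desktop setup would
likely return 0, while the same call inside a service restricted with
NoExecPaths= or on a noexec mount would likely return an error such as
EPERM or EACCES. Neither mapping is ever W+X on its own, which is why a
per-mapping check alone does not catch this case.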

Add in-kernel support for such a feature as a DENY_WRITE_EXEC personality
flag, inherited on fork() and execve(). The kernel tracks a previously
writeable mapping via a new VM_WAS_WRITE flag (64-bit only
architectures). I went for a personality flag by analogy with the
READ_IMPLIES_EXEC one. However, I'm happy to change it to a prctl() if
we don't want more personality flags. A minor downside with the
personality flag is that there is no way for the user to query which
flags are supported, so in patch 3 I added an AT_FLAGS bit to advertise
this.
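
To make the difference between the stateless filter and the tracked
state concrete, here is a small user-space model. It is only a sketch of
the idea and not the code from the patches: struct toy_vma, the TOY_PROT_*
constants, both function names and the error codes returned are
assumptions for illustration. The was_write field plays the role that
VM_WAS_WRITE is described as having above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy protection bits; TOY_PROT_BTI stands in for arm64's PROT_BTI. */
#define TOY_PROT_READ  0x1
#define TOY_PROT_WRITE 0x2
#define TOY_PROT_EXEC  0x4
#define TOY_PROT_BTI   0x10

struct toy_vma {
    int prot;        /* current protection bits */
    bool was_write;  /* sticky: set once the mapping has been writable */
};

/* Stateless model: with no history available, refuse any PROT_EXEC request. */
int stateless_mprotect(struct toy_vma *vma, int prot)
{
    if (prot & TOY_PROT_EXEC)
        return -EPERM;            /* assumed error code */
    vma->prot = prot;
    return 0;
}

/* Stateful model: refuse PROT_EXEC only if the mapping is, was, or would
 * become writable. */
int stateful_mprotect(struct toy_vma *vma, int prot)
{
    /* Fold the current protection into the sticky history bit first. */
    if (vma->prot & TOY_PROT_WRITE)
        vma->was_write = true;

    if ((prot & TOY_PROT_EXEC) &&
        (vma->was_write || (prot & TOY_PROT_WRITE)))
        return -EACCES;           /* assumed error code */

    vma->prot = prot;
    if (prot & TOY_PROT_WRITE)
        vma->was_write = true;
    return 0;
}
```

In this model a text mapping that starts as TOY_PROT_READ | TOY_PROT_EXEC
can gain TOY_PROT_BTI under stateful_mprotect() but not under
stateless_mprotect(), while a data mapping that was ever writable cannot
become executable even after dropping TOY_PROT_WRITE.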

My instinct here is to use a prctl(), which maps to other kinds of modern
inherited state (like no_new_privs).
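
For reference, this is roughly how the existing no_new_privs state
behaves from user space. The sketch uses only the existing
PR_SET_NO_NEW_PRIVS and PR_GET_NO_NEW_PRIVS prctl() options; the helper
name nnp_inherited_by_child() is made up for this example, and nothing
here is an MDWE interface.

```c
#include <assert.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: set no_new_privs in the calling process, fork, and
 * report what the child observes.
 * Returns 1 if the child sees the flag set, 0 if it does not, -1 on error.
 * Note that no_new_privs cannot be cleared again once set.
 */
int nnp_inherited_by_child(void)
{
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: query the inherited state and report it via exit status. */
        int v = prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
        _exit(v == 1 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 1 : 0;
}
```

A prctl()-based MDWE switch could plausibly follow the same pattern: one
call to enable, one to query, with the state carried across fork() and
execve().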

Posting this as an RFC to start a discussion and cc'ing some of the
systemd guys and those involved in the earlier thread around the glibc
workaround for dynamic libraries [4]. Before thinking of upstreaming
this we'd need the systemd folk to buy into replacing the MDWE SECCOMP
BPF filter with the in-kernel one.

Thanks,

Catalin

[1] https://www.freedesktop.org/software/systemd/man/systemd.exec.html#MemoryDenyWriteExecute=
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1888842
[3] https://sourceware.org/bugzilla/show_bug.cgi?id=26831
[4] https://lore.kernel.org/r/cover.1604393169.git.szabolcs.nagy@xxxxxxx

So, yes, let's do it. It's long long overdue in the kernel. :)

-Kees

[1] https://github.com/KSPP/linux/issues/32
[2] https://github.com/KSPP/linux/issues/32#issuecomment-1084859611


Catalin Marinas (4):
  mm: Track previously writeable vma permission
  mm, personality: Implement memory-deny-write-execute as a personality
    flag
  fs/binfmt_elf: Tell user-space about the DENY_WRITE_EXEC personality
    flag
  arm64: Select ARCH_ENABLE_DENY_WRITE_EXEC

 arch/arm64/Kconfig               |  1 +
 fs/binfmt_elf.c                  |  2 ++
 include/linux/mm.h               |  6 ++++++
 include/linux/mman.h             | 18 +++++++++++++++++-
 include/uapi/linux/binfmts.h     |  4 ++++
 include/uapi/linux/personality.h |  1 +
 mm/Kconfig                       |  4 ++++
 mm/mmap.c                        |  3 +++
 mm/mprotect.c                    |  5 +++++
 9 files changed, 43 insertions(+), 1 deletion(-)