Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)

From: Topi Miettinen
Date: Wed Apr 13 2022 - 14:39:48 EST


On 13.4.2022 16.49, Catalin Marinas wrote:
> Hi,
>
> The background to this is that systemd has a configuration option called
> MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
> is to prevent a user task from inadvertently creating an executable
> mapping that is (or was) writeable. Since such a BPF filter is stateless,
> it cannot detect mappings that were previously writeable but
> subsequently changed to read-only. Therefore the filter simply rejects
> any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
> support (Branch Target Identification), the dynamic loader cannot change
> an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
> For libraries, it can resort to unmapping and re-mapping but for the
> main executable it does not have a file descriptor. The original bug
> report is in the Red Hat bugzilla [2], and the subsequent glibc
> workaround for libraries is in [3].

> Add in-kernel support for such a feature as a DENY_WRITE_EXEC personality
> flag, inherited on fork() and execve(). The kernel tracks a previously
> writeable mapping via a new VM_WAS_WRITE flag (64-bit architectures
> only). I went for a personality flag by analogy with the
> READ_IMPLIES_EXEC one. However, I'm happy to change it to a prctl() if
> we don't want more personality flags. A minor downside with the
> personality flag is that there is no way for the user to query which
> flags are supported, so in patch 3 I added an AT_FLAGS bit to advertise
> this.
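
To illustrate the querying limitation: personality(0xffffffff) returns
the current persona without changing it, so a task can see which flags
are set, but not which flags the running kernel supports. A minimal
sketch using the existing READ_IMPLIES_EXEC flag follows. The helper
names are made up, and the proposed DENY_WRITE_EXEC flag is
deliberately not used because its value is not in any released header.

```c
#include <sys/personality.h>

/* Return the current persona without modifying it, or -1 on error. */
int current_persona(void)
{
	return personality(0xffffffff);
}

/* Return 1 if READ_IMPLIES_EXEC is set in the given persona, else 0. */
int persona_has_read_implies_exec(int persona)
{
	return (persona & READ_IMPLIES_EXEC) != 0;
}
```

A flag that reads back as clear may be either unset or unknown to the
kernel, which is the gap the AT_FLAGS bit is meant to close.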

With systemd there's a BPF construct to block personality changes (LockPersonality=yes), but I think a prctl() would be easier to lock down irrevocably.

Requiring or implying NoNewPrivileges could prevent nasty surprises from set-uid Python programs which happen to use FFI.
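
For reference, this is how a launcher sets no_new_privs today with the
existing prctl() interface. It is only a sketch with made-up helper
names. Whether the new feature should require or imply the bit is the
open question here, and nothing below is taken from the patches.

```c
#include <sys/prctl.h>

/*
 * Set no_new_privs for the calling thread. Once set it cannot be
 * cleared, and it is inherited across fork() and execve().
 * Returns 0 on success, -1 with errno set on failure.
 */
int set_no_new_privs(void)
{
	return prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
}

/* Return 1 if no_new_privs is set, 0 if not, -1 on error. */
int get_no_new_privs(void)
{
	return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
}
```

With the bit set, execve() of a set-uid binary no longer grants extra
privileges, so a deny-write-execute policy inherited from an
unprivileged parent cannot be used to confuse a privileged program.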

> Posting this as an RFC to start a discussion and cc'ing some of the
> systemd guys and those involved in the earlier thread around the glibc
> workaround for dynamic libraries [4]. Before thinking of upstreaming
> this we'd need the systemd folk to buy into replacing the MDWE SECCOMP
> BPF filter with the in-kernel one.

As the author of this feature in systemd (and of a similar feature in Firejail), I'd strongly prefer an in-kernel version to the BPF protection. I'd definitely also want to use it in place of BPF on x86_64 and other arches.

An in-kernel version would probably make it fairly easy to cover this case as well (maybe it already does):

fd = memfd_create(...);
write(fd, malicious_code, sizeof(malicious_code));
mmap(..., PROT_EXEC, ..., fd);
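
A compilable version of the same idea, for testing, might look like the
sketch below. Assumptions: map_memfd_exec() and the "mdwe-test" name
are made up; the raw syscall is used so that the sketch does not depend
on the glibc memfd_create() wrapper; and the mapping is
PROT_READ|PROT_EXEC rather than plain PROT_EXEC so that a test can read
the contents back. The code is only mapped, never executed.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Write len bytes into a fresh memfd, then map the file read-exec.
 * Returns the mapping, or MAP_FAILED with errno set on failure.
 */
void *map_memfd_exec(const void *code, size_t len)
{
	int fd = (int)syscall(SYS_memfd_create, "mdwe-test", 0U);
	if (fd < 0)
		return MAP_FAILED;

	void *p = MAP_FAILED;
	ssize_t n = write(fd, code, len);

	if (n == (ssize_t)len)
		/* Executable mapping of contents written a moment ago. */
		p = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
	else if (n >= 0)
		errno = EIO;	/* short write: report it as an I/O error */

	int saved = errno;	/* keep the relevant errno across close() */
	close(fd);
	errno = saved;
	return p;
}
```

No mapping here is ever writeable and no mprotect() call is made, so a
check based only on mapping flags does not see anything to refuse.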

Other memory W^X implementations include S.A.R.A [1] and SELinux EXECMEM/EXECSTACK/EXECHEAP protections [2], [3]. SELinux checks IS_PRIVATE(file_inode(file)) and vma->anon_vma != NULL, which might be useful additions here too (or future extensions if you prefer).

-Topi

[1] https://smeso.it/sara/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/selinux/hooks.c#n3708
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/selinux/hooks.c#n3787