Re: [PATCH 0/4] aarch64: avoid mprotect(PROT_BTI|PROT_EXEC) [BZ #26831]

From: Mark Rutland
Date: Wed Nov 04 2020 - 10:20:41 EST


On Wed, Nov 04, 2020 at 11:55:57AM +0200, Topi Miettinen wrote:
> On 4.11.2020 11.29, Florian Weimer wrote:
> > * Will Deacon:
> >
> > > Is there real value in this seccomp filter if it only looks at mprotect(),
> > > or was it just implemented because it's easy to do and sounds like a good
> > > idea?
> >
> > It seems bogus to me. Everyone will just create alias mappings instead,
> > just like they did for the similar SELinux feature. See “Example code
> > to avoid execmem violations” in:
> >
> > <https://www.akkadia.org/drepper/selinux-mem.html>
>
> Also note "But this is very dangerous: programs should never use memory
> regions which are writable and executable at the same time. Assuming that it
> is really necessary to generate executable code while the program runs the
> method employed should be reconsidered."

Sure, and to be clear, we're not trying to violate the "at the same
time" property. We do not want to permit simultaneous PROT_WRITE and
PROT_EXEC at any point. What we're asking is to not block changing
permissions to PROT_EXEC in the absence of PROT_WRITE.

I think that the goal of preventing WRITE -> EXEC transitions for some
memory is sane, but I think the existing kernel primitives available to
systemd don't allow us to do that in a robust way because we don't have
all the relevant state tracked and accessible, and the existing approach
gets in the way of doing the right thing for other mitigations.

Consequently I think it would be better going forward to add a more
robust (kernel) mechanism for enforcement that can distinguish
WRITE->EXEC from EXEC->EXEC+BTI, and e.g. can be used to forbid aliasing
mappings with differing W/X permissions. Then userspace could eventually
transition over to that and get /stronger/ protection while permitting
the BTI case we'd like to work now.
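
To illustrate the distinction, here is a rough sketch of the two kinds
of check. It is only an illustration under stated assumptions:
struct map_state, stateless_allows() and stateful_allows() are
hypothetical names, not existing kernel or systemd interfaces, and the
PROT_BTI fallback simply mirrors the arm64 uapi value so the sketch
builds on other architectures.

```c
#include <stdbool.h>
#include <sys/mman.h>

/* PROT_BTI is arm64-specific; fall back to the arm64 uapi value so the
 * sketch compiles elsewhere. */
#ifndef PROT_BTI
#define PROT_BTI 0x10
#endif

/* Hypothetical per-mapping state that an enforcement mechanism would
 * need to track. A syscall-argument filter has no access to this. */
struct map_state {
	int  prot;          /* current protection bits */
	bool was_writable;  /* mapping has been PROT_WRITE at some point */
};

/* Roughly what a stateless filter on mprotect() arguments can do: it
 * sees only the requested prot, so every PROT_EXEC request is refused,
 * including EXEC -> EXEC+BTI. */
static bool stateless_allows(int new_prot)
{
	return !(new_prot & PROT_EXEC);
}

/* Hypothetical state-aware policy (an assumption, not a real API). */
static bool stateful_allows(const struct map_state *m, int new_prot)
{
	bool writable_history = m->was_writable || (m->prot & PROT_WRITE);

	/* Never writable and executable at the same time. */
	if ((new_prot & PROT_WRITE) && (new_prot & PROT_EXEC))
		return false;

	/* No WRITE -> EXEC: memory that is or ever was writable may
	 * not be made executable. */
	if ((new_prot & PROT_EXEC) && writable_history)
		return false;

	/* Do not allow BTI to be removed once it has been set. */
	if ((m->prot & PROT_BTI) && !(new_prot & PROT_BTI))
		return false;

	return true;
}
```

With a mapping that is currently PROT_READ|PROT_EXEC and has never been
writable, stateless_allows(PROT_READ|PROT_EXEC|PROT_BTI) is false while
stateful_allows() accepts the same request. A mapping that has been
writable is still refused PROT_EXEC by the state-aware check.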

> If a service legitimately needs executable and writable mappings (due to
> JIT, trampolines etc), it's easy to disable the filter whenever really
> needed with "MemoryDenyWriteExecute=no" (which is the default) in case of
> systemd or a TE rule like "allow type_t self:process { execmem };" for
> SELinux. But this shouldn't be the default case, since there are many
> services which don't need W&X.
>
> I'd also question what is the value of BTI if it can be easily circumvented
> by removing PROT_BTI with mprotect()?

I agree that turning BTI off is a concern, and to that end I'd like to
add an enforcement mechanism whereby we could prevent that (ideally the
same mechanism by which we could prevent WRITE -> EXEC transitions).

But, as with all things, it's a matter of degree. MDWE and BTI are both
hurdles to an adversary, but neither is absolute and there are
approaches to bypass either. By the time someone's issuing mprotect()
with an arbitrary VA and/or prot, they are liable to have been able to
do the same with mmap() and circumvent MDWE.

I'd really like to not have BTI silently disabled in order to work with
MDWE, because the risk is that it gets silently disabled elsewhere. The
risk of changing the kernel to enable BTI for a binary is not well
understood, since we don't control other people's libraries, which
might turn out to be incompatible with that. The risk of disabling a
portion of the MDWE protections seems to be the least of the options we
have available, as unfortunate as that is, and I think we can come up
with a better MDWE approach going forward.

Thanks,
Mark.