Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only

From: Lokesh Gidra
Date: Fri Sep 04 2020 - 20:36:23 EST


On Thu, Sep 3, 2020 at 8:34 PM Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
>
> Hello,
>
> On Mon, Aug 17, 2020 at 03:11:16PM -0700, Lokesh Gidra wrote:
> > There has been an emphasis that Android is probably the only user for
> > the restriction of userfaults from kernel-space and that it wouldn’t
> > be useful anywhere else. I humbly disagree! There are various areas
> > where the PROT_NONE+SIGSEGV trick is (and can be) used in a purely
> > user-space setting. Basically, any lazy, on-demand,
>
> For the record what I said is quoted below
> https://lkml.kernel.org/r/20200520194804.GJ26186@xxxxxxxxxx :
>
> """It all boils down of how peculiar it is to be able to leverage only
> the acceleration [..] Right now there's a single user that can cope
> with that limitation [..] If there will be more users [..] it'd be
> fine to add a value "2" later."""
>
> Specifically I never said "that it wouldn’t be useful anywhere else.".
>
Thanks a lot for clarifying.

> Also I'm only arguing about the sysctl visible kABI change in patch
> 2/2: the flag passed as parameter to the syscall in patch 1/2 is all
> great, because seccomp needs it in the scalar parameter of the syscall
> to implement a filter equivalent to your sysctl "2" policy with only
> patch 1/2 applied.
>
> I've two more questions now:
>
> 1) why don't you enforce the block of kernel initiated faults with
> seccomp-bpf instead of adding a sysctl value 2? Is the sysctl just
> an optimization to remove a few instructions per syscall in the bpf
> execution of Android unprivileged apps? You should block a lot of
> other syscalls by default to all unprivileged processes, including
> vmsplice.
>
> In other words if it's just for Android, why can't Android solve it
> with only patch 1/2 by tweaking the seccomp filter?

I would let Nick (nnk@) and Jeff (jeffv@) respond to this.

The previous responses from both of them on this email thread
(https://lore.kernel.org/lkml/CABXk95A-E4NYqA5qVrPgDF18YW-z4_udzLwa0cdo2OfqVsy=SQ@xxxxxxxxxxxxxx/
and https://lore.kernel.org/lkml/CAFJ0LnGfrzvVgtyZQ+UqRM6F3M7iXOhTkUBTc+9sV+=RrFntyQ@xxxxxxxxxxxxxx/)
suggest that the performance overhead of seccomp-bpf is too much. Kees
also objected to it
(https://lore.kernel.org/lkml/202005200921.2BD5A0ADD@keescook/)

I'm not familiar with how seccomp-bpf works. All that I can add here
is that userfaultfd syscall is usually not invoked in a performance
critical code path. So, if the performance overhead of seccomp-bpf (if
enabled) is observed on all syscalls originating from a process, then
I'd say patch 2/2 is essential. Otherwise, it should be ok to let
seccomp perform the same functionality instead.

>
> 2) given that Android is secure enough with the sysctl at value 2, why
> should we even retain the current sysctl 0 semantics? Why can't
> more secure systems just use seccomp and block userfaultfd, as it
> is already happens by default in the podman default seccomp
> whitelist (for those containers that don't define a new json
> whitelist in the OCI schema)? Shouldn't we focus our energy in
> making containers more secure by preventing the OCI schema of a
> random container to re-enable userfaultfd in the container seccomp
> filter instead of trying to solve this with a global sysctl?
>
> What's missing in my view is a kubernetes hard allowlist/denylist
> that cannot be overridden with the OCI schema in case people has
> the bad idea of running containers downloaded from a not fully
> trusted source, without adding virt isolation and that's an
> userland problem to be solved in the container runtime, not a
> kernel issue. Then you'd just add userfaultfd to the json of the
> k8s hard seccomp denylist instead of going around tweaking sysctl.
>
> What's your take in changing your 2/2 patch to just replace value "0"
> and avoid introducing a new value "2"?

SGTM. Disabling uffd completely for unprivileged processes can be
achieved either using seccomp-bpf, or via SELinux, once the following
patch series is upstreamed
https://lore.kernel.org/lkml/20200827063522.2563293-1-lokeshgidra@xxxxxxxxxx/

>
> The value "0" was motivated by the concern that uffd can enlarge the
> race window for use after free by providing one more additional way to
> block kernel faults, but value "2" is already enough to solve that
> concern completely and it'll be the default on all Android.
>
> In other words by adding "2" you're effectively doing a more
> finegrined and more optimal implementation of "0" that remains useful
> and available to unprivileged apps and it already resolves all
> "robustness against side effects other kernel bugs" concerns. Clearly
> "0" is even more secure statistically but that would apply to every
> other syscall including vmsplice, and there's no
> /proc/sys/vm/unprivileged_vmsplice sysctl out there.
>
> The next issue we have now is with the pipe mutex (which is not a
> major concern but we need to solve it somehow for correctness). So I
> wonder if should make the default value to be "0" (or "2" if think we
> should not replace "0") and to allow only user initiated faults by
> default.
>
> Changing the sysctl default to be 0, will make live migration fail to
> switch to postcopy which will be (unnoticeable to the guest), instead
> of risking the VM to be killed because of network latency
> outlier. Then we wouldn't need to change the pipe code at all.
>
SGTM. I can change the default value to '0' (or '2') in the next
revision of patch 2/2, unless somebody objects to this.

> Alternatively we could still fix the pipe code so it runs better (but
> it'll be more complex) or to disable uffd faults only in the pipe
> code.
>
> One thing to keep in mind is that if we change the default, then
> hypervisor hosts running QEMU would need to set:
>
> vm.userfaultfd = 1
>
> in /etc/sysctl.conf if postcopy live migration is required, that's not
> particularly concerning constraint for qemu (there are likely other
> tweaks required and it looks less risky than an arbitrary timeout
> which could kill the VM: if the above is forgotten the postcopy live
> migration won't even start and it'll be unnoticeable to the guest).
>
> The main concern really are future apps that may want to use uffd for
> kernel initiated faults won't be allowed to do so by default anymore,
> those apps will be heavily incentivated to use bounce buffers before
> passing data to syscalls, similarly to the current use case of patch 2/2.
>
> Comments welcome,
> Andrea
>
> PS. Another usage of uffd that remains possible without privilege with
> the 2/2 patch sysctl "2" behavior (besides the strict SIGSEGV
> acceleration) is the UFFD_FEATURE_SIGBUS. That's good so a malloc lib
> will remain possible without requiring extra privileges, by adding a
> UFFDIO_POPULATE to use in combination with UFFD_FEATURE_SIGBUS
> (UFFDIO_POPULATE just needs to zero out a page and map it, it'll be
> indistinguishable to UFFDIO_ZEROPAGE but it will solve the last
> performance bottleneck by avoiding a wrprotect fault after the
> allocation and it will be THP capable too). Memory will be freed with
> MADV_DONTNEED, without ever having to call mmap/mumap. It could move
> memory around with UFFDIO_COPY+MADV_DONTNEED or by adding UFFDIO_REMAP
> which already exists.
>
UFFDIO_POPULATE sounds like a really useful feature. I don't see it in
the kernel yet. Is there a patch under work on this? If so, kindly
share.