Re: [PATCH v6 33/41] x86/shstk: Introduce map_shadow_stack syscall

From: Deepak Gupta
Date: Thu Feb 23 2023 - 16:20:34 EST


On Wed, Feb 22, 2023 at 5:11 PM Edgecombe, Rick P
<rick.p.edgecombe@xxxxxxxxx> wrote:
>
> On Wed, 2023-02-22 at 16:03 -0800, Deepak Gupta wrote:
> > On Sat, Feb 18, 2023 at 01:14:25PM -0800, Rick Edgecombe wrote:
> > > When operating with shadow stacks enabled, the kernel will
> > > automatically
> > > allocate shadow stacks for new threads, however in some cases
> > > userspace
> > > will need additional shadow stacks. The main example of this is the
> > > ucontext family of functions, which require userspace allocating
> > > and
> > > pivoting to userspace managed stacks.
> > >
> > > Unlike most other user memory permissions, shadow stacks need to be
> > > provisioned with special data in order to be useful. They need to
> > > be setup
> > > with a restore token so that userspace can pivot to them via the
> > > RSTORSSP
> > > instruction. But, the security design of shadow stack's is that
> > > they
> > > should not be written to except in limited circumstances. This
> > > presents a
> > > problem for userspace, as to how userspace can provision this
> > > special
> > > data, without allowing for the shadow stack to be generally
> > > writable.
> > >
> > > Previously, a new PROT_SHADOW_STACK was attempted, which could be
> > > mprotect()ed from RW permissions after the data was provisioned.
> > > This was
> > > found to not be secure enough, as other thread's could write to the
> > > shadow stack during the writable window.
> > >
> > > The kernel can use a special instruction, WRUSS, to write directly
> > > to
> > > userspace shadow stacks. So the solution can be that memory can be
> > > mapped
> > > as shadow stack permissions from the beginning (never generally
> > > writable
> > > in userspace), and the kernel itself can write the restore token.
> > >
> > > First, a new madvise() flag was explored, which could operate on
> > > the
> > > PROT_SHADOW_STACK memory. This had a couple downsides:
> > > 1. Extra checks were needed in mprotect() to prevent writable
> > > memory from
> > > ever becoming PROT_SHADOW_STACK.
> > > 2. Extra checks/vma state were needed in the new madvise() to
> > > prevent
> > > restore tokens being written into the middle of pre-used shadow
> > > stacks.
> > > It is ideal to prevent restore tokens being added at arbitrary
> > > locations, so the check was to make sure the shadow stack had
> > > never been
> > > written to.
> > > 3. It stood out from the rest of the madvise flags, as more of
> > > direct
> > > action than a hint at future desired behavior.
> > >
> > > So rather than repurpose two existing syscalls (mmap, madvise) that
> > > don't
> > > quite fit, just implement a new map_shadow_stack syscall to allow
> > > userspace to map and setup new shadow stacks in one step. While
> > > ucontext
> > > is the primary motivator, userspace may have other unforeseen
> > > reasons to
> > > setup it's own shadow stacks using the WRSS instruction. Towards
> > > this
> > > provide a flag so that stacks can be optionally setup securely for
> > > the
> > > common case of ucontext without enabling WRSS. Or potentially have
> > > the
> > > kernel set up the shadow stack in some new way.
> >
> > Was following ever attempted?
> >
> > void *shstk = mmap(0, size, PROT_SHADOWSTACK, ...);
> > - limit PROT_SHADOWSTACK protection flag to only mmap (and thus
> > mprotect can't
> > convert memory from shadow stack to non-shadow stack type or vice
> > versa)
> > - limit PROT_SHADOWSTACK protection flag to anonymous memory only.
> > - top level mmap handler to put a token at the base using WRUSS if
> > prot == PROT_SHADOWSTACK
> >
> > You essentially would get shadow stack manufacturing with existing
> > (single) syscall.
> > Acting a bit selfish here, this allows other architectures as well to
> > re-use this and
> > do their own implementation of mapping and placing the token at the
> > base.
>
> Yes, I looked at it. You end up with a pile of checks and hooks added
> to mmap() and various other places as you outline. We also now have the
> MAP_ABOVE4G limitation for x86 shadow stack that would need checking
> for too. It's not exactly a clean fit. Then, callers would have to pass
> special x86 flags in anyway.

riscv has mechanisms using which a 32bit app can run on 64bit kernel.
So technically if there are 32bit and 64bit code in address space,
MAP_ABOVE4G could be useful.
Although I am not sure (or aware of) if there are such requirement
from app/developers yet (to guarantee address mapping above 4G)

But I see this as orthogonal to memory protection flags.

>
> It doesn't seem like the complexity of the checks is worth saving the
> tiny syscall. Is there some reason why riscv can't use the same syscall
> stub? It doesn't need to live forever in x86 code. Not sure what the
> savings are for riscv of the mmap+checks approach are either...

I don't see a lot of extra complexity here.
If `mprotect` and friends don't know about `PROT_SHADOWSTACK`, they'll
just fail by default (which is desired)

It's only `mmap` that needs to be enlightened. And it can just pass
`VMA_SHADOW_STACK` to `do_mmap` if input is `PROT_SHADOWSTACK`.

Adding a syscall just for mapping shadow stack is weird when it can be
solved with existing system calls.
As you say in your response below, it would be good to have such a
syscall which serve larger purposes (e.g. provisioning special
security-type memory)

arm64's memory tagging is one such example. Not exactly security-type
memory (but eventual application is security for this feature) .
It adds extra meaning to virtual addresses (i.e. an address has tags).
arm64 went about using a protection flag `PROT_MTE` instead of a
special system call.

Being said that since this patch has gone through multiple revisions
and I am new to the party. If others dont have issues on this special
system call,
I think it's fine then. In case of riscv I can choose to use this
mechanism or go via arm's route to define PROT_SHADOWSTACK which is
arch specific.

>
> I did wonder if there could be some sort of more general syscall for
> mapping and provisioning special security-type memory. But we probably
> need a few more non-shadow stack examples to get an idea of what that
> would look like.

As I mentioned memory tagging and thus PROT_MTE is already such a use
case which uses `mmap/mprotect` protection flags to designate special
meaning to a virtual address.