Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk

From: Edgecombe, Rick P
Date: Tue Jan 24 2023 - 13:14:43 EST


On Tue, 2023-01-24 at 17:24 +0100, David Hildenbrand wrote:
> On 23.01.23 21:47, Edgecombe, Rick P wrote:
> > On Mon, 2023-01-23 at 10:50 +0100, David Hildenbrand wrote:
> > > On 19.01.23 22:22, Rick Edgecombe wrote:
> > > > The x86 Control-flow Enforcement Technology (CET) feature
> > > > includes
> > > > a new
> > > > type of memory called shadow stack. This shadow stack memory
> > > > has
> > > > some
> > > > unusual properties, which requires some core mm changes to
> > > > function
> > > > properly.
> > > >
> > > > Since shadow stack memory can be changed from userspace, is
> > > > both
> > > > VM_SHADOW_STACK and VM_WRITE. But it should not be made
> > > > conventionally
> > > > writable (i.e. pte_mkwrite()). So some code that calls
> > > > pte_mkwrite() needs
> > > > to be adjusted.
> > > >
> > > > One such case is when memory is made writable without an actual
> > > > write
> > > > fault. This happens in some mprotect operations, and also
> > > > prot_numa
> > > > faults.
> > > > In both cases code checks whether it should be made
> > > > (conventionally)
> > > > writable by calling vma_wants_manual_pte_write_upgrade().
> > > >
> > > > One way to fix this would be have code actually check if memory
> > > > is
> > > > also
> > > > VM_SHADOW_STACK and in that case call pte_mkwrite_shstk(). But
> > > > since
> > > > most memory won't be shadow stack, just have simpler logic and
> > > > skip
> > > > this
> > > > optimization by changing vma_wants_manual_pte_write_upgrade()
> > > > to
> > > > not
> > > > return true for VM_SHADOW_STACK_MEMORY. This will simply handle
> > > > all
> > > > cases of this type.
> > > >
> > > > Cc: David Hildenbrand <david@xxxxxxxxxx>
> > > > Tested-by: Pengfei Xu <pengfei.xu@xxxxxxxxx>
> > > > Tested-by: John Allen <john.allen@xxxxxxx>
> > > > Signed-off-by: Yu-cheng Yu <yu-cheng.yu@xxxxxxxxx>
> > > > Reviewed-by: Kirill A. Shutemov <
> > > > kirill.shutemov@xxxxxxxxxxxxxxx>
> > > > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@xxxxxxxxx>
> > > > ---
> > >
> > > Instead of having these x86-shadow stack details all over the MM
> > > space,
> > > was the option explored to handle this more in arch specific
> > > code?
> > >
> > > IIUC, one way to get it working would be
> > >
> > > 1) Have a SW "shadowstack" PTE flag.
> > > 2) Have an "SW-dirty" PTE flag, to store "dirty=1" when
> > > "write=0".
> >
> > I don't think that idea came up. So vma->vm_page_prot would have
> > the SW
> > shadow stack flag for VM_SHADOW_STACK, and pte_mkwrite() could do
> > Write=0,Dirty=1 part. It seems like it should work.
> >
>
> Right, if we include it in vma->vm_page_prot, we'd immediately let
> mk_pte() just handle that.
>
> Otherwise, we'd have to refactor e.g., mk_pte() to consume a vma
> instead
> of the vma->vm_page_prot. Let's see if we can avoid that for now.
>
> > >
> > > pte_mkwrite(), pte_write(), pte_dirty ... can then make decisions
> > > based
> > > on the "shadowstack" PTE flag and hide all these details from
> > > core-
> > > mm.
> > >
> > > When mapping a shadowstack page (new page, migration, swapin,
> > > ...),
> > > which can be obtained by looking at the VMA flags, the first
> > > thing
> > > you'd
> > > do is set the "shadowstack" PTE flag.
> >
> > I guess the downside is that it uses an extra software bit. But the
> > other positive is that it's less error prone, so that someone
> > writing
> > core-mm code won't introduce a change that makes shadow stack VMAs
> > Write=1 if they don't know to also check for VM_SHADOW_STACK.
>
> Right. And I think this mimics the what I would have expected HW to
> provide: a dedicated HW bit, not somehow mangling this into semantics
> of
> existing bits.

Yea.

>
> Roughly speaking: if we abstract it that way and get all of the "how
> to
> set it writable now?" out of core-MM, it not only is cleaner and
> less
> error prone, it might even allow other architectures that implement
> something comparable (e.g., using a dedicated HW bit) to actually
> reuse
> some of that work. Otherwise most of that "shstk" is really just x86
> specific ...
>
> I guess the only cases we have to special case would be page pinning
> code where pte_write() would indicate that the PTE is writable (well,
> it
> is, just not by "ordinary CPU instruction" context directly): but you
> do
> that already, so ... :)
>
> Sorry for stumbling over that this late, I only started looking into
> this when you CCed me on that one patch.

Sorry for not calling more attention to it earlier. Appreciate your
comments.

Previously versions of this series had changed some of these
pte_mkwrite() calls to maybe_mkwrite(), which of course takes a vma.
This way an x86 implementation could use the VM_SHADOW_STACK vma flag
to decide between pte_mkwrite() and pte_mkwrite_shstk(). The feedback
was that in some of these code paths "maybe" isn't really an option, it
*needs* to make it writable. Even though the logic was the same, the
name of the function made it look wrong.

But another option could be to change pte_mkwrite() to take a vma. This
would save using another software bit on x86, but instead requires a
small change to each arch's pte_mkwrite().

x86's pte_mkwrite() would then be pretty close to maybe_mkwrite(), but
maybe it could additionally warn if the vma is not writable. It also
seems more aligned with your changes to stop taking hints from PTE bits
and just look at the VMA? (I'm thinking about the dropping of the dirty
check in GUP and dropping pte_saved_write())