Re: [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit

From: Andy Lutomirski
Date: Wed Sep 11 2019 - 14:03:37 EST


On Wed, Sep 11, 2019 at 12:49 AM Thomas Hellström (VMware)
<thomas_os@xxxxxxxxxxxx> wrote:
>
> Hi, Andy.
>
> On 9/11/19 6:18 AM, Andy Lutomirski wrote:

> >
> > As a for-real example, take a look at arch/x86/entry/vdso/vma.c. The
> > "vvar" VMA contains multiple pages that are backed by different types
> > of memory. There's a page of ordinary kernel memory. Historically
> > there was a page of genuine MMIO memory, but that's gone now. If the
> > system is a SEV guest with pvclock enabled, then there's a page of
> > decrypted memory. They all share a VMA, they're instantiated in
> > .fault, and there is no hackery involved. Look at vvar_fault().
>
> So this is conceptually identical to TTM. The difference is that it uses
> vmf_insert_pfn_prot() instead of vmf_insert_mixed() with the vma hack.
> Had there been a vmf_insert_mixed_prot(), the hack in TTM wouldn't be
> needed. I guess providing a vmf_insert_mixed_prot() is a to-do for me to
> pick up.

:)
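
For the record, I'd expect such a helper to just mirror
vmf_insert_pfn_prot() and take an explicit pgprot instead of reading
vma->vm_page_prot. A rough sketch (the helper doesn't exist yet, and it
assumes __vm_insert_mixed() grows a pgprot argument):

  /*
   * Hypothetical: like vmf_insert_mixed(), but with a caller-supplied
   * protection, so drivers don't have to fake up vma->vm_page_prot.
   * Assumes __vm_insert_mixed() is extended to take the pgprot
   * instead of reading it from the vma.
   */
  vm_fault_t vmf_insert_mixed_prot(struct vm_area_struct *vma,
                                   unsigned long addr, pfn_t pfn,
                                   pgprot_t pgprot)
  {
          return __vm_insert_mixed(vma, addr, pfn, pgprot, false);
  }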

>
> Having said that, the code you point out is just as fragile and suffers
> from the same shortcomings as TTM, since
> a) running things like split_huge_pmd(), which takes vm_page_prot and
> applies it to new PTEs, will break things (although that's probably
> never applicable in this case).

Hmm. There are no vvar huge pages, so this is okay.

I wonder how hard it would be to change the huge page splitting code
to copy the encryption and cacheability bits from the source entry
instead of getting them from vm_page_prot, at least in the cases
relevant to VM_MIXEDMAP and VM_PFNMAP.
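
Something like this in __split_huge_pmd_locked(), perhaps (purely a
sketch, untested; the mask choice is mine, and a real version would
have to translate the huge-page PAT bit, which sits at a different
position in a PMD than in a PTE):

  /*
   * Sketch: preserve encryption and cacheability bits from the PMD
   * being split instead of taking them from vma->vm_page_prot.
   * NB: in a huge PMD the PAT bit is _PAGE_PAT_LARGE, so a real
   * version must translate it to _PAGE_PAT before applying it to
   * the child PTEs; it's left out of the mask here.
   */
  pgprotval_t mask = _PAGE_ENC | _PAGE_PCD | _PAGE_PWT;
  pgprot_t prot = vma->vm_page_prot;

  pgprot_val(prot) = (pgprot_val(prot) & ~mask) |
                     (pmd_flags(old_pmd) & mask);
  /* ... then build the child PTEs with 'prot' ... */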

> b) running mprotect() on that VMA will only work if sme_me_mask is part
> of _PAGE_CHG_MASK (which my recent patchset addresses in a reliable
> way); otherwise the encryption setting will be overwritten.
>

Indeed. Thanks for the fix!
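
For anyone following along, the fix boils down to making the encryption
bit part of what pgprot_modify() preserves, roughly like this (the
exact mask definition in the posted patch may differ):

  /*
   * Include _PAGE_ENC in the set of bits pgprot_modify() preserves,
   * so mprotect() and friends can't silently strip the encryption
   * setting. Approximate; see the actual patch for details.
   */
  #define _PAGE_CHG_MASK  (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |     \
                           _PAGE_SPECIAL | _PAGE_ACCESSED |           \
                           _PAGE_DIRTY | _PAGE_SOFT_DIRTY |           \
                           _PAGE_DEVMAP | _PAGE_ENC)

  static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
  {
          pgprotval_t preservebits = pgprot_val(oldprot) & _PAGE_CHG_MASK;
          pgprotval_t addbits = pgprot_val(newprot) & ~_PAGE_CHG_MASK;
          return __pgprot(preservebits | addbits);
  }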

>
> >> We could probably get away with a WRITE_ONCE() update of the
> >> vm_page_prot before taking the page table lock since
> >>
> >> a) We're locking out all other writers.
> >> b) We can't race with another fault to the same vma since we hold an
> >> address space lock ("buffer object reservation")
> >> c) When we need to update it, there are no valid page table entries
> >> in the vma, since the update only happens directly after mmap(), or
> >> after an unmap_mapping_range() with the same address space lock held.
> >> When another reader (for example split_huge_pmd()) sees a valid page
> >> table entry, it also sees the new page protection and things are fine.
> >>
> >> But that would really be a special case. To solve this properly we'd
> >> probably need an additional lock to protect the vm_flags and
> >> vm_page_prot, taken after mmap_sem and i_mmap_lock.
> >>
> > This is all horrible IMO.
>
> I'd prefer to call it fragile and potentially troublesome to maintain.

Fair enough.

>
> That distinction is important, because if it ever comes to a choice
> between adding a new lock to protect vm_page_prot (and consequently
> slowing down the whole vm system) and using the WRITE_ONCE solution in
> TTM, we should know what the implications are. As it turns out,
> previous choices in this area actually seem to have opted for the
> lockless WRITE_ONCE / READ_ONCE / ptl solution. See
> __split_huge_pmd_locked() and vma_set_page_prot().
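
(For reference, vma_set_page_prot() already publishes the new
protection with a plain WRITE_ONCE() and leaves it to readers to pair
that with READ_ONCE() or the ptl. Quoting mm/mmap.c from memory, so
treat the details as approximate:)

  void vma_set_page_prot(struct vm_area_struct *vma)
  {
          unsigned long vm_flags = vma->vm_flags;
          pgprot_t vm_page_prot;

          vm_page_prot = vm_pgprot_modify(vma->vm_page_prot, vm_flags);
          if (vma_wants_writenotify(vma, vm_page_prot)) {
                  vm_flags &= ~VM_SHARED;
                  vm_page_prot = vm_pgprot_modify(vm_page_prot, vm_flags);
          }
          /* remove_protection_ptes reads vma->vm_page_prot without mmap_sem */
          WRITE_ONCE(vma->vm_page_prot, vm_page_prot);
  }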

I think it would be even better if the whole thing could work without
ever writing to vm_page_prot. This would be a requirement for vvar in
the unlikely event that the vvar vma ever supported splittable huge
pages. Fortunately, that seems unlikely :)
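
That is, the fault handler would always derive the protection itself,
the way vvar_fault() does for the pvclock page. A minimal sketch of
that pattern (page_needs_decrypted() and my_pfn() are made-up helpers
standing in for driver-specific logic):

  /*
   * "Never write vm_page_prot" approach: compute the protection
   * per page at fault time and pass it to vmf_insert_pfn_prot().
   * page_needs_decrypted() and my_pfn() are hypothetical.
   */
  static vm_fault_t my_fault(struct vm_fault *vmf)
  {
          pgprot_t prot = vmf->vma->vm_page_prot;

          if (page_needs_decrypted(vmf->pgoff))
                  prot = pgprot_decrypted(prot);

          return vmf_insert_pfn_prot(vmf->vma, vmf->address,
                                     my_pfn(vmf->pgoff), prot);
  }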