Re: [PATCH V2 06/32] x86/sgx: Support VMA permissions more relaxed than enclave permissions

From: Jarkko Sakkinen
Date: Mon Mar 07 2022 - 12:10:55 EST


On Mon, Feb 07, 2022 at 04:45:28PM -0800, Reinette Chatre wrote:
> === Summary ===
>
> An SGX VMA can only be created if its permissions are the same or
> weaker than the Enclave Page Cache Map (EPCM) permissions. After VMA
> creation this same rule is again enforced by the page fault handler:
> faulted enclave pages are required to have equal or more relaxed
> EPCM permissions than the VMA permissions.
>
> On SGX1 systems the additional enforcement in the page fault handler
> is redundant and on SGX2 systems it incorrectly prevents access.
> On SGX1 systems it is unnecessary to repeat the enforcement of the
> permission rule. The rule used during original VMA creation will
> ensure that any access attempt will use correct permissions.
> With SGX2 the EPCM permissions of a page can change after VMA
> creation resulting in the VMA permissions potentially being more
> relaxed than the EPCM permissions and the page fault handler
> incorrectly blocking valid access attempts.
>
> Enable the VMA's pages to remain accessible while ensuring that
> the PTEs are installed to match the EPCM permissions but not be
> more relaxed than the VMA permissions.
>
> === Full Changelog ===
>
> An SGX enclave is an area of memory where parts of an application
> can reside. First an enclave is created and loaded (from
> non-enclave memory) with the code and data of an application,
> then user space can map (mmap()) the enclave memory to
> be able to enter the enclave at its defined entry points for
> execution within it.
>
> The hardware maintains a secure structure, the Enclave Page Cache Map
> (EPCM), that tracks the contents of the enclave. Of interest here is
> its tracking of the enclave page permissions. When a page is loaded
> into the enclave its permissions are specified and recorded in the
> EPCM. In parallel the kernel maintains permissions within the
> page table entries (PTEs) and the rule is that PTE permissions
> are not allowed to be more relaxed than the EPCM permissions.
>
> A new mapping (mmap()) of enclave memory can only succeed if the
> mapping has the same or weaker permissions than the permissions that
> were vetted during enclave creation. This is enforced by
> sgx_encl_may_map() that is called on the mmap() as well as mprotect()
> paths. This rule remains.
>
> One feature of SGX2 is to support the modification of EPCM permissions
> after enclave initialization. Enclave pages may thus already be part
> of a VMA at the time their EPCM permissions are changed resulting
> in the VMA's permissions potentially being more relaxed than the EPCM
> permissions.
>
> Allow permissions of existing VMAs to be more relaxed than EPCM
> permissions in preparation for dynamic EPCM permission changes
> made possible in SGX2. New VMAs that attempt to have more relaxed
> permissions than EPCM permissions continue to be unsupported.
>
> Reasons why permissions of existing VMAs are allowed to be more relaxed
> than EPCM permissions instead of dynamically changing VMA permissions
> when EPCM permissions change are:
> 1) Changing VMA permissions involves splitting VMAs, which is an
> operation that can fail. Additionally, changing the EPCM permissions of
> a range of pages could also fail on any of the pages involved.
> Handling these error cases causes problems. For example, if an
> EPCM permission change fails and the VMA has already been split
> then it is not possible to undo the VMA split nor possible to
> undo the EPCM permission changes that did succeed before the
> failure.
> 2) The kernel has little insight into the user space where EPCM
> permissions are controlled from. For example, a RW page may
> be made RO just before it is made RX and splitting the VMAs
> while the VMAs may change soon is unnecessary.
>
> Remove the extra permission check called on a page fault
> (vm_operations_struct->fault) or during debugging
> (vm_operations_struct->access) when loading the enclave page from swap
> that ensures that the VMA permissions are not more relaxed than the
> EPCM permissions. Since a VMA could only exist if it passed the
> original permission checks during mmap() and a VMA may indeed
> have more relaxed permissions than the EPCM permissions, this extra
> permission check is no longer appropriate.
>
> With the permission check removed, ensure that PTEs do
> not blindly inherit the VMA permissions but instead the permissions
> that the VMA and EPCM agree on. PTEs for writable pages (from VMA
> and enclave perspective) are installed with the writable bit set,
> reducing the need for this additional flow to the permission mismatch
> cases handled next.
>
> Signed-off-by: Reinette Chatre <reinette.chatre@xxxxxxxxx>
> ---
> Changes since V1:
> - Reword commit message (Jarkko).
> - Use "relax" instead of "exceed" when referring to permissions (Dave).
> - Add snippet to Documentation/x86/sgx.rst that highlights the
> relationship between VMA, EPCM, and PTE permissions on SGX
> systems (Andy).
>
> Documentation/x86/sgx.rst | 10 +++++++++
> arch/x86/kernel/cpu/sgx/encl.c | 38 ++++++++++++++++++----------------
> 2 files changed, 30 insertions(+), 18 deletions(-)
>
> diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst
> index 89ff924b1480..5659932728a5 100644
> --- a/Documentation/x86/sgx.rst
> +++ b/Documentation/x86/sgx.rst
> @@ -99,6 +99,16 @@ The relationships between the different permission masks are:
> * PTEs are installed to match the EPCM permissions, but not be more
> relaxed than the VMA permissions.
>
> +On systems supporting SGX2, EPCM permissions may change while an
> +enclave page belongs to a VMA, without impacting the VMA permissions.
> +This means that a VMA may appear to allow access to an enclave page
> +that its EPCM permissions no longer allow. For example, an enclave
> +page with RW EPCM permissions may be mapped by a RW VMA and
> +subsequently changed to have read-only EPCM permissions. The kernel
> +continues to maintain correct access to the enclave page through the
> +PTE, which ensures that only accesses allowed by both the VMA and
> +EPCM permissions are permitted.
> +
> Application interface
> =====================
>
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 48afe96ae0f0..b6105d9e7c46 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -91,10 +91,8 @@ static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page,
> }
>
> static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
> - unsigned long addr,
> - unsigned long vm_flags)
> + unsigned long addr)
> {
> - unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
> struct sgx_epc_page *epc_page;
> struct sgx_encl_page *entry;
>
> @@ -102,14 +100,6 @@ static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
> if (!entry)
> return ERR_PTR(-EFAULT);
>
> - /*
> - * Verify that the faulted page has equal or higher build time
> - * permissions than the VMA permissions (i.e. the subset of {VM_READ,
> - * VM_WRITE, VM_EXECUTE} in vma->vm_flags).
> - */
> - if ((entry->vm_max_prot_bits & vm_prot_bits) != vm_prot_bits)
> - return ERR_PTR(-EFAULT);
> -
> /* Entry successfully located. */
> if (entry->epc_page) {
> if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
> @@ -138,7 +128,9 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
> {
> unsigned long addr = (unsigned long)vmf->address;
> struct vm_area_struct *vma = vmf->vma;
> + unsigned long page_prot_bits;
> struct sgx_encl_page *entry;
> + unsigned long vm_prot_bits;
> unsigned long phys_addr;
> struct sgx_encl *encl;
> vm_fault_t ret;
> @@ -155,7 +147,7 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
>
> mutex_lock(&encl->lock);
>
> - entry = sgx_encl_load_page(encl, addr, vma->vm_flags);
> + entry = sgx_encl_load_page(encl, addr);
> if (IS_ERR(entry)) {
> mutex_unlock(&encl->lock);

> @@ -167,7 +159,19 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
>
> phys_addr = sgx_get_epc_phys_addr(entry->epc_page);
>
> - ret = vmf_insert_pfn(vma, addr, PFN_DOWN(phys_addr));
> + /*
> + * Insert the PTE to match the EPCM page permissions, ensuring they
> + * do not exceed the VMA permissions.
> + */
> + vm_prot_bits = vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
> + page_prot_bits = entry->vm_max_prot_bits & vm_prot_bits;
> + /*
> + * Add VM_SHARED so that PTE is made writable right away if VMA
> + * and EPCM are writable (no COW in SGX).
> + */
> + page_prot_bits |= (vma->vm_flags & VM_SHARED);
> + ret = vmf_insert_pfn_prot(vma, addr, PFN_DOWN(phys_addr),
> + vm_get_page_prot(page_prot_bits));
> if (ret != VM_FAULT_NOPAGE) {
> mutex_unlock(&encl->lock);
>
> @@ -295,15 +299,14 @@ static int sgx_encl_debug_write(struct sgx_encl *encl, struct sgx_encl_page *pag
> * Load an enclave page to EPC if required, and take encl->lock.
> */
> static struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl,
> - unsigned long addr,
> - unsigned long vm_flags)
> + unsigned long addr)
> {
> struct sgx_encl_page *entry;
>
> for ( ; ; ) {
> mutex_lock(&encl->lock);
>
> - entry = sgx_encl_load_page(encl, addr, vm_flags);
> + entry = sgx_encl_load_page(encl, addr);
> if (PTR_ERR(entry) != -EBUSY)
> break;
>
> @@ -339,8 +342,7 @@ static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
> return -EFAULT;
>
> for (i = 0; i < len; i += cnt) {
> - entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK,
> - vma->vm_flags);
> + entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK);
> if (IS_ERR(entry)) {
> ret = PTR_ERR(entry);
> break;
> --
> 2.25.1
>

If you unconditionally set vm_max_prot_bits to RWX for dynamically created
pages, you would not need to do this.

These patches could then be safely dropped:

- [PATCH V2 06/32] x86/sgx: Support VMA permissions more relaxed than enclave permissions
- [PATCH V2 08/32] x86/sgx: Add sgx_encl_page->vm_run_prot_bits for dynamic permission changes
- [PATCH V2 15/32] x86/sgx: Support relaxing of enclave page permissions

And that would also keep full ABI compatibility without adding exceptions
to the existing mainline code.

BR, Jarkko