Re: [PATCH V2 00/32] x86/sgx and selftests/sgx: Support SGX2

From: Nathaniel McCallum
Date: Wed Mar 02 2022 - 20:14:16 EST


On Wed, Mar 2, 2022 at 4:20 PM Reinette Chatre
<reinette.chatre@xxxxxxxxx> wrote:
>
> Hi Nathaniel,
>
> On 3/2/2022 8:57 AM, Nathaniel McCallum wrote:
> > Perhaps it would be better for us to have a shared understanding on
> > how the patches as posted are supposed to work in the most common
> > cases? I'm thinking here of projects such as Enarx, Gramine and
> > Occulum, which all have a similar process. Namely they execute an
> > executable (called exec in the below chart) which has things like
> > syscalls handled by a shim. These two components (shim and exec) are
> > supported by a non-enclave userspace runtime. Given this common
> > architectural pattern, this is how I understand adding pages via an
> > exec call to mmap() to work.
> >
> > https://mermaid.live/edit#pako:eNp1k81qwzAQhF9F6NRCAu1Vh0BIRemhoeSHBuIettYmFpElVZZLQ8i7144sJ8aOT2bmY3d2vT7R1AikjBb4U6JO8UXC3kGeaFI9FpyXqbSgPTmg06j6uiu1lzn2jSKTA2XwD9NEB31uPBLzi-6iMpLnYB8Wn4-kOBYpKBW52iXj8WQSmzEy5Zvt01ewG5HUQN2UEc7nK77YPjdALd64GWih8NpkALGwR_JtzOGAaKXexyTKGEt2pgoMaXahgj5Qgk9nM_6xGvDDJpsmOyiVv0LB62B8un4dBDrLiLPeWciCL9fvvKVQizhSG6stFz9Df7sxUpcYitR-SodFO2A_Vw-7l4nzzduqjX9bKJxOHDDeBB3RHF0OUlS3faq1hPoMqzulrHoVGPZOE32u0NIK8MiF9MZRtgNV4IhC6c3yqFPKvCsxQs3_0VDnfzf-CPg
> >
> > This only covers adding RW pages. I haven't even tackled permission
> > changes yet. Is that understanding correct? If not, please provide an
> > alternative sequence diagram to explain how you expect this to be
> > used.
>
> Please find my attempt linked below:
>
> https://mermaid.live/edit#pako:eNqFUsFqAjEQ_ZWQUwsK7XUPgthQeqiUVang9jAkoxu6m2yzWVsR_72J2WTbKnSOb97MvPeSI-VaIM1oix8dKo4PEnYG6kIRVw0YK7lsQFlSghGfYPCy845GYXWJm05ZWV8ZaEt55QB-IS9UwOfaItF7NGc0I3UNzU3-ekvaQ8uhqiLPd8l4PJnEYxmZsvXm7i20e5B4QlA5rAqMgJJfG9Ixg21X2ctVXn9GGJsvWb65729FSZXWDdlqpxx46Qzu-gB8-cHzhhim2zKdzdjLcuAAt3IPzv6Qkq84EdxGM3492UJS-cdSpLHp6nEgCPz3RjI5NPvAlRisJjspOsbWT8sUyc_MwjuynC1Wzyw9EB3RGk0NUrgvePRYQW2J7tNQd5sKDN5ooU6O2jXCiWZCWm1otoWqxRGFzurFQXGaWdNhJPXfuGedvgFejOuH
>
> The changes include:
> * Move mmap() to occur before attempting EACCEPT on the addresses. This is
> required for EACCEPT (as well as any subsequent access from within the enclave)
> to be able to access the pages.
> * Remove AEX[1] to the runtime within the loop. After EAUG returns execution
> will return to the instruction pointer that triggered the #PF, EACCEPT,
> this will cause the EACCEPT to be run again, this time succeeding.
>
> This is based on the implementation within this series. When supporting
> the new ioctl() requested by Jarkko there will be an additional ioctl()
> required before the loop.

https://mermaid.live/edit/#pako:eNp1U9FqgzAU_ZWQpw1a2F6FFaQLYw8ro7asUPeQmWsNNYlL4rZS-u-LRmut1ie953jvOecmR5woBjjABr5LkAk8c7rTVMQSuYeWVslSfIH23wXVlie8oNKijGr2SzUMkT1oCfmwrktpuRj5wWRcDKvwB0ksfX2hLCD1A7quBkgIWtwtP-6ROZiE5nnLq1A0nc5m7bAAhWSzffj0cFNEFaEaGiBCFiuy3D42hKp4gWZUshy6ISOUL6X2e4CCy10rQhUW8dR52QESivGUJ9RyJQ2SAAyYZ_V6ndUSsnldneVca_bJdvY7lkf6vc4haTBlbsdbDmLoaLlSBUqVy5wmWW2nw3rq26Pg-oTzOXlf9Xkt7BfTeqjjSWlP2JWTlkrC9cutlmcLlUlxoRBkE3T9Mrq7KArd0UBPqFDGTpstI2OphSv-jf1cBukPJlmSaP1GXFs8wQK0oJy523Ws-DG2GTiJOHCvDLx3HMuTo5YFc1MJ41ZpHKQ0NzDB1fWLDjLBgdUltKTmhjas0z-kWy8L

My comments below correspond to the arrow numbers in the diagram.

2. When the runtime receives the AEX, it doesn't have enough knowledge
to know whether or not to ask the kernel for an mmap(). So it has to
reenter the shim.

3. The shim has to handle the syscall instruction routing it to the
enclave's memory management subsystem.

4. The shim has to do bookkeeping and decide if additional pages are
even needed. If pages are already allocated, for example, it can skip
directly to step 13. However, if modifications are needed, it will go
to steps 5-12.

5-12. This is the part that represents new code from the kernel's
perspective for SGX2. It is also in a performance critical path and
should be evaluated with greater scrutiny. The number of context
switches is O(2N + 4) for each new allocated block, where N is the
number of pages: a context switch occurs at step 5, 6, 7, 8, 9/10 and
12. However, this can be reduced to O(4) for each new allocated block
with a simple modification:

https://mermaid.live/edit/#pako:eNqNk11rwyAUhv-KeLVBC9ttYIXQydjFymhaVmh24fSkkUbN1Gwrpf99pvlsk8G80nMez3l91SNmmgMOsIXPAhSDR0F3hspYIT9o4bQq5AeYap1T4wQTOVUOpdTwb2pgmNmDUZAN46ZQTsiRDTYVchiFH2CxquIL7QDpLzDnaICkpPnN8u0W2YNlNMsarsyi6XQ2a5oFKCSb7d17la6DqATKpgEiZLEiy-19DZTBXjalimfQNRlBPrTe7wFyoXaNCJ07JBJ_lh0gqblIBKNOaGWRAuDAK-qiVquWkM3zqpVzrblytjt-R2Va5yjR3h_K0nPrLleOaudFERKunzoIVE9Xj26VtZYbsEXmxgUOTP2_witfSTifk9fViMDzZPQuoij0V40eUK6tm9a3hqyjDq74P_zuH6V6aGRJovUL8WXxBEswkgruf8ux5GPsUvDvGQd-yiGhpS04ViePFjn3XQkXThscJDSzMMHld4oOiuHAmQIaqP5xNXX6BeBJIEk

The interesting thing about this pattern is that this can be done for
all page modification types except EMODT. For example, here's the same
process for changing a mapping from RW to RX:

https://mermaid.live/edit/#pako:eNqNk11rwyAUhv-KeLVBC9ttYIVCvdhFu5F0UGh24fSkkUbN1Gwrpf995jttMphXes7jOa-vesZMc8ABtvBZgGKwEvRgqIwV8oMWTqtCfoCp1zk1TjCRU-VQSg3_pgbGmSMYBdk4bgrlhJzYYFMhx1H4ARarOr7RDpD-AlNFAyQlze_C3T2yJ8tolrVcmUXz-WLRNgvQkuz2D-91ugmiEiibBoiQzZaE-8cGKIODbEoVz6BvMoF8aH08AuRCHVoROndIJP4sB0BSc5EIRp3QyiIFwIHX1FWtTi0hu-dtJ-dWc-1sf_yeyrTOUaK9P5SlVes-V45651URsn5ZvYY9BmqgbMB32jrTDdgic9MSR7b-X-ONs5U-MqGvmkxeRhQt_V2jJ5Rr6-bNtSHrqIMb_g_DhyepXxoJSfS2Jr4snmEJRlLB_Xc5l3yMXQr-QePATzkktHQFx-ri0SLnvivhwmmDg4RmFma4_E_RSTEcOFNACzVfrqEuvytQILY

My point in this thread has always been that it is an anti-feature to
presume that there is a need to treat EPC and VLA permissions
separately. This is a performance sink and it optimizes for a use case
which doesn't exist. Nobody actually wants there to be a mismatch
between EPC and VLA permissions.

So, besides EMODT, the only userspace interface we need is
mmap()/mprotect()/munmap(). The kernel should either succeed the
mmap()/mprotect()/munmap() syscall if the EPC permissions can be made
compatible or should fail otherwise.

Another interesting property arises from this flow. Since the EPC and
VLA permissions are always synchronized from the perspective of
userspace, in cases where the memory state between the kernel and the
exec layer is roughly synchronous, bookkeeping in the shim can be
implemented without any persistent memory between syscall handling
events. So, for example, the shim can implement brk() and
mmap()/munmap()/mprotect() with just two pointers: one to the break
position and one to the lowest mmap().

It is true that this basically commits enclave authors to doing all
EACCEPT calls immediately after modifications. But I suspect everyone
will do this anyway since there is no efficient (read: performant) way
for shims to handle page faults. So trying to do this lazily will just
result in a huge decrease in performance.