Re: [PATCH V2 2/7] x86/entry: Switch the stack after error_entry() returns

From: Lai Jiangshan
Date: Mon Mar 07 2022 - 09:39:49 EST


On Thu, Mar 3, 2022 at 4:00 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Thu, Mar 03, 2022 at 11:54:29AM +0800, Lai Jiangshan wrote:
> > From: Lai Jiangshan <jiangshan.ljs@xxxxxxxxxxxx>
> >
> > error_entry() calls sync_regs() to settle/copy the pt_regs and switches
> > the stack directly after sync_regs(). But error_entry() itself is also
> > a function call, the switching has to handle the return address of it
> > together, which causes the work complicated and tangly.
> >
> > Switching to the stack after error_entry() makes the code simpler and
> > intuitive.
> >
> > Signed-off-by: Lai Jiangshan <jiangshan.ljs@xxxxxxxxxxxx>
> > ---
> > arch/x86/entry/entry_64.S | 16 ++++++----------
> > 1 file changed, 6 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> > index 24846284eda5..a51cad2b7fff 100644
> > --- a/arch/x86/entry/entry_64.S
> > +++ b/arch/x86/entry/entry_64.S
> > @@ -323,6 +323,8 @@ SYM_CODE_END(ret_from_fork)
> > .macro idtentry_body cfunc has_error_code:req
> >
> > call error_entry
> > + movq %rax, %rsp /* switch stack settled by sync_regs() */
> > + ENCODE_FRAME_POINTER
> > UNWIND_HINT_REGS
> >
> > movq %rsp, %rdi /* pt_regs pointer into 1st argument*/
> > @@ -980,14 +982,10 @@ SYM_CODE_START_LOCAL(error_entry)
> > /* We have user CR3. Change to kernel CR3. */
> > SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
> >
> > + leaq 8(%rsp), %rdi /* arg0 = pt_regs pointer */
> > .Lerror_entry_from_usermode_after_swapgs:
> > /* Put us onto the real thread stack. */
> > - popq %r12 /* save return addr in %12 */
> > - movq %rsp, %rdi /* arg0 = pt_regs pointer */
> > call sync_regs
> > - movq %rax, %rsp /* switch stack */
> > - ENCODE_FRAME_POINTER
> > - pushq %r12
> > RET
> >
> > /*
> > @@ -1019,6 +1017,7 @@ SYM_CODE_START_LOCAL(error_entry)
> > */
> > .Lerror_entry_done_lfence:
> > FENCE_SWAPGS_KERNEL_ENTRY
> > + leaq 8(%rsp), %rax /* return pt_regs pointer */
> > RET
> >
> > .Lbstep_iret:
> > @@ -1039,12 +1038,9 @@ SYM_CODE_START_LOCAL(error_entry)
> > * Pretend that the exception came from user mode: set up pt_regs
> > * as if we faulted immediately after IRET.
> > */
> > - popq %r12 /* save return addr in %12 */
> > - movq %rsp, %rdi /* arg0 = pt_regs pointer */
> > + leaq 8(%rsp), %rdi /* arg0 = pt_regs pointer */
> > call fixup_bad_iret
> > - mov %rax, %rsp
> > - ENCODE_FRAME_POINTER
> > - pushq %r12
> > + mov %rax, %rdi
> > jmp .Lerror_entry_from_usermode_after_swapgs
> > SYM_CODE_END(error_entry)
>
> I can't seem to follow; perhaps I need more wake-up-juice?
>
> Suppose we have .Lerror_bad_iret, then the code flow is something like:
>
>
> old new
>
> .Lerror_bad_iret
> SWAWPGS business
>
> popq %r12 leaq 8(%rsp), %rsi
> movq %rsp, %rdi
> call fixup_bad_iret call fixup_bad_iret
> mov %rax, %rsp mov %rax, %rdi
> ENCODE_FRAME_POINTER
> pushq %r12
>
> jmp .Lerror_entry_from_usermode_after_swapgs
>
> .Lerror_entry_from_usermode_after_swapgs
> pop %r12
> movq %rsp, %rdi
> call sync_regs call sync_regs
> mov %rax, %rsp
> ENCODE_FRAME_POINTER
> push %r12
> RET RET
>
>
> The thing that appears to me is that syn_regs is called one stack switch
> short. This isn't mentioned in the Changelog nor in the comments. Why is
> this OK?
>
>

I has not been confident enough with the meaning of "short" or
"call short" in my understanding of English.

So I tried hard to find the semantic difference between the code
before and after the patch.

The logic are the same:
1) feed fixup_bad_iret() with the pt_regs pushed by ASM code
2) fixup_bad_iret() moves the partial pt_regs up
3) feed sync_regs() with the pt_regs returned by fixup_bad_iret()
4) sync_regs() copies the whole pt_regs to kernel stack if needed
5) after error_entry(), it is in kernel stack with the pt_regs

Differences:
Old code switches to copied pt_regs immediately twice in
error_entry() while new code switches to the copied pt_regs only
once after error_entry() returns.
It is correct since sync_regs() doesn't need to be called close
to the pt_regs it handles. And this is the point of this
patch which switches stack once only.

Old code stashes the return-address of error_entry() in a scratch
register and new code doesn't stash it.
It relies on the fact that fixup_bad_iret() and sync_regs() don't
corrupt the return-address of error_entry(). But even the old code
also relies on the fact that fixup_bad_iret() and sync_regs() don't
corrupt the return-address of themselves. They are the same reliance.

The code had been tested with my additial tracing code in kernel and
the x86 selftest code which includes all kinds of cases for bad iret.
(tools/testing/selftests/x86/sigreturn.c)