Re: Random shadow stack pointer corruption

From: H.J. Lu
Date: Tue Jul 28 2020 - 20:36:15 EST


On Sat, Jul 18, 2020 at 4:35 PM Yu-cheng Yu <yu-cheng.yu@xxxxxxxxx> wrote:
>
> On Sat, 2020-07-18 at 15:41 -0700, Dave Hansen wrote:
> > On 7/18/20 11:24 AM, Yu-cheng Yu wrote:
> > > On Sat, 2020-07-18 at 11:00 -0700, Andy Lutomirski wrote:
> > > > On Sat, Jul 18, 2020 at 10:58 AM Yu-cheng Yu <yu-cheng.yu@xxxxxxxxx> wrote:
> > > > > Hi,
> > > > >
> > > > > My shadow stack tests start to have random shadow stack pointer corruption after
> > > > > v5.7 (excluding). The symptom looks like some locking issue or the kernel is
> > > > > confused about which CPU a task is on. In later tip/master, this can be
> > > > > triggered by creating two tasks and each does continuous
> > > > > pthread_create()/pthread_join(). If the kernel has max_cpus=1, the issue goes
> > > > > away. I also checked XSAVES/XRSTORS, but this does not seem to be an issue
> > > > > coming from there.
> > > >
> > > > What do you mean "shadow stack pointer corruption"? Is SSP itself
> > > > corrupt while running in the kernel? Is one of the MSRs getting
> > > > corrupted? Is the memory to which the shadow stack points getting
> > > > corrupted? Is the CPU rejecting an attempt to change SSP?
> > >
> > > What I see is, a new thread after ret_from_fork() and iret back to ring-3,
> > > its shadow stack pointer (MSR_IA32_PL3_SSP) is corrupted.
> >
> > Does corrupt mean random? Or is it a valid stack address, just not for
> > _this_ thread? Or NULL? Or is it a kernel address? Have you tried
> > tracing *ALL* the WRMSR's and XRSTOR's that write to the MSR?
>
> When a shadow stack address is changed, the address appears to be other task's.
> I traced all WRMSR's and XRSTOR's. I also verified there have not been any
> XRSTORS from a wrong buffer. When rc6 is tagged, I will re-base, test, and
> share current patches.
>

We have identified that

commit 91eeafea1e4b7c95cc4f38af186d7d48fceef89a
Author: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Date: Thu May 21 22:05:28 2020 +0200

x86/entry: Switch page fault exception to IDTENTRY_RAW

Convert page fault exceptions to IDTENTRY_RAW:

- Implement the C entry point with DEFINE_IDTENTRY_RAW
- Add the CR2 read into the exception handler
- Add the idtentry_enter/exit_cond_rcu() invocations in
the regular page fault handler and in the async PF
part.
- Emit the ASM stub with DECLARE_IDTENTRY_RAW
- Remove the ASM idtentry in 64-bit
- Remove the CR2 read from 64-bit
- Remove the open coded ASM entry code in 32-bit
- Fix up the XEN/PV code
- Remove the old prototypes

No functional change.

triggered the shadow stack corruption when the process returned from a syscall.
The SSP MSR was somehow changed between the write to the SSP MSR and the IRET.
Could there be a page fault between setting the SSP MSR and the IRET?
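
For anyone who wants to try this, here is a rough sketch of the kind of
reproducer described in the quoted report: two tasks, each doing continuous
pthread_create()/pthread_join(). It is not the actual test. The name
stress_tasks(), its parameters, and the choice of fork()ed processes as the
"tasks" are assumptions made for illustration only.

```c
/* Sketch only: several tasks, each doing back-to-back
 * pthread_create()/pthread_join(). stress_tasks() and its parameters
 * are hypothetical names, not taken from the real shadow stack tests.
 * Build with: cc -pthread stress.c */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Thread body: return at once so thread setup/teardown dominates. */
static void *thread_fn(void *arg)
{
    return arg;
}

/* One task: create and join a thread `iters` times; 0 on success. */
static int create_join_loop(long iters)
{
    for (long i = 0; i < iters; i++) {
        pthread_t t;

        if (pthread_create(&t, NULL, thread_fn, NULL) != 0)
            return 1;
        if (pthread_join(t, NULL) != 0)
            return 1;
    }
    return 0;
}

/* Fork `ntasks` children that run the loop concurrently.
 * Returns 0 if every child exits cleanly, 1 otherwise (for example
 * when a child is killed by a signal). */
int stress_tasks(int ntasks, long iters)
{
    int failed = 0;
    int started = 0;

    assert(ntasks > 0 && iters > 0);

    for (int i = 0; i < ntasks; i++) {
        pid_t pid = fork();

        if (pid < 0) {
            failed = 1;
            break;
        }
        if (pid == 0)
            _exit(create_join_loop(iters));
        started++;
    }

    /* Reap every child that was started and check its exit status. */
    for (int i = 0; i < started; i++) {
        int status;

        if (wait(&status) < 0 || !WIFEXITED(status) ||
            WEXITSTATUS(status) != 0)
            failed = 1;
    }
    return failed;
}
```

Calling stress_tasks(2, N) from main() with a large N on a multi-CPU machine
should exercise the same create/join pattern. Whether it hits the corruption
will depend on the kernel and on the shadow stack patches being applied.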

--
H.J.