Re: ppc64le reliable stack unwinder and scheduled tasks

From: Joe Lawrence
Date: Sun Jan 13 2019 - 23:09:44 EST


On Fri, Jan 11, 2019 at 08:51:54AM +0100, Nicolai Stange wrote:
> Joe Lawrence <joe.lawrence@xxxxxxxxxx> writes:
>
> > On Fri, Jan 11, 2019 at 01:00:38AM +0100, Nicolai Stange wrote:

[ ... snip ... ]

> >> For testing, could you try whether clearing the word at STACK_FRAME_MARKER
> >> from _switch() helps?
> >>
> >> I.e. something like (completely untested):
> >
> > I'll kick off some builds tonight and try to get tests lined up
> > tomorrow. Unfortunately these take a bit of time to run setup, schedule
> > and complete, so perhaps by next week.
>
> Ok, that's probably still a good test for confirmation, but the solution
> you proposed below is much better.

Good news, 40 tests (RHEL-7 backport) with clearing out the
STACK_FRAME_MARKER were all successful. Previously I would see stalled
livepatch transitions due to the unreliable report in ~10% of test
cases.

> >> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> >> index 435927f549c4..b747d0647ec4 100644
> >> --- a/arch/powerpc/kernel/entry_64.S
> >> +++ b/arch/powerpc/kernel/entry_64.S
> >> @@ -596,6 +596,10 @@ _GLOBAL(_switch)
> >> SAVE_8GPRS(14, r1)
> >> SAVE_10GPRS(22, r1)
> >> std r0,_NIP(r1) /* Return to switch caller */
> >> +
> >> + li r23,0
> >> + std r23,96(r1) /* 96 == STACK_FRAME_MARKER * sizeof(long) */
> >> +
> >> mfcr r23
> >> std r23,_CCR(r1)
> >> std r1,KSP(r3) /* Set old stack pointer */
> >>
> >>
> >
> > This may be sufficient to avoid the condition, but if the contents of
> > frame 0 are truely uninitialized (not to be trusted), should the
> > unwinder be even looking at that frame (for STACK_FRAMES_REGS_MARKER),
> > aside from the LR and other stack size geometry sanity checks?
>
> That's a very good point: we'll only ever be examining scheduled out tasks
> and current (which at that time is running klp_try_complete_transition()).
>
> current won't be in an interrupt/exception when it's walking its own
> stack. And scheduled out tasks can't have an exception/interrupt frame
> as their frame 0, correct?
>
> Thus, AFAICS, whenever klp_try_complete_transition() finds a
> STACK_FRAMES_REGS_MARKER in frame 0, it is known to be garbage, as you
> said.

I gave this about 20 tests and they were also all successful, what do
you think? (The log could probably use some trimming and cleanup.) I
think either solution would fix the issue.

-->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8--