Re: [PATCH] arch/arm64 : fix error in dump_backtrace

From: Dave Martin
Date: Tue Nov 06 2018 - 09:24:54 EST


On Tue, Nov 06, 2018 at 12:29:33PM +0000, Mark Rutland wrote:
> On Tue, Nov 06, 2018 at 11:32:50AM +0000, Dave P Martin wrote:
> > On Tue, Nov 06, 2018 at 11:00:19AM +0000, Mark Rutland wrote:
> > > On Tue, Nov 06, 2018 at 08:57:51AM +0000, Daniel Thompson wrote:
> > > > On Tue, Nov 06, 2018 at 08:39:01AM +0000, Mark Rutland wrote:
> > > > > On Tue, Nov 06, 2018 at 03:19:35PM +0800, Zhaoyang Huang wrote:
> > > > > > From: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
> > > > > >
> > > > > > In some cases, the instruction of "bl foo1" will be the last one of the
> > > > > > foo2[1], which will cause the lr be the first instruction of the adjacent
> > > > > > foo3[2]. Hence, the backtrace will show the weird result as bellow[3].
> > > > > > The patch will fix it by miner 4 of the lr when dump_backtrace
> > > > >
> > > > > This has come up in the past (and a similar patch has been applied, then
> > > > > reverted).
> > > > >
> > > > > In general, we don't know that a function call was made via BL, and therefore
> > > > > cannot know that LR - 4 is the address of the caller. The caller could set up
> > > > > the LR as it likes, then B or BR to the callee, and depending on how the basic
> > > > > blocks get laid out in memory, LR - 4 might point at something completely
> > > > > different.
> > > > >
> > > > > More ideally, the compiler wouldn't end a function with a BL. When does that
> > > > > happen, and is there some way we could arrange for that to not happen? e.g.
> > > > > somehow pad a NOP after the BL.
> > > >
> > > > It's a consequence of having __noreturn isn't it? __noreturn frees the
> > > > compiler from the burden of having to produce a valid return stack... so
> > > > it doesn't and unwinding becomes hard.
> > >
> > > In that case, the compiler could equally just use B rather than BL,
> > > which this patch doesn't solve.
> > >
> > > The documentation for the GCC noreturn attribute [1] says:
> > >
> > > | In order to preserve backtraces, GCC will never turn calls to noreturn
> > > | functions into tail calls.
> > >
> > > ... so clearly it's not intended to mess up backtracing.
> >
> > Which is a bit odd, since every call to a noreturn function is a tail-
> > call by definition, and other tail-calls are typically optimised to a B
> > (thus interfering with backtracing in all except the noreturn case).
> >
> > Avoiding this would require a change to the compiler, and since there's
> > no obvious correct answer anyway, I guess we shouldn't hold our breath.
> >
> > > IIUC we mostly use noreturn to prevent warings about uninitialised
> > > variables and such after a call to a noreturn function. I think
> > > optimization is a secondary concern.
> >
> > Agreed.
> >
> > > We could ask the GCC folk if they can ensure that a noreturn function
> > > call leave thes LR pointing into the caller, e.g. by padding with a NOP:
> > >
> > > BL <noreturn function>
> > > NOP
> > >
> > > That seems cheap enough, and would keep backtraces reliable.
> >
> > -fpatchable-function-entry=1,1 does almost the right thing, by
> > inserting 1 NOP at the start of each function, and putting the function
> > entry point after that (1) NOP.
>
> Neat hack, but unfortunately insufficient in general since:
>
> * The next function may be notrace or asm, and hence will not have a
> preceding NOP.

Usually not, but yes, it would be true for some cases.

> * There may not be a next function (e.g. for the final instruction in
> .text), and the LR value may point at some data symbol.

Ditto.

> * This relies on the preceding NOP being accounted as part of the
> previous function, which feels like a bug given we should have the
> function size somewhere.

I assumed that the function sizes (in the ELF symbol senses) are simply
ignored and the kernel presumes that each function ends where the next
function starts.

If so, the NOP would be accounted with the preceding function (if any).
I may be misremembering how the kallsyms determines symbol sizes though.

> Generally, I think that trying to bodge around the exiting behaviour is
> going to cause just as many problems as it solves, and worse, makes it
> harder to consistently analyse a backtrace.
>
> IMO, we shouldn't change the kernel here.

Agreed. -fpatchable-function-entry is an interesting hack here, but
doesn't solve the whole problem and partly works by accident
(unfortunately).

Cheers
---Dave