Re: [PATCH 0/9] x86/dumpstack: Cleanups and user opcode bytes Code: section, v2

From: Josh Poimboeuf
Date: Tue Apr 17 2018 - 16:17:01 EST


On Tue, Apr 17, 2018 at 04:40:42PM +0200, Borislav Petkov wrote:
> On Thu, Mar 15, 2018 at 10:51:06AM -0700, Linus Torvalds wrote:
> > This version looks ok to me. I'm sure there's room for tweaking here,
> > but I'm not seeing anything alarming.
>
> So I'm redoing the series on top of 17-rc1 and I see a *lot* of output
> during testing. For example:
>
> 1) is from the userspace fault, 2) is the panic from sysrq but then you have 3)
> which is
>
> WARN_ON_ONCE(!cpu_online(new_cpu));
>
> in set_task_cpu() and to top it all off, we have 4) coming from
> native_smp_send_reschedule():
>
> static void native_smp_send_reschedule(int cpu)
> {
> 	if (unlikely(cpu_is_offline(cpu))) {
> 		WARN(1, "sched: Unexpected reschedule of offline CPU#%d!\n", cpu);
>
> so all the "fine tuning" we did to try to fit the most important splat
> on the screen is for shit because those loud WARNs simply pushed it all
> up into oblivion.
>
> And the executive summary and registers are just as worthless in such a
> case.
>
> We could start thinking about caching all that data from the very first
> splat, when we're not tainted yet and dump it last but then we can't
> even know what is going out last.
>
> Not only because we can't guess from where stuff might warn and what
> could execute (the below splats being a case in point) but also, and more
> importantly, because we don't know how much of that data would actually go out
> as there are no guarantees *when* the machine will die and stop spewing
> to the serial port.
>
> So the most important splat coming out first is maybe a good thing
> because it has a higher chance of coming out before the box locks up
> completely.
>
> So I guess we should keep hoping that serial console works and keeps on
> working...
>
> Hmmm.

I don't think the stack tracing code could do anything better here. #3
and #4 look like a scheduler issue: it doesn't realize the rest of the
CPUs have all been taken offline due to the panic().
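
To make the suspected interaction concrete, here is a rough userspace
model of it. This is a sketch only, not kernel code: every name prefixed
with model_ is hypothetical, and the quiet_after_panic switch is an
assumed policy for illustration, not an existing kernel behaviour or a
proposed patch.

```c
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

/* Hypothetical stand-ins for kernel state; none are real kernel symbols. */
static bool model_cpu_online[NR_CPUS] = { true, true, true, true };
static bool model_panic_in_progress;
static int model_warn_count;

/* Model of panic(): every CPU except the panicking one goes offline. */
static void model_panic(int panic_cpu)
{
	model_panic_in_progress = true;
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu != panic_cpu)
			model_cpu_online[cpu] = false;
}

/*
 * Model of a reschedule request. With quiet_after_panic == false an
 * offline target CPU triggers a loud warning, as in the quoted code.
 * With quiet_after_panic == true the warning is skipped once a panic is
 * under way (assumed policy, for illustration only).
 */
static void model_send_reschedule(int cpu, bool quiet_after_panic)
{
	if (!model_cpu_online[cpu]) {
		if (quiet_after_panic && model_panic_in_progress)
			return;
		model_warn_count++;
		fprintf(stderr,
			"model: unexpected reschedule of offline CPU#%d\n", cpu);
		return;
	}
	/* A real implementation would send the reschedule IPI here. */
}
```

In this model, a caller that keeps targeting CPUs after model_panic()
produces one warning per request unless it knows a panic is in progress,
which is the kind of noise that pushes the first splat off the screen.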

--
Josh