Re: BUG: Sporadic crashes with current Linus tree

From: Thomas Gleixner
Date: Fri Sep 15 2017 - 03:09:45 EST


On Thu, 14 Sep 2017, Andy Lutomirski wrote:
> On Thu, Sep 14, 2017 at 9:00 AM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> > On Thu, 14 Sep 2017, Andy Lutomirski wrote:
> >> On Thu, Sep 14, 2017 at 12:38 AM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> >> > Hi!
> >> >
> >> > I've seen the following crash sporadically with commit 46c1e79fee:
> >> >
> >> > Have not seen that with 3882a734c19b, though I saw the PCID warnings on
> >> > that machine.
> >> >
> >> > I have no idea how to reproduce so bisecting is pretty much pointless. Any
> >> > idea what to do?
> >>
> >> Does tools/testing/selftests/x86/sigreturn_64 reproduce it?
> >
> > Will try tomorrow once I figured out how to compile that stuff. Invoking a
> > simple make in that directory fails.
>
> What's the error? It works for me.

gcc -m64 -o /home/tglx/work/kernel/linus/linux/tools/testing/selftests/x86/sysret_ss_attrs_64 -O2 -g -std=gnu99 -pthread -Wall sysret_ss_attrs.c thunks.S -lrt -ldl
/usr/bin/ld: /tmp/cco4vSkU.o: relocation R_X86_64_32S against `.text' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status

> >
> > Built it manually and when I run it tells: stack16 is too high
> >
> >> Ugh, weird. It kind of looks like current->thread.sp0 == NULL. I
> >> have a patch series that changes a bunch of that code in my git tree,
> >> but that's definitely not in Linus' tree.
> >
> > Right. The stupid thing is that the machine did not throw up all day
> > neither idle nor loaded. Still the same kernel which barfed tonight several
> > times.
>
> This is weird. The crashing process is rsyslogd, which should have
> been running for a long time and shouldn't have any strange state. I
> wonder if this is some kind of memory corruption. There would have to
> be corruption of thread_struct *and* some kind of issue causing IRET
> to fail, though.
>
> The attached patch could plausibly give some useful hint.

I'll put it on that machine and hope it will reproduce. Didn't die since
yesterday morning ....

Thanks,

tglx