Re: frequent lockups in 3.18rc4

From: Dave Jones
Date: Wed Nov 19 2014 - 16:49:36 EST


On Wed, Nov 19, 2014 at 01:01:36PM -0800, Andy Lutomirski wrote:

> TIF_NOHZ is not the same thing as NOHZ. Can you try a kernel with
> CONFIG_CONTEXT_TRACKING=n? Doing that may involve fiddling with RCU
> settings a bit. The normal no HZ idle stuff has nothing to do with
> TIF_NOHZ, and you either have TIF_NOHZ set or you have some kind of
> thread_info corruption going on here.

I'll try that next.

> > RSP: 0018:ffff880192d2fee8 EFLAGS: 00000246
> > RAX: 0000000000000000 RBX: 0000000100000046 RCX: 000000336ee35b47
>
>                                     ^^^^^^^^^
>
> That is a strange coincidence. Where did 0x46 | (1<<32) come from?
> That's a sensible interrupts-disabled flags value with the high part set
> to 0x1. Those high bits are undefined, but they ought to all be zero.
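
For anyone following along, here is a rough sketch of how that RBX value
breaks down when read as a flags word. The bit positions are the standard
x86 EFLAGS ones; the helper name decode_flags is made up for illustration
and is not from the kernel or from anything else in this thread.

```python
# Status/control bit positions from the x86 definition of EFLAGS/RFLAGS.
FLAG_BITS = {
    0: "CF", 2: "PF", 4: "AF", 6: "ZF", 7: "SF",
    8: "TF", 9: "IF", 10: "DF", 11: "OF",
}
RESERVED_ONE = 1 << 1   # bit 1 always reads as 1
IF_MASK = 1 << 9        # interrupt-enable flag


def decode_flags(value):
    """Hypothetical helper: split a 64-bit value into a flags-style view."""
    low = value & 0xFFFFFFFF    # the architecturally meaningful half
    high = value >> 32          # expected to be zero for a real flags value
    names = [name for bit, name in sorted(FLAG_BITS.items())
             if low & (1 << bit)]
    return {
        "low": low,
        "high": high,
        "flags": names,
        "reserved_bit1": bool(low & RESERVED_ONE),
        "interrupts_enabled": bool(low & IF_MASK),
    }


# Values copied from the register dump quoted above.
rbx = decode_flags(0x0000000100000046)
eflags = decode_flags(0x00000246)

print("RBX   :", rbx)
print("EFLAGS:", eflags)
```

Read this way, the low half of RBX is the EFLAGS value from the same dump
(0x246) with only IF cleared, and the stray 0x1 sits entirely in the upper
32 bits.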

This box is usually pretty solid, but it's been in service as a 24/7
fuzzing box for over a year now, so it's not outside the realm of
possibility that this could all be a hardware fault if some memory
has gone bad or the like. Unless we find something obvious in the
next few days, I'll try running memtest over the weekend (though
I've seen situations where that doesn't stress hardware enough to
manifest a problem, so it might not be entirely conclusive unless
it actually finds a fault).

I wish I had a second identical box to see if it would be reproducible.

> > [<ffffffff941689c6>] perf_read+0x226/0x370
> > [<ffffffff942fbfb7>] ? security_file_permission+0x87/0xa0
> > [<ffffffff941eafff>] vfs_read+0x9f/0x180
> > [<ffffffff941ebbd8>] SyS_read+0x58/0xd0
> > [<ffffffff947d42c9>] tracesys_phase2+0xd4/0xd9
>
> Riddle me this: what are we doing in tracesys_phase2? This is a full
> slow-path syscall. TIF_NOHZ doesn't cause that, I think. I'd love to
> see the value of ti->flags here. Is trinity using ptrace?

That's one of the few syscalls we actually blacklist (mostly because it
requires some more thinking: just passing it crap can get the fuzzer
into a confused state where it thinks child processes are dead when
they aren't, etc.). So it shouldn't ever be calling ptrace.

Dave
