Re: [run_timer_softirq] BUG: unable to handle kernel paging request at 0000000000010007

From: Fengguang Wu
Date: Sat Nov 11 2017 - 10:35:14 EST


On Fri, Nov 10, 2017 at 10:29:59PM +0100, Thomas Gleixner wrote:
On Fri, 10 Nov 2017, Linus Torvalds wrote:

On Wed, Nov 8, 2017 at 9:19 PM, Fengguang Wu <fengguang.wu@xxxxxxxxx> wrote:
>
> Yes it's accessing the list. Here is the faddr2line output.

Ok, so it's a corrupted timer list. Which is not a big surprise.

It's

next->pprev = pprev;

in __hlist_del(), and the trapping instruction decodes as

mov %rdx,0x8(%rax)

with %rax having the value dead000000000200, which is just LIST_POISON2.

So we've deleted that entry twice - LIST_POISON2 is what hlist_del()
sets pprev to after already deleting it once.

Although in this case it might not be hlist_del(), because
detach_timer() also sets entry->next to LIST_POISON2.

Which is pretty bogus: we are supposed to use LIST_POISON1 for the
"next" pointer. Oh well. Nobody cares, except for the list entry
debugging code, which isn't run on the hlist cases.
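
To make the failure mode above concrete, here is a minimal userspace sketch of a double detach. It is not the kernel code: the struct layout and the LIST_POISON2 value are re-created from the description above, assuming a 64-bit build. The names checked_hlist_del(), sketch_detach_timer() and double_detach_fault_addr() are hypothetical helpers invented for this illustration. Instead of faulting, the second unlink reports the address it would have written to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace copies of the kernel's hlist types. */
struct hlist_node {
    struct hlist_node *next;
    struct hlist_node **pprev;
};

struct hlist_head {
    struct hlist_node *first;
};

/* Assumption: the 64-bit x86 poison value seen in %rax in the report. */
#define LIST_POISON2 ((struct hlist_node *)(uintptr_t)0xdead000000000200ULL)

/*
 * Same unlink steps as __hlist_del(), except that a poisoned "next" is
 * detected up front. In that case nothing is written and the address that
 * "next->pprev = pprev" would have stored to is returned. Returns 0 after
 * a normal unlink.
 */
static uintptr_t checked_hlist_del(struct hlist_node *n)
{
    struct hlist_node *next = n->next;
    struct hlist_node **pprev = n->pprev;

    if (next == LIST_POISON2)
        return (uintptr_t)next + offsetof(struct hlist_node, pprev);

    *pprev = next;
    if (next)
        next->pprev = pprev;
    return 0;
}

/* Hypothetical stand-in for detach_timer(): unlink, then poison "next". */
static uintptr_t sketch_detach_timer(struct hlist_node *entry)
{
    uintptr_t fault = checked_hlist_del(entry);

    if (fault == 0)
        entry->next = LIST_POISON2;
    return fault;
}

/* Builds head -> a -> b, detaches "a" twice, returns the would-be fault address. */
static uintptr_t double_detach_fault_addr(void)
{
    struct hlist_head head;
    struct hlist_node a, b;

    head.first = &a;
    a.next = &b;
    a.pprev = &head.first;
    b.next = NULL;
    b.pprev = &a.next;

    /* The first detach is legitimate and leaves head -> b. */
    uintptr_t first = sketch_detach_timer(&a);
    assert(first == 0);
    assert(head.first == &b && b.pprev == &head.first);

    /* The second detach sees the poisoned "next" pointer. */
    return sketch_detach_timer(&a);
}
```

On a 64-bit build, double_detach_fault_addr() should return LIST_POISON2 plus 8, the offset of pprev. That matches the store to 0x8(%rax) with %rax holding dead000000000200.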

Adding Thomas Gleixner to the cc. It should not be possible to delete
the same timer twice.

Right, it shouldn't.

Fengguang, can you please enable:

CONFIG_DEBUG_OBJECTS
CONFIG_DEBUG_OBJECTS_TIMERS

and try to reproduce? Debugobjects should hopefully catch that.

Sure. However, I have not gotten any results so far -- it's rather hard
to reproduce. I'll check for results tomorrow.

Regards,
Fengguang