Re: x86: WARNING: kernel stack frame pointer in kworker has bad value

From: Josh Poimboeuf
Date: Tue Jan 24 2017 - 10:48:27 EST


On Tue, Jan 24, 2017 at 02:37:28PM +0100, Dmitry Vyukov wrote:
> On Mon, Jan 23, 2017 at 11:50 PM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
> > On Mon, Jan 23, 2017 at 07:43:18PM +0100, Dmitry Vyukov wrote:
> >> Hello,
> >>
> >> I am getting the following warnings while running syzkaller fuzzer on
> >> 7a308bb3016f57e5be11a677d15b821536419d36:
> >>
> >> WARNING: kernel stack frame pointer at ffff880065f7fef8 in
> >> kworker/1:3:21075 has bad value ffffffff81598080
> >> unwind stack type:0 next_sp: (null) mask:2 graph_idx:0
> >> ...
> >>
> >> Here are 4 examples:
> >>
> >> https://gist.githubusercontent.com/dvyukov/7258fca11e5c3db6e8f8fb684e6168ff/raw/2ca14f3c7294c7e466611f8ecfb3072c676a657c/gistfile1.txt
> >> https://gist.githubusercontent.com/dvyukov/cc442b144ae712f0e1bdeb065085e36a/raw/f713cef7e9689914fc94838a0a7d4192224fb6d2/gistfile1.txt
> >> https://gist.githubusercontent.com/dvyukov/9058cce2647e3d37157c283c6400498c/raw/7a8ec8d71b9e95eeb24156073eeef96ba61d95ea/gistfile1.txt
> >> https://gist.githubusercontent.com/dvyukov/b899dfa72d611b511a674b63676921f3/raw/90203ed26b6674dedba8c1c16dd76309ef21fbea/gistfile1.txt
> >>
> >> Unfortunately this is not reproducible, but there is a stack dump, so
> >> maybe it's enough to figure out the root cause.
> >
> > This seems like some kind of stack corruption. Here are the stack
> > traces:
> >
> >> https://gist.githubusercontent.com/dvyukov/7258fca11e5c3db6e8f8fb684e6168ff/raw/2ca14f3c7294c7e466611f8ecfb3072c676a657c/gistfile1.txt
> >
> > ffff8800520eef68: ffffffff812b148e (__save_stack_trace+0x6e/0xd0)
> > ffff8800520eefe8: ffffffff812b1506 (save_stack_trace+0x16/0x20)
> > ffff8800520eeff8: ffffffff81a0c843 (save_stack+0x43/0xd0)
> > ffff8800520ef228: ffffffff81a0d17f (kasan_slab_free+0x6f/0xb0)
> > ffff8800520ef250: ffffffff81a0a4e1 (kmem_cache_free+0x71/0x240)
> > ffff8800520ef280: ffffffff822924d4 (ioc_release_fn+0x2e4/0x380)
> > ffff8800520ef550: ffffffff814b4900 (process_one_work+0xbd0/0x1c10)
> > ffff8800520ef9a8: ffffffff814b5b63 (worker_thread+0x223/0x1990)
> > ffff8800520eff00: ffffffff814ce526 (kthread+0x326/0x3f0)
> >
> >> https://gist.githubusercontent.com/dvyukov/cc442b144ae712f0e1bdeb065085e36a/raw/f713cef7e9689914fc94838a0a7d4192224fb6d2/gistfile1.txt
> >
> > ffff88005b40ef68: ffffffff812b260e (__save_stack_trace+0x6e/0xd0)
> > ffff88005b40efe8: ffffffff812b2686 (save_stack_trace+0x16/0x20)
> > ffff88005b40eff8: ffffffff81a0e8c3 (save_stack+0x43/0xd0)
> > ffff88005b40f228: ffffffff81a0f1ff (kasan_slab_free+0x6f/0xb0)
> > ffff88005b40f250: ffffffff81a0c561 (kmem_cache_free+0x71/0x240)
> > ffff88005b40f280: ffffffff822943b4 (ioc_release_fn+0x2e4/0x380)
> > ffff88005b40f550: ffffffff814b6980 (process_one_work+0xbd0/0x1c10)
> > ffff88005b40f9a8: ffffffff814b7be3 (worker_thread+0x223/0x1990)
> > ffff88005b40ff00: ffffffff814d05a6 (kthread+0x326/0x3f0)
> >
> >> https://gist.githubusercontent.com/dvyukov/9058cce2647e3d37157c283c6400498c/raw/7a8ec8d71b9e95eeb24156073eeef96ba61d95ea/gistfile1.txt
> >
> > ffff880065f7ef68: ffffffff812b260e (__save_stack_trace+0x6e/0xd0)
> > ffff880065f7efe8: ffffffff812b2686 (save_stack_trace+0x16/0x20)
> > ffff880065f7eff8: ffffffff81a0e8c3 (save_stack+0x43/0xd0)
> > ffff880065f7f228: ffffffff81a0f1ff (kasan_slab_free+0x6f/0xb0)
> > ffff880065f7f250: ffffffff81a0c561 (kmem_cache_free+0x71/0x240)
> > ffff880065f7f280: ffffffff822943b4 (ioc_release_fn+0x2e4/0x380)
> > ffff880065f7f550: ffffffff814b6980 (process_one_work+0xbd0/0x1c10)
> > ffff880065f7f9a8: ffffffff814b7be3 (worker_thread+0x223/0x1990)
> > ffff880065f7ff00: ffffffff814d05a6 (kthread+0x326/0x3f0)
> >
> >> https://gist.githubusercontent.com/dvyukov/b899dfa72d611b511a674b63676921f3/raw/90203ed26b6674dedba8c1c16dd76309ef21fbea/gistfile1.txt
> >
> > ffff880065e57220: ffffffff812b148e (__save_stack_trace+0x6e/0xd0)
> > ffff880065e572a0: ffffffff812b1506 (save_stack_trace+0x16/0x20)
> > ffff880065e572b0: ffffffff81a0c843 (save_stack+0x43/0xd0)
> > ffff880065e574e0: ffffffff81a0cb0a (kasan_kmalloc+0xaa/0xd0)
> > ffff880065e57520: ffffffff81a0d102 (kasan_slab_alloc+0x12/0x20)
> > ffff880065e57530: ffffffff81a08704 (kmem_cache_alloc+0xf4/0x680)
> > ffff880065e57598: ffffffff823293b3 (ida_pre_get+0x1f3/0x3b0)
> > ffff880065e57720: ffffffff823296d8 (ida_simple_get+0x168/0x320)
> > ffff880065e578e8: ffffffff814ad926 (create_worker+0x96/0x620)
> > ffff880065e579a8: ffffffff814b68ad (worker_thread+0xf6d/0x1990)
> > ffff880065e57f00: ffffffff814ce526 (kthread+0x326/0x3f0)
> >
> > In all four cases, the frame pointer pushed by worker_thread() is
> > somehow changed to be the address at the end of
> > debug_check_no_locks_freed():
> >
> > ffff880065f7fef8: ffffffff81598080 (debug_check_no_locks_freed+0x5c0/0x5c0)
> >
> > Can you please share the following:
> >
> > - The output of "grep ffffffff81598080 System.map"
> > - The .config file
>
>
> Hi,
>
> Thanks for looking at this.
>
> I've attached my config.
> Unfortunately I don't have that System.map/vmlinux, I've rebased my
> kernel and rebuilt it. But here is what after
> debug_check_no_locks_freed how:
>
> vmlinux:ffffffff815956a0 0000000000000467 T lockdep_count_backward_deps
> vmlinux:ffffffff81595b10 00000000000005b6 T debug_check_no_locks_freed
> vmlinux:ffffffff815960d0 0000000000003428 t __lock_acquire
>
>
> But If you suspect a corruption, then I would suggest to stop spending
> more time on this now. At least until I have more data.
> This happened while I stress tested KVM, and I've seen a low rate of
> what looks like random memory corruptions (e.g. corrupted slab
> freelists and other unexplainable things). So let's write it off onto
> KVM for now.

Ok. But there are some interesting clues there, so if you see it again,
please provide the System.map and .config and I'll take another look.

What really has me confused is why the worker_thread() stack seems so
big. According to the stack dumps, worker_thread() had a stack size of
1368 bytes, but when I build with your .config, it's only 208 bytes.

Other interesting features of this corruption:

- always after kthread() called worker_thread()
- always the same value written, presumably the address of
__lock_acquire()
- always to the same offset on the stack (the location of the frame
pointer pushed by worker_thread())

... so it's not exactly random memory corruption. In fact it looks like
"normal" stack corruption. But then again, KASAN is enabled, which
should catch most cases of stack corruption (buffer overflows).

--
Josh