Re: [XFS on bad superblock] BUG: unable to handle kernel NULLpointer dereference at 00000003

From: Dave Chinner
Date: Wed Oct 09 2013 - 23:15:34 EST

Next message: Fengguang Wu: "Re: [XFS on bad superblock] BUG: unable to handle kernel NULLpointer dereference at 00000003"
Previous message: Linus Torvalds: "Re: [PATCH v8 0/9] rwsem performance optimizations"
In reply to: Fengguang Wu: "Re: [XFS on bad superblock] BUG: unable to handle kernel NULLpointer dereference at 00000003"
Next in thread: Fengguang Wu: "Re: [XFS on bad superblock] BUG: unable to handle kernel NULLpointer dereference at 00000003"

On Thu, Oct 10, 2013 at 09:41:17AM +0800, Fengguang Wu wrote:
> On Thu, Oct 10, 2013 at 09:16:40AM +0800, Fengguang Wu wrote:
> > On Thu, Oct 10, 2013 at 11:59:00AM +1100, Dave Chinner wrote:
> > > [add xfs@xxxxxxxxxxx to cc]
> >
> > Thanks.
> >
> > To help debug the problem, I searched for XFS in my tests' oops database
> > and found one kernel that failed 4 times (out of 12 total boots) with
> > basically the same error:
> >
> > 4 BUG: sleeping function called from invalid context at kernel/workqueue.c:2810
> > 1 WARNING: CPU: 1 PID: 372 at lib/debugobjects.c:260 debug_print_object+0x94/0xa2()
> > 1 WARNING: CPU: 1 PID: 360 at lib/debugobjects.c:260 debug_print_object+0x94/0xa2()
> > 1 WARNING: CPU: 0 PID: 381 at lib/debugobjects.c:260 debug_print_object+0x94/0xa2()
> > 1 WARNING: CPU: 0 PID: 361 at lib/debugobjects.c:260 debug_print_object+0x94/0xa2()
>

Fengguang, I'm having real trouble associating these with the XFS
code path that is seeing the problems. These look like a use after
free or a double free, but that isn't possible in the XFS code paths
that are showing up in the traces.

> And some other messages in an older kernel:
>
> [ 39.004416] F2FS-fs (nbd2): unable to read second superblock
> [ 39.005088] XFS: Assertion failed: read && bp->b_ops, file: fs/xfs/xfs_buf.c, line: 1036

This cannot possibly occur on the superblock read path, as
bp->b_ops in that case is *always* initialised, as is XBF_READ.

So this implies something else has modified the struct xfs_buf.

> [ 41.550471] ------------[ cut here ]------------
> [ 41.550476] WARNING: CPU: 1 PID: 878 at lib/list_debug.c:33 __list_add+0xac/0xc0()
> [ 41.550478] list_add corruption. prev->next should be next (ffff88000f3d7360), but was (null). (prev=ffff880008786a30).

And this is a smoking gun - list corruption...

> [ 41.550481] CPU: 1 PID: 878 Comm: mount Not tainted 3.11.0-rc1-00667-gf70eb07 #64
> [ 41.550482] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [ 41.550485] 0000000000000009 ffff880007d6fb08 ffffffff824044a1 ffff880007d6fb50
> [ 41.550488] ffff880007d6fb40 ffffffff8109a0a8 ffff880007c6b530 ffff88000f3d7360
> [ 41.550491] ffff880008786a30 0000000000000007 0000000000000000 ffff880007d6fba0
> [ 41.550491] Call Trace:
> [ 41.550499] [<ffffffff824044a1>] dump_stack+0x4e/0x82
> [ 41.550503] [<ffffffff8109a0a8>] warn_slowpath_common+0x78/0xa0
> [ 41.550505] [<ffffffff8109a14c>] warn_slowpath_fmt+0x4c/0x50
> [ 41.550509] [<ffffffff81101359>] ? get_lock_stats+0x19/0x60
> [ 41.550511] [<ffffffff8163434c>] __list_add+0xac/0xc0
> [ 41.550515] [<ffffffff810ba453>] insert_work+0x43/0xa0
> [ 41.550518] [<ffffffff810bb22b>] __queue_work+0x11b/0x510
> [ 41.550520] [<ffffffff810bb936>] queue_work_on+0x96/0xa0
> [ 41.550526] [<ffffffff813d2096>] ? _xfs_buf_ioend.constprop.15+0x26/0x30
> [ 41.550529] [<ffffffff813d1f6c>] xfs_buf_ioend+0x15c/0x260

... in the workqueue code on a work item in the struct xfs_buf ...

> [ 41.550531] [<ffffffff813d2f92>] ? xfsbdstrat+0x22/0x170
> [ 41.550534] [<ffffffff813d2096>] _xfs_buf_ioend.constprop.15+0x26/0x30
> [ 41.550537] [<ffffffff813d2873>] xfs_buf_iorequest+0x73/0x1a0
> [ 41.550539] [<ffffffff813d2f92>] xfsbdstrat+0x22/0x170
> [ 41.550542] [<ffffffff813d3832>] xfs_buf_read_uncached+0x72/0xa0
> [ 41.550546] [<ffffffff81445846>] xfs_readsb+0x176/0x250

... in the very context that we allocated the struct xfs_buf. It's
not a use after free or memory corruption caused by XFS you are
seeing here.
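
For anyone following along, the check that fired in __list_add can be
sketched in user space roughly as below. This is a simplified
illustration under my own assumptions, not the code in
lib/list_debug.c: list_add_checked(), list_add_tail_checked() and
corruption_is_caught() are hypothetical names, and the stray write is
simulated by hand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Minimal doubly linked list node, modelled on the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/*
 * Hypothetical helper: insert "entry" between "prev" and "next", but refuse
 * to touch the list if the two neighbours no longer point at each other.
 * Returns false when corruption is detected.
 */
static bool list_add_checked(struct list_head *entry,
                             struct list_head *prev,
                             struct list_head *next)
{
    if (prev->next != next || next->prev != prev) {
        fprintf(stderr, "list_add corruption: prev->next is %p, expected %p\n",
                (void *)prev->next, (void *)next);
        return false;
    }
    next->prev = entry;
    entry->next = next;
    entry->prev = prev;
    prev->next = entry;
    return true;
}

/* Append "entry" at the tail of the list headed by "head". */
static bool list_add_tail_checked(struct list_head *entry,
                                  struct list_head *head)
{
    return list_add_checked(entry, head->prev, head);
}

/* Returns 1 if a simulated stray write to a queued node is detected. */
static int corruption_is_caught(void)
{
    struct list_head head, a, b;
    bool ok;

    list_init(&head);
    ok = list_add_tail_checked(&a, &head);
    assert(ok);
    (void)ok;

    a.next = NULL;  /* something else scribbles over the queued node */

    /* The next insertion sees prev->next == NULL instead of &head. */
    return !list_add_tail_checked(&b, &head);
}
```

The point is that the check only reports the damage at the next
insertion. It says nothing about who zeroed the pointer.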

I note that you have CONFIG_SLUB=y, which means that slab caches of
compatible size can be merged, so the slabs holding struct xfs_buf
may be shared with objects of other types. That means that the memory
corruption problem is likely to be caused by one of the other
filesystems that are probing the block device(s), not XFS.
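
To illustrate why sharing slabs matters, here is a toy user-space
model of how a stale write from one object type can clobber another
type that has reused the same slot. It is only a sketch under my own
assumptions: the pool is a trivial LIFO free list, not SLUB, and
struct fake_buf, struct other_obj, pool_alloc(), pool_free() and
stale_write_corrupts_buf() are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Two unrelated object types that fall into the same size class. */
struct fake_buf  { void *work_next; void *ops; };          /* stands in for a struct xfs_buf */
struct other_obj { char payload[2 * sizeof(void *)]; };    /* some other subsystem's object */

/* One slot can hold either type, or a free-list link while unused. */
union slot {
    struct fake_buf buf;
    struct other_obj other;
    union slot *free_next;
};

#define NSLOTS 4
static union slot pool[NSLOTS];
static union slot *free_list;

static void pool_init(void)
{
    free_list = NULL;
    for (int i = NSLOTS - 1; i >= 0; i--) {
        pool[i].free_next = free_list;
        free_list = &pool[i];
    }
}

/* Pop the most recently freed slot (LIFO reuse). */
static void *pool_alloc(void)
{
    union slot *s = free_list;

    if (s)
        free_list = s->free_next;
    return s;
}

static void pool_free(void *p)
{
    union slot *s = p;

    s->free_next = free_list;
    free_list = s;
}

/* Returns 1 if a use after free of the *other* object zeroed the buffer. */
static int stale_write_corrupts_buf(void)
{
    static int dummy_ops;

    pool_init();

    struct other_obj *stale = pool_alloc();
    pool_free(stale);                    /* bug: the owner keeps the pointer */

    struct fake_buf *bp = pool_alloc();  /* the freed slot is handed out again */
    assert((void *)bp == (void *)stale);
    bp->work_next = bp;
    bp->ops = &dummy_ops;

    memset(stale, 0, sizeof(*stale));    /* stale write by the other subsystem */

    return bp->ops == NULL && bp->work_next == NULL;
}
```

In this model the buffer's owner never does anything wrong, yet its
ops pointer and list pointer both end up NULL, which is the same shape
of damage as the failed assertion and the list_add warning above.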

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
