Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)

From: Janos Haar
Date: Fri Apr 16 2010 - 04:01:33 EST



----- Original Message -----
From: "Dave Chinner" <david@xxxxxxxxxxxxx>
To: "Janos Haar" <janos.haar@xxxxxxxxxxxx>
Cc: <xiyou.wangcong@xxxxxxxxx>; <linux-kernel@xxxxxxxxxxxxxxx>; <kamezawa.hiroyu@xxxxxxxxxxxxxx>; <linux-mm@xxxxxxxxx>; <xfs@xxxxxxxxxxx>; <axboe@xxxxxxxxx>
Sent: Thursday, April 15, 2010 11:23 AM
Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)


On Thu, Apr 15, 2010 at 09:00:49AM +0200, Janos Haar wrote:
Dave,

The corruption + crash reproduced. (unfortunately)

http://download.netcenter.hu/bughunt/20100413/messages-15

Apr 14 01:06:33 alfa kernel: XFS mounting filesystem sdb2

This was the point after xfs_repair had been run several times.

OK, the inodes that are corrupted are different, so there's still
something funky going on here. I still would suggest replacing the
RAID controller to rule that out as the cause.

News:

(Reminder of the current state:
xfs_repair fixed the fs, then the kernel reported the corruption again and crashed; I wrote the previous letter to report this.)

Yesterday I stopped the service and ran xfs_repair (new version only) on 2 filesystems, but they were clean!
(This suggests to me that the reported corruption was only in memory, or that the kernel repaired it on reboot.)
(XFS debug was turned on before this.)
This morning I have more messages in the syslog, again from sdb2.
At this point, I don't know what to think.

http://download.netcenter.hu/bughunt/20100413/messages-16

Regards,
Janos



FWIW, do you have any other servers with similar h/w, s/w and
workloads? If so, are they seeing problems?

Can you recompile the kernel with CONFIG_XFS_DEBUG enabled and
reboot into it before you repair and remount the filesystem again?
(i.e. so that we know that we have started with a clean filesystem
and the debug kernel) I'm hoping that this will catch the corruption
much sooner, perhaps before it gets to disk. Note that this will
cause the machine to panic when corruption is detected, and it is
much, much more careful about checking in-memory structures, so there
is a CPU overhead involved as well.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
