Re: NFS oops in 2.6.26rc4

From: Chuck Lever
Date: Wed Jun 04 2008 - 15:13:53 EST


On Wed, Jun 4, 2008 at 2:20 PM, Dave Jones <davej@xxxxxxxxxx> wrote:
> On Wed, Jun 04, 2008 at 02:13:08PM -0400, Chuck Lever wrote:
> >
> > On Jun 4, 2008, at 10:19 AM, Dave Jones wrote:
> >
> > > On Fri, May 30, 2008 at 03:37:01PM -0400, Chuck Lever wrote:
> > >
> > >>> Something else of note which I hadn't seen before, usually things
> > >>> lock up just after that first oops. For some reason, today it survived
> > >>> a little longer, but things really went downhill fast.
> > >>> It survived a 'dmesg ; scp dmesg davej@gelk', and then wedged solid.
> > >>> So as well as the oops, it seems we're corrupting memory too.
> > >>> For reference, this kernel has both SLUB_DEBUG and PAGEALLOC_DEBUG
> > >>> enabled.
> > >>
> > >> I haven't seen this kind of problem here with .26, but yes, it does
> > >> look like something is clobbering memory during an NFS mount.
> > >>
> > >> I introduced some NFS mount parsing changes in this commit range:
> > >>
> > >> 2d767432..82d101d5
> > >>
> > >> A quick bisect should show which, if any of these, is the guilty
> > >> party. If any of these are the problem, I suspect it's 3f8400d1.
> > >
> > > I didn't get time to try this out yet (hopefully tomorrow).
> > > In the meantime, we've just gotten word of another user seeing memory
> > > corruption with nfs - https://bugzilla.redhat.com/show_bug.cgi?id=449958
> >
> > 449958 could very well be the same problem. The stack traceback is a
> > lot cleaner than the one you originally sent, but there are a lot of
> > similarities. (I doubt this is related to symlinks, as the comment
> > suggests).
> >
> > Is commit 86d61d863 applied to the current rawhide kernel?
>
> That kernel was .26rc4.git2, so unless it's only gone in in the last day
> or two, yes. (Bandwidth impaired right now, and no local git repo to check)

Argh, I was afraid of that. I expected that commit to improve things.
Maybe it did, but this is a different problem? You're going to force
me to actually think about this. :-)

In any event, a bisect would be helpful here, when you can. I will
also stare at the traceback in 449958 and see if anything new jumps
out. That traceback certainly takes the heat off the NFS client; it looks
like an rpcbind issue.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com