Re: 2.1.117 NFS Oops

Jes Sorensen (Jes.Sorensen@cern.ch)
20 Aug 1998 20:35:54 +0200


>>>>> "Trond" == Trond Myklebust <trond.myklebust@fys.uio.no> writes:

>>>>> " " == Achim Oppelt <aoppelt@theorie3.physik.uni-erlangen.de> writes:
>> Hello, we can reproduce a kernel Oops on a dual PPro running
>> 2.1.117 SMP. The program in question dies with a segmentation fault
>> when running in a directory mounted over NFS, it works great when
>> run locally. The relevant

Trond> I'm seeing the same thing on a UP 2.1.117 kernel. My home
Trond> directory is NFS mounted via autofs, and at times causes
Trond> netscape to crash. I cannot reproduce this under 2.1.116.

Actually I think this one looks a bit more like the problem we see
with 2.1.117 and NFS root mounting.

Your code:

>Code: 8b 40 5c movl 0x5c(%eax),%eax
>Code: 8b 40 44 movl 0x44(%eax),%eax
>Code: 50 pushl %eax
>Code: e8 20 db fb ff call fffbdb2c <_EIP+0xfffbdb2c>
>Code: 83 c4 08 addl $0x8,%esp
>Code: e9 04 ff ff ff jmp ffffff18 <_EIP+0xffffff18>

Looks remarkably similar to the place where it crashes for us, except
that the addresses look kinda fishy to me:

0xc0148d4d <nfs_flush_dirty_pages+377>: leal 0x1c(%esp,1),%eax
0xc0148d51 <nfs_flush_dirty_pages+381>: pushl %eax
0xc0148d52 <nfs_flush_dirty_pages+382>: movl 0x78(%ebp),%eax
0xc0148d55 <nfs_flush_dirty_pages+385>: movl 0x5c(%eax),%eax <- BOOM
0xc0148d58 <nfs_flush_dirty_pages+388>: movl 0x44(%eax),%eax
0xc0148d5b <nfs_flush_dirty_pages+391>: pushl %eax
0xc0148d5c <nfs_flush_dirty_pages+392>: call 0xc01726b0 <rpc_clnt_sigunmask>
0xc0148d61 <nfs_flush_dirty_pages+397>: addl $0x8,%esp
0xc0148d64 <nfs_flush_dirty_pages+400>: jmp 0xc0148c18 <nfs_flush_dirty_pages+68>

Which I believe corresponds to the following piece of code from
fs/nfs/write.c:

static inline int
wait_on_write_request(struct nfs_wreq *req)
{
        struct wait_queue wait = { current, NULL };
        struct page *page = req->wb_page;
        int retval;
        sigset_t oldmask;

        rpc_clnt_sigmask(NFS_CLIENT(req->wb_inode), &oldmask);
        add_wait_queue(&page->wait, &wait);
        atomic_inc(&page->count);
        for (;;) {
                current->state = IS_SOFT ? TASK_INTERRUPTIBLE : TASK_UNINTERRUPTIBLE;
                retval = 0;
                if (!PageLocked(page))
                        break;
                retval = -ERESTARTSYS;
                /* IS_SOFT is a timeout item .. */
                if (signalled())
                        break;
                schedule();
        }
        remove_wait_queue(&page->wait, &wait);
        current->state = TASK_RUNNING;
        /* N.B. page may have been unused, so we must use free_page() */
        free_page(page_address(page));
        rpc_clnt_sigunmask(NFS_CLIENT(req->wb_inode), &oldmask);

If I am right (though I am no expert i386 assembly programmer), it
looks like someone clears the inode pointer stored in req->wb_inode
during the schedule(). The NFS_CLIENT macro in the final
rpc_clnt_sigunmask() call would then dereference a NULL or otherwise
bogus inode pointer.
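
To make the suspected sequence concrete, here is a small user-space
sketch. It is not kernel code and not a proposed patch: every mock_*
name and struct layout is a hypothetical stand-in, and I am only
assuming that NFS_CLIENT() amounts to a chain of dependent loads
starting from the inode pointer, as the three movl instructions
suggest. Unlike the real macro, the mock lookup returns NULL instead of
faulting, so the difference between the two orderings can be observed.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; real layouts differ. */
struct mock_client { int id; };
struct mock_super  { struct mock_client *client; };
struct mock_inode  { struct mock_super *sb; };
struct mock_wreq   { struct mock_inode *wb_inode; };

/*
 * Models the macro as two dependent loads starting from the inode.
 * The real macro has no NULL check and would fault on the first load.
 */
static struct mock_client *mock_nfs_client(const struct mock_inode *inode)
{
        if (inode == NULL || inode->sb == NULL)
                return NULL;
        return inode->sb->client;
}

/* Stand-in for whatever clears req->wb_inode while the waiter sleeps. */
static void mock_sleep_and_clear(struct mock_wreq *req)
{
        req->wb_inode = NULL;
}

/* Re-reads req->wb_inode after the sleep, as the quoted function does. */
static struct mock_client *client_reread_after_sleep(struct mock_wreq *req)
{
        mock_sleep_and_clear(req);
        return mock_nfs_client(req->wb_inode);
}

/* Looks the client up once before the sleep and reuses that pointer. */
static struct mock_client *client_cached_before_sleep(struct mock_wreq *req)
{
        struct mock_client *clnt = mock_nfs_client(req->wb_inode);

        mock_sleep_and_clear(req);
        return clnt;
}
```

With a fully populated request, client_reread_after_sleep() comes back
with NULL while client_cached_before_sleep() still returns the client.
Caching the pointer only sidesteps the second lookup; it says nothing
about whether the inode or client is still valid by then, which is the
real question here.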

I have always tried to stay away from the fs code, so I wouldn't know
how this can happen or whether the rpc_clnt_sig*mask() calls are used
correctly (they were introduced somewhere between 2.1.114 and
2.1.117).

Jes

