*OOPS* w/ NFS on 2.1.43

Taner Halicioglu (taner@isi.net)
Wed, 2 Jul 1997 18:15:50 -0700 (PDT)


OK, I sent a message to this list earlier about NFS problems (NFS client,
not server) w/ Linux 2.1.43. While doing some more testing, I simply ran
about 50 simultaneous ls -lR's in an NFS-mounted dir and got this oops:

Jul 2 17:48:31 bigjohnson kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000004
Jul 2 17:48:31 bigjohnson kernel: current->tss.cr3 = 042f3000, %cr3 = 042f3000
Jul 2 17:48:31 bigjohnson kernel: *pde = 00000000
Jul 2 17:48:31 bigjohnson kernel: Oops: 0002
Jul 2 17:48:31 bigjohnson kernel: CPU: 0
Jul 2 17:48:31 bigjohnson kernel: EIP: 0010:[<c018ae7e>]
Jul 2 17:48:31 bigjohnson kernel: EFLAGS: 00010286
Jul 2 17:48:31 bigjohnson kernel: eax: 00000000 ebx: c008d050 ecx: cd2d5e30 edx: cd2d5e30
Jul 2 17:48:31 bigjohnson kernel: esi: c008d3e0 edi: c008d000 ebp: cd2d5e30 esp: cd2d5d84
Jul 2 17:48:31 bigjohnson kernel: ds: 0018 es: 0018 ss: 0018
Jul 2 17:48:31 bigjohnson kernel: Process ls (pid: 13219, process nr: 43, stackpage=cd2d5000)
Jul 2 17:48:31 bigjohnson kernel: Stack: cd2d5e30 c018a082 c008d050 cd2d5e30 cd2d5e30 cd2d4000 00000000 cd2d4000
Jul 2 17:48:31 bigjohnson kernel: c008d000 c008d37c cd2d5e30 c008d37c c0159c38 c0096aa0 c0188a79 c008d37c
Jul 2 17:48:31 bigjohnson kernel: c000f090 cd2d5ed8 cd2d5e30 c0188ae0 cd2d5e30 cd2d5e30 c018b3e2 cd2d5e30
Jul 2 17:48:31 bigjohnson kernel: Call Trace: [<c018a082>] [<c0159c38>] [<c0188a79>] [<c0188ae0>] [<c018b3e2>] [<c018b564>] [<c018870a>]
Jul 2 17:48:31 bigjohnson kernel: [<c0188ae8>] [<c01595df>] [<c0157095>] [<c0132c67>] [<c0132984>] [<c0156cb0>] [<c01098fa>] [<c010002b>]
Jul 2 17:48:31 bigjohnson kernel: Code: 89 48 04 89 0a 89 51 04 89 01 89 59 24 f6 05 ac 70 1e c0 40
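
For what it's worth, the "Oops: 0002" line can be read bit by bit. Here is a minimal sketch in Python, assuming the standard i386 page-fault error code layout (bit 0 = present, bit 1 = write, bit 2 = user mode). The function name decode_oops_code is my own and not anything from the kernel:

```python
# Bit meanings of the i386 page-fault error code printed as "Oops: NNNN".
PRESENT = 0x1  # 0: page not present, 1: protection violation
WRITE = 0x2    # 0: read access, 1: write access
USER = 0x4     # 0: fault in kernel mode, 1: fault in user mode


def decode_oops_code(code_text):
    """Turn the hex string after 'Oops:' into a short description."""
    code = int(code_text, 16)
    return {
        "cause": "protection violation" if code & PRESENT else "page not present",
        "access": "write" if code & WRITE else "read",
        "mode": "user" if code & USER else "kernel",
    }


print(decode_oops_code("0002"))
```

For 0002 this gives a kernel-mode write to a page that is not present, which fits the "*pde = 00000000" line in the log.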

Decoded, we have:

[root]@bigjohnson:/tmp #./ksymoops System.map < OOPS
Using `System.map' to map addresses to symbols.

>>EIP: c018ae7e <rpc_add_wait_queue+62/9c>
Trace: c018a082 <xprt_transmit+17a/3e4>
Trace: c0159c38 <nfs_xdr_readdirargs>
Trace: c0188a79 <call_encode+c9/f8>
Trace: c0188ae0 <call_transmit+38/40>
Trace: c018b3e2 <__rpc_execute+9a/1d8>
Trace: c018b564 <rpc_execute+44/50>
Trace: c018870a <rpc_do_call+106/138>
Trace: c0188ae8 <call_receive>
Trace: c01595df <nfs_proc_readdir+eb/11c>
Trace: c0157095 <nfs_readdir+3e5/4f0>
Trace: c0132c67 <sys_getdents+19f/3ec>
Trace: c0132984 <filldir>
Trace: c0156cb0 <nfs_readdir>
Trace: c01098fa <system_call+3a/40>
Trace: c010002b <startup_32+2b/b8>

Code: c018ae7e <rpc_add_wait_queue+62/9c> movl %ecx,0x4(%eax)
Code: c018ae81 <rpc_add_wait_queue+65/9c> movl %ecx,(%edx)
Code: c018ae83 <rpc_add_wait_queue+67/9c> movl %edx,0x4(%ecx)
Code: c018ae86 <rpc_add_wait_queue+6a/9c> movl %eax,(%ecx)
Code: c018ae88 <rpc_add_wait_queue+6c/9c> movl %ebx,0x24(%ecx)
Code: c018ae8b <rpc_add_wait_queue+6f/9c> testb $0x40,0xc01e70ac
Code: c018ae92 <rpc_add_wait_queue+76/9c>
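
The decoded output above can be cross-checked by hand. The sketch below is not how ksymoops is implemented; it only assumes the usual idea of picking the last symbol that starts at or below an address. The three start addresses are derived from the trace by subtracting the printed offset from the printed address, and the helper names resolve and effective_address are made up for illustration:

```python
import bisect

# Start addresses derived from the trace above (printed address minus printed
# offset, e.g. 0xc018ae7e - 0x62). A real System.map lists every kernel
# symbol; with only three entries, lookups are meaningful only for addresses
# that really fall inside one of these three functions.
SYMBOLS = sorted([
    (0xc0156cb0, "nfs_readdir"),
    (0xc0189f08, "xprt_transmit"),
    (0xc018ae1c, "rpc_add_wait_queue"),
])
STARTS = [addr for addr, _ in SYMBOLS]


def resolve(addr):
    """Return (symbol, offset) for the last symbol starting at or below addr."""
    i = bisect.bisect_right(STARTS, addr) - 1
    if i < 0:
        raise ValueError("address below first symbol")
    start, name = SYMBOLS[i]
    return name, addr - start


def effective_address(base, displacement):
    """Address touched by an operand of the form displacement(%base)."""
    return (base + displacement) & 0xFFFFFFFF


name, offset = resolve(0xc018ae7e)          # EIP from the oops
print(f"{name}+{offset:x}")
print(hex(effective_address(0x0, 0x4)))     # movl %ecx,0x4(%eax) with eax = 0
```

With eax = 00000000 from the register dump, the first instruction stores to address 4, which is the "virtual address 00000004" the kernel complained about.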

I've included my original email at the bottom of this one, but to recap:

Client:
* Dual PPro 200 - Linux 2.1.43
* 256M RAM
* SCSI controller = AIC-7880
* eth0: Intel EtherExpress Pro 10/100
* mount options: rw, hard, intr

Server:
* 4-proc R10000 SGI Challenge - IRIX 6.2
* 1.2G RAM
* exporting an XLV volume
* all the latest NFS and XLV patches
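
The load that triggered the oops (about 50 simultaneous ls -lR runs on the client) can be sketched roughly as below. This is an illustration, not the exact commands I typed; the function name run_parallel_ls is made up, and the path is whatever directory the NFS volume is mounted on:

```python
import subprocess


def run_parallel_ls(path, count=50):
    """Start `count` copies of `ls -lR path` at once and return their exit codes."""
    procs = [
        subprocess.Popen(
            ["ls", "-lR", path],
            stdout=subprocess.DEVNULL,  # the listing itself is not interesting
            stderr=subprocess.DEVNULL,
        )
        for _ in range(count)
    ]
    # wait() blocks until each child exits; on a hard NFS mount with an
    # unresponsive server that can take a very long time.
    return [p.wait() for p in procs]
```

Calling run_parallel_ls("/mnt/nfs") would start the default 50 listings, where /mnt/nfs is a hypothetical mount point.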

I did notice this, however:

Jul 2 18:03:31 bigjohnson kernel: nfs warning: mount version older than kernel

Where can I find a newer mount? Or should I not worry about this? And how
might I get a more stable NFS mount?

Thanks, again.

-Taner

--
      D. Taner Halicioglu                     taner@isi.net
  Programmer/Engineer/Sysadmin              ISI / GlobalCenter
    Voice: +1 408 543 0313                 Fax: +1 408 541 9878
 PGP Fingerprint: 65 0D 03 A8 26 21 6D B8  23 3A D6 67 23 6E C0 36

---------- Forwarded message ----------
Date: Wed, 2 Jul 1997 00:49:49 -0700 (PDT)
From: Taner Halicioglu <taner@isi.net>
To: Linux kernel list <linux-kernel@vger.rutgers.edu>
Subject: NFS weirdness Linux(2.1.43) + IRIX(6.2)

I'm mounting an NFS partition from an SGI box (an xlv volume) on a dual PPro Linux machine, and way too often I get:

Jul 2 00:40:19 big kernel: nfs_statfs: statfs error = 5
Jul 2 00:41:54 big kernel: nfs_statfs: statfs error = 5
Jul 2 00:42:27 big kernel: nfs: server stats.isi.net not responding, still trying
Jul 2 00:42:37 big kernel: nfs_statfs: statfs error = 5

on the Linux box. Should I be looking at my Linux box (2.1.43) as being in error, or the SGI (Challenge, IRIX 6.2+patches)?

Once I get these messages, any process that was in the middle of reading or writing hangs, and anything that tries to access that NFS partition seems to get an I/O error (df simply pauses for a while, then moves on).
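
The "statfs error = 5" in the log and the I/O error seen by user processes are the same thing, since errno 5 is EIO on Linux. A quick way to check that mapping, as a small sketch (describe_errno is just an illustrative name):

```python
import errno
import os


def describe_errno(code):
    """Return the symbolic name and message text for an errno value."""
    return errno.errorcode[code], os.strerror(code)


# 5 is the value printed by nfs_statfs in the log lines above
name, message = describe_errno(5)
print(name, "-", message)
```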

If I kill off the nfsd and rpc.mountd processes, and then restart them (on the SGI) and then kill any hung processes on the Linux box, things will seem to return to normal after a minute.

Any ideas as to what I should be looking at in order to debug this problem?

-Taner
