Problem: Oops with NFS-root in 2.0.36

Brian Aagaard Petersen (aagaard@alf.nbi.dk)
Sat, 19 Jun 1999 17:08:42 +0200 (MET DST)


Hi,

I am having problems with random crashes when using NFS-root with kernel 2.0.36.

[2.] Full description of the problem:
We have several machines which use "etherboot" from a floppy
to boot via NFS. The machines are not actually diskless: the client root
filesystem is exported read-only from the server, and /var is a link to the
local disk. This seems to work fine, but we keep seeing an oops now and then.
Nothing is reproducible, except that it always occurs during NFS access.
Today I was able to crash 3 different machines in a row with a simple "scp"
command, but now they don't want to crash...

One thing that might confuse the kernel is the fact that a directory is
mounted twice via NFS: the client root filesystem lives on a server disk
under /disk/farm/ together with a lot of other stuff, and /disk is also
NFS-mounted, so we have something like this:

/dev/root 16244805 15381058 863747 95% /
server:/disk 16244805 15381058 863747 95% /disk

where /dev/root is an NFS mount of /nfs/root.
Is this allowed?

[4.] Kernel version (from /proc/version):
Linux version 2.0.36 (root@xxx) (gcc version 2.7.2.3) #1 Wed Dec 30 13:24:40 MET 1998
[5.] Output of Oops.. message

This one happened when running the command:
scp /disk/other/largefile /var/tmp/
Notice I forgot to prepend the server name, so it just copies from an
NFS-mounted drive to the local disk; /var is a symbolic link on the
NFS-root-mounted drive pointing to the local disk /local.

dmesg:
general protection: 0000
CPU: 0
EIP: 0010:[<00125c04>]
EFLAGS: 00010206
eax: 5f657469 ebx: 00000302 ecx: 00fe9400 edx: 00000b43
esi: 00000302 edi: 00032841 ebp: 00000400 esp: 0106bd78
ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
Process cp (pid: 1024, process nr: 30, stackpage=0106b000)
Stack: 00000847 001e0302 00000400 00032841 001e0008 00000b43 00126466 00000302
00032841 00000400 00000847 001e9c20 00000007 00032841 00000001 00157606
00000302 00032841 00000400 00032841 0106bef4 005013f8 001e9c20 00118bff
Call Trace: [<00126466>] [<00157606>] [<00118bff>] [<00159780>] [<00159c37>] [<00159cd4>] [<00159f5d>]
[<0015828d>] [<001655ec>] [<04000000>] [<00123eff>] [<0010ad4d>]
Code: 39 38 75 24 66 39 58 04 75 1e 39 68 20 74 22 56 e8 17 f9 ff

ksymoops:
>>EIP: 125c04 <get_hash_table+34/ac>
Trace: 126466 <getblk+36/3bc>
Trace: 157606 <ext2_new_block+866/984>
Trace: 118bff <do_bottom_half+3b/60>
Trace: 159780 <ext2_alloc_block+18c/19c>
Trace: 159c37 <block_getblk+bf/264>
Trace: 159cd4 <block_getblk+15c/264>
Trace: 159f5d <ext2_getblk+181/22c>
Trace: 15828d <ext2_file_write+185/45c>
Trace: 1655ec <nfs_file_read+a4/b0>
Trace: 4000000
Trace: 123eff <sys_write+153/18c>
Trace: 10ad4d <system_call+55/7c>

Code: 125c04 <get_hash_table+34/ac>
Code: 125c04 <get_hash_table+34/ac> 39 38 cmpl %edi,(%eax)
Code: 125c06 <get_hash_table+36/ac> 75 24 jne 125c2c <get_hash_table+5c/ac>
Code: 125c08 <get_hash_table+38/ac> 66 39 58 04 cmpw %bx,0x4(%eax)
Code: 125c0c <get_hash_table+3c/ac> 75 1e jne 125c2c <get_hash_table+5c/ac>
Code: 125c14 <get_hash_table+44/ac> 39 68 20 cmpl %ebp,0x20(%eax)
Code: 125c17 <get_hash_table+47/ac> 74 22 je 125c35 <get_hash_table+65/ac>
Code: 125c19 <get_hash_table+49/ac> 56 pushl %esi
Code: 125c1a <get_hash_table+4a/ac> e8 17 f9 ff 00 call fff92c <_EIP+0xfff92c>
Code: 125c25 <get_hash_table+55/ac> 90 nop
Code: 125c26 <get_hash_table+56/ac> 90 nop
Code: 125c27 <get_hash_table+57/ac> 90 nop

Here is another one, apparently from a crash initiated by syslogd:
dmesg:
general protection: 0000
CPU: 0
EIP: 0010:[<00125c04>]
EFLAGS: 00010286
eax: 9000ffff ebx: 00520302 ecx: 0000030c edx: 00000b45
esi: 00000302 edi: 00010847 ebp: 00000400 esp: 004c8cc0
ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
Process syslogd (pid: 326, process nr: 15, stackpage=004c8000)
Stack: 0052bf18 001e0302 00000400 00010847 00000000 00000b45 00126466 00000302
00010847 00000400 0052bf18 001e9c20 00000007 00010847 00000001 00157606
00000302 00010847 00000400 000106af 004c8e50 004aa7f8 001e9c20 0014fe5f
Call Trace: [<00126466>] [<00157606>] [<0014fe5f>] [<00159780>] [<00126466>] [<00159cd4>] [<00159ee0>]
[<0015828d>] [<001762ca>] [<001764a5>] [<00174eee>] [<001797b8>] [<0017948c>] [<001762ca>] [<0012423c>]
[<00158108>] [<00126b6e>] [<0015a6a9>] [<0012434f>] [<0010ad4d>]
Code: 39 38 75 24 66 39 58 04 75 1e 39 68 20 74 22 56 e8 17 f9 ff
ksymoops:
>>EIP: 125c04 <get_hash_table+34/ac>
Trace: 126466 <getblk+36/3bc>
Trace: 157606 <ext2_new_block+866/984>
Trace: 14fe5f <udp_rcv+3bb/3d0>
Trace: 159780 <ext2_alloc_block+18c/19c>
Trace: 126466 <getblk+36/3bc>
Trace: 159cd4 <block_getblk+15c/264>
Trace: 159ee0 <ext2_getblk+104/22c>
Trace: 15828d <ext2_file_write+185/45c>
Trace: 1762ca <ide_do_request+1ba/66c>
Trace: 1764a5 <ide_do_request+395/66c>
Trace: 174eee <ide_set_handler+26/2c>
Trace: 1797b8 <triton_dmaproc+110/150>
Trace: 17948c <dma_intr>
Trace: 1762ca <ide_do_request+1ba/66c>
Trace: 12423c <do_readv_writev+234/274>
Trace: 15828d <ext2_file_write+185/45c>
Trace: 126b6e <__brelse+22/44>
Trace: 15a6a9 <ext2_update_inode+2cd/2dc>
Trace: 12434f <sys_writev+77/a8>
Trace: 10ad4d <system_call+55/7c>

Code: 125c04 <get_hash_table+34/ac>
Code: 125c04 <get_hash_table+34/ac> 39 38 cmpl %edi,(%eax)
Code: 125c06 <get_hash_table+36/ac> 75 24 jne 125c2c <get_hash_table+5c/ac>
Code: 125c08 <get_hash_table+38/ac> 66 39 58 04 cmpw %bx,0x4(%eax)
Code: 125c0c <get_hash_table+3c/ac> 75 1e jne 125c2c <get_hash_table+5c/ac>
Code: 125c14 <get_hash_table+44/ac> 39 68 20 cmpl %ebp,0x20(%eax)
Code: 125c17 <get_hash_table+47/ac> 74 22 je 125c35 <get_hash_table+65/ac>
Code: 125c19 <get_hash_table+49/ac> 56 pushl %esi
Code: 125c1a <get_hash_table+4a/ac> e8 17 f9 ff 00 call fff92c <_EIP+0xfff92c>
Code: 125c25 <get_hash_table+55/ac> 90 nop
Code: 125c26 <get_hash_table+56/ac> 90 nop
Code: 125c27 <get_hash_table+57/ac> 90 nop

The routine which crashes is "find_buffer" in "fs/buffer.c"; the buffer
hash_table seems to be corrupted with random junk, since %eax contains a
different value in each oops. Could it be due to a lost connection to the NFS
server, which then screws something up?
Because of the heavy traffic on our network we see a lot of messages like
these:
NFS server 172.16.0.1 not responding, still trying.
NFS server 172.16.0.1 OK.
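
For reference, here is a minimal user-space sketch of the hash-chain walk
that faults. It only mirrors the 2.0 fs/buffer.c logic from memory (struct
layout, hash function and names are simplified), so don't take it as the
exact source. If a b_next pointer in a chain has been overwritten with junk
(e.g. an ASCII-looking value like the 0x5f657469 in %eax above), the very
first compare dereferences a wild pointer and the kernel faults:

/*
 * Minimal user-space sketch of the buffer hash lookup that oopses.
 * This only approximates the 2.0 fs/buffer.c logic; layout, hash
 * function and names are simplified for illustration.
 */
#include <stdio.h>

struct buffer_head {
	unsigned long b_blocknr;	/* first compare in the oops: cmpl %edi,(%eax)  */
	unsigned short b_dev;		/* then the device:           cmpw %bx,0x4(%eax) */
	unsigned long b_size;		/* then the size:             cmpl %ebp,0x20(%eax) */
	struct buffer_head *b_next;	/* hash chain link */
};

#define NR_HASH 613
static struct buffer_head *hash_table[NR_HASH];

static struct buffer_head *find_buffer(unsigned short dev,
				       unsigned long block, unsigned long size)
{
	struct buffer_head *tmp;

	/* Walk the chain for this (dev,block) bucket.  If any b_next has
	 * been overwritten with junk, the b_blocknr compare dereferences
	 * a wild pointer and we fault right here.                       */
	for (tmp = hash_table[(dev ^ block) % NR_HASH]; tmp; tmp = tmp->b_next)
		if (tmp->b_blocknr == block && tmp->b_dev == dev &&
		    tmp->b_size == size)
			return tmp;
	return NULL;
}

int main(void)
{
	/* With empty chains the lookup simply returns NULL. */
	printf("%p\n", (void *) find_buffer(0x0302, 0x32841, 1024));
	return 0;
}

The three compares match the cmpl/cmpw/cmpl sequence in the "Code:"
disassembly above, and the fault is on the first of them, which is why I
think it is the chain pointers themselves that get trashed rather than the
buffer contents.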

[7.] Environment
[7.1.] Software
Red Hat 5.2
[7.2.] Processor information
PPro 200, P2 400, P2 450. No SMP machines
[7.5.] Other information that might be relevant to the problem
NIC: eepro100
[X.] Other notes, patches, fixes, workarounds:
I haven't yet tried 2.0.37, and 2.2.x will have to wait a bit.

Cheers,
Brian

P.S.
Please cc: any answers to me as I am not on the mailing list.

--
Brian Aagaard Petersen             Email: aagaard@nbi.dk
High Energy Physics                Phone: +45 35 325 271
Niels Bohr Institute
