Trying Again: RAID + Network == OOPS

Derek Glidden (dglidden@illusionary.com)
Sat, 16 Oct 1999 22:12:43 -0400


Let's try this again; I've posted the problem a couple times now with no
real feedback. As far as I can tell, it's a legitimate problem, so I'll
keep trying...

My file server works fine as long as there is no RAID and Network
activity happening simultaneously; NFS or other network activity to the
RAID device seems to cause crashes. Primarily the NFS processes cause
OOPSes, although I've also seen an 'ftpd' OOPS and screw up the
machine. The RAID device is RAID0.

I say it seems to work OK without network access because running fsck
against the RAID device takes a good hour of grinding all three drives
simultaneously and I have yet to see it crash doing that. I'm also able
to back up the RAID volume to tape consistently without problems.
Network traffic to non-RAID partitions don't seem to cause problems, but
it might just be that I haven't seen any problems yet. A large amount
of traffic over the network to the RAID device, either from one machine
seriously pounding it, or several machines accessing it simultaneously,
will eventually cause an OOPS. It doesn't seem to matter if it's read
or write activity to cause problems. (Strangely, however, running large
mirror or rsync jobs from the local console that go out over the network
and suck data down onto the RAID partition don't seem to give it
problems. Only when other machines come into the box to access it
remotely does it consistently OOPS.)

It's a machine based on RedHat 6.0 with all appropriate patches, but
I've replaced the kernel many times, including updating to the latest
knfsd patches and even dumping knfsd entirely and replacing with the
"old-fashioned" nfs-server 2.2beta46. It's a P200MMX with 64MB of RAM
and a Promise Ultra33 PDC20246 PCI IDE controller that has three Maxtor
IDE drives (three 91303D6) being used as a 36GB RAID array. It also has
an Intel PCI EEPro 10/100 running at 100Mbps, an old ISA ET4000 VGA
card, and a PCI Adaptec AIC-7850 SCSI card controlling a 4GB DAT. I've
already replaced motherboard, CPU and RAM and the crashes continue to
happen.

At the same time I added the Promise controller and three drives to
create the RAID device, I re-installed the machine with RedHat 6.0.
Previous to that it had been running an older RedHat (5.0? 5.1?) that
was 2.0 kernel and glibc 2.0 based and very stable. This makes me
suspect that the problems are caused by some interaction between the
network and either the Promise driver or the RAID driver, since, as I've
said, without network activity, the machine is solid, and without
accessing the RAID device over the network, the machine is solid.

According to /proc/interrupts, there are no IRQ conflicts with any of
the devices on the machine.

I think I've tried just about every version of the kernel from 2.2.5 on,
including most of Alan's 'ac' patches, and nothing helps. These latest
OOPSes happened without knfsd patches applied, but with latest RAID
patches on an otherwise "stock" 2.2.13pre17 kernel. If anything, it
seems more recent kernel versions might have more frequent problems, but
that might just be because I've been trying harder lately to try to
pinpoint the problem and therefore causing more problems.

Here is what "ksymoops" translated from the last two OOPS messages, the
second of which immediately followed the first:

ksymoops 0.7c on i586 2.2.13pre17. Options used
-V (specified)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.2.13pre17/ (default)
-m /usr/src/linux/System.map (default)

EIP: %04x:[<%08lx>] ESP: %04x:%08lx EFLAGS: %08lx
EAX: %08lx EBX: %08lx ECX: %08lx EDX: %08lx
ESI: %08lx EDI: %08lx EBP: %08lx DS: %04x ES: %04x
CR0: %08lx CR2: %08lx CR3: %08lx
EIP: %04x:[<%08lx>]
EFLAGS: %08lx
eax: %08lx ebx: %08lx ecx: %08lx edx: %08lx
esi: %08lx edi: %08lx ebp: %08lx esp: %08lx
ds: %04x es: %04x ss: %04x
Process %s (pid: %d, process nr: %d, stackpage=%08lx)
Stack:
Call Trace: [<%08lx>]
Code: %02x
Ok<1>Unable to handle kernel NULL pointer dereference<1>Unable to handle
kernel paging request at virtual address %08lx
<1>current->tss.cr3 = %08lx, %%cr3 = %08lx
<1>*pde = %08lx
<0>Kernel panic: %s
<0>In swapper task - not syncing
Aiee, killing interrupt handler
<3>iput: Aieee, semaphore in use inode %s/%ld, count=%d
<3>iput: Aieee, atomic write semaphore in use inode %s/%ld, count=%d
exp_rootfh: Aieee, NULL dentry
exp_rootfh: Aieee, NULL d_inode
exp_rootfh: Aieee, ino/dev mismatch
Internal error %s %d
Internal error in file %s, line %d.
IRQ: %d
internal error: cmd=%02x != %02x=(vdsp[0] >> 24)
Unable to handle kernel paging request at virtual address de142e65
current->tss.cr3 = 03685000, %cr3 = 03685000
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[udp_sendmsg+798/812]
EFLAGS: 00010206
eax: c37ef320 ebx: 00000000 ecx: c39018e0 edx: 0000ee82
esi: 00000000 edi: c3579360 ebp: c3689eec esp: c3689e28
ds: 0018 es: 0018 ss: 0018
Process rpc.nfsd (pid: 410, process nr: 18, stackpage=c3689000)
Stack: c016c09c c35e24dc c37ef320 00000000 00000068 c37ef320 1e030108
bb476800
1e0aa8c0 5a0aa8c0 c3689ee0 cd113510 5a0aa8c0 00000000 00000000
c016c11f
c3579360 c3689eec 00000060 c3689f08 c3689eec c014fffc c35e24dc
c3689eec
Call Trace: [inet_sendmsg+0/144] [inet_sendmsg+131/144]
[sock_sendmsg+136/172] [inet_sendmsg+0/144] [sys_sendto+199/236]
[free_wait+99/108] [do_select+509/532]
Code: 8b 44 24 50 5b 5e 5f 5d 83 c4 34 c3 89 f6 53 8b 4c 24 08 8b
Using defaults from ksymoops -t elf32-i386 -a i386

Code; 00000000 Before first symbol
00000000 <_EIP>:
Code; 00000000 Before first symbol
0: 8b 44 24 50 movl 0x50(%esp,1),%eax
Code; 00000004 Before first symbol
4: 5b popl %ebx
Code; 00000005 Before first symbol
5: 5e popl %esi
Code; 00000006 Before first symbol
6: 5f popl %edi
Code; 00000007 Before first symbol
7: 5d popl %ebp
Code; 00000008 Before first symbol
8: 83 c4 34 addl $0x34,%esp
Code; 0000000b Before first symbol
b: c3 ret
Code; 0000000c Before first symbol
c: 89 f6 movl %esi,%esi
Code; 0000000e Before first symbol
e: 53 pushl %ebx
Code; 0000000f Before first symbol
f: 8b 4c 24 08 movl 0x8(%esp,1),%ecx
Code; 00000013 Before first symbol
13: 8b 00 movl (%eax),%eax

ksymoops 0.7c on i586 2.2.13pre17. Options used
-V (specified)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.2.13pre17/ (default)
-m /usr/src/linux/System.map (default)

EIP: %04x:[<%08lx>] ESP: %04x:%08lx EFLAGS: %08lx
EAX: %08lx EBX: %08lx ECX: %08lx EDX: %08lx
ESI: %08lx EDI: %08lx EBP: %08lx DS: %04x ES: %04x
CR0: %08lx CR2: %08lx CR3: %08lx
EIP: %04x:[<%08lx>]
EFLAGS: %08lx
eax: %08lx ebx: %08lx ecx: %08lx edx: %08lx
esi: %08lx edi: %08lx ebp: %08lx esp: %08lx
ds: %04x es: %04x ss: %04x
Process %s (pid: %d, process nr: %d, stackpage=%08lx)
Stack:
Call Trace: [<%08lx>]
Code: %02x
Ok<1>Unable to handle kernel NULL pointer dereference<1>Unable to handle
kernel paging request at virtual address %08lx
<1>current->tss.cr3 = %08lx, %%cr3 = %08lx
<1>*pde = %08lx
<0>Kernel panic: %s
<0>In swapper task - not syncing
Aiee, killing interrupt handler
<3>iput: Aieee, semaphore in use inode %s/%ld, count=%d
<3>iput: Aieee, atomic write semaphore in use inode %s/%ld, count=%d
exp_rootfh: Aieee, NULL dentry
exp_rootfh: Aieee, NULL d_inode
exp_rootfh: Aieee, ino/dev mismatch
Internal error %s %d
Internal error in file %s, line %d.
IRQ: %d
internal error: cmd=%02x != %02x=(vdsp[0] >> 24)
Unable to handle kernel paging request at virtual address 8b0c7536
current->tss.cr3 = 00101000, %cr3 = 00101000
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0010:[remove_from_queues+169/328]
EFLAGS: 00010282
eax: 8b0c7502 ebx: 00000000 ecx: c2b85de0 edx: c3fbe388
esi: c2b85de0 edi: 00000002 ebp: 00000008 esp: c3fe7fac
ds: 0018 es: 0018 ss: 0018
Process kupdate (pid: 3, process nr: 3, stackpage=c3fe7000)
Stack: 00000002 c012437d c2b85de0 c2b85480 000000e3 00000001 c0125285
c2b85de0
c3fe6000 c01db04b c3fe61c2 00001000 c2b85de0 c01256c8 00000f00
c3ff9fc0
c0106000 c010651f 00000000 00000f00 c0221fd8
Call Trace: [refile_buffer+77/184] [sync_old_buffers+149/400]
[tvecs+11243/12832] [kupdate+112/116] [get_options+0/116]
[kernel_thread+35/48]
Code: 89 50 34 c7 01 00 00 00 00 89 02 c7 41 34 00 00 00 00 ff 0d
Using defaults from ksymoops -t elf32-i386 -a i386

Code; 00000000 Before first symbol
00000000 <_EIP>:
Code; 00000000 Before first symbol
0: 89 50 34 movl %edx,0x34(%eax)
Code; 00000003 Before first symbol
3: c7 01 00 00 00 00 movl $0x0,(%ecx)
Code; 00000009 Before first symbol
9: 89 02 movl %eax,(%edx)
Code; 0000000b Before first symbol
b: c7 41 34 00 00 00 00 movl $0x0,0x34(%ecx)
Code; 00000012 Before first symbol
12: ff 0d 00 00 00 00 decl 0x0


-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
With Microsoft products, failure is not Derek Glidden
an option - it's a standard component. http://3dlinux.org/
Choose your life. Choose your http://www.tbcpc.org/
future. Choose Linux. http://www.illusionary.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/