2.1.128 Oops

Andy Higgins (higgins@ns.vvm.com)
Tue, 17 Nov 1998 00:35:52 -0600 (EST)


Hello,

We've been having continuous lockups from 2.0.35-36 through the 2.1.9x
kernels to 2.1.128 on our SMP machines for the past 6-9 months, on dual
PPros, dual PII 200s and now dual PII 400s (the last 2 on Intel boards).
The symptom is a black screen. Lockups occur randomly, sometimes after
2-3 days with no lockup, and their frequency appears to increase with
network traffic, not necessarily with high load.

There are no warnings and nothing of interest in the logs, except TCP
checksum errors and "ip_rt_advice: redirect to x.x.x.x dropped". After
recompiling the kernel with -g and waiting through many black screens of
death, we finally got an oops that has reproduced. Since the BSODs don't
show anything, I can only guess that these occasional oopses _might_ be
related, and that only occasionally can the kernel squeeze out a last
gasp of info before locking up. (2.1.127-128 has oopsed rather than
locked up enough times to lead me to believe that it is just a tad
better at handling whatever situation this may be. Who knows.)

Anyway, here are the stats of the oops on 2.1.128pre1 (latest kernel
attempted):

Unable to Handle Kernel Paging Request at Virtual Address: 60000060
Current->tss.cr3 = 00101000,%cr3=00101000
...
CPU:0
EIP: [<c0151c5c>]
...
aiee killing in handler
kernel panic: attempting to kill idle task in interrupt handler no
syncing..
...
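
The EIP value in the oops is the address that gets handed to gdb for the
source lookup. As a minimal sketch (the function names parse_oops_eip and
format_gdb_list are hypothetical and not part of any kernel tool), pulling
the address out of a line transcribed in the "EIP: [<...>]" form used here
and building the matching gdb command could look like this:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: extract the address from a line such as
 * "EIP: [<c0151c5c>]". Returns 1 and stores the address on success,
 * 0 if the line is not an EIP line or holds no bracketed address. */
int parse_oops_eip(const char *line, unsigned long *addr)
{
    const char *p;

    if (strncmp(line, "EIP:", 4) != 0)
        return 0;
    p = strstr(line, "[<");
    if (p == NULL)
        return 0;
    /* %lx stops at the first non-hex character, here the closing '>'. */
    return sscanf(p + 2, "%lx", addr) == 1;
}

/* Hypothetical helper: build the gdb command that maps the address
 * back to a source line in a kernel image compiled with -g. */
int format_gdb_list(unsigned long addr, char *buf, size_t len)
{
    return snprintf(buf, len, "list *0x%lx", addr);
}
```

Only lines that start with "EIP:" are accepted, so other bracketed
addresses in an oops are not picked up by mistake.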

(gdb) list *0xc0151c5c
0xc0151c5c is in ip_route_output (route.c:1468).
1463
1464 hash = rt_hash_code(daddr, saddr^(oif<<5), tos);
1465
1466 start_bh_atomic();
1467 for (rth=rt_hash_table[hash]; rth; rth=rth->u.rt_next) {
1468 if (rth->key.dst == daddr &&
1469 rth->key.src == saddr &&
1470 rth->key.iif == 0 &&
1471 rth->key.oif == oif &&
1472 #ifndef CONFIG_IP_TRANSPARENT_PROXY
(gdb)
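
For readers unfamiliar with the loop in the listing, here is a simplified
user-space model of a chained hash lookup of the same shape. The struct
layout, TABLE_SIZE and route_lookup are assumptions made for illustration,
not the kernel's definitions, and the locking (start_bh_atomic) is left
out. The loop only tests rth against NULL, so a non-NULL but invalid chain
pointer would be dereferenced by the first comparison. That would be
consistent with a fault reported at line 1468, though the oops alone does
not prove it.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 256          /* hypothetical bucket count */

/* Simplified stand-ins for the route cache key and entry. */
struct route_key {
    uint32_t dst;
    uint32_t src;
    int iif;
    int oif;
};

struct route_entry {
    struct route_entry *next;   /* next entry in the same bucket */
    struct route_key key;
};

/* Walk one bucket and return the entry matching an output route
 * (iif == 0), or NULL if the chain ends without a match. */
const struct route_entry *route_lookup(struct route_entry **table,
                                       unsigned hash, uint32_t daddr,
                                       uint32_t saddr, int oif)
{
    const struct route_entry *rth;

    for (rth = table[hash % TABLE_SIZE]; rth != NULL; rth = rth->next) {
        /* First dereference of rth: a stale or corrupted pointer
         * would fault on this comparison. */
        if (rth->key.dst == daddr &&
            rth->key.src == saddr &&
            rth->key.iif == 0 &&
            rth->key.oif == oif)
            return rth;
    }
    return NULL;
}
```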

Attempts to narrow down the problem (suspecting that the SMP code in the
network drivers is possibly problematic) are as follows:

Network cards (Tulip): Have tried several versions of tulip, with Donald
Becker's latest drivers each time. Same results.

Failing that, have tried the onboard eepro100 (Plus) with
"eepro100.c:v1.06 10/16/98 Donald Becker", which is the current card.

Attempted different SCSI cards:

SCSI card                    Result
BT950, BT94X                 lockups same (no oopses)
AIC7XXX (on board)           same
EATA/DMA (current system):
  EATA/DMA 2.0x: Copyright (C) 1994-1998 Dario Ballabio.
  EATA0: 2.0C, PCI 0xfcf0, IRQ 11, BMST, SG 122, MB 64

Note of interest: we have 4 identical machines on the same physical
Ethernet segment, and the other three experience the same lockups but
much less frequently. One of them runs squid only (512MB SDRAM, 56GB
RAID0 partitions, EATA and EEPro). Load is intense, but it is only one
app, squid (dies once every 3-4 weeks).

Another runs sendmail, web, ftp and FrontPage with considerable load
(and dies roughly once every 2 weeks; black screen, no info in logs).

The third is a backup machine, constantly running tars: heavy disk load,
light network load (dies once every 3-4 weeks or so).

The main one runs authentication, pop and sendmail, and is the primary
mail server with a heavy mail load. This is the one that dies the most,
3-5 times a week. Lately it can't go more than 2 or 3 days during the
week before it dies; 30 minutes later it dies again, repeating two or
three times over about 3-4 hours, and then it calms down.

Motherboards: Intel 440BX, PII 400, 512K cache, 256MB SDRAM

Current config is bare bones: IP with no routing, no aliasing, no
firewalling. Have tried compiling the network drivers both as modules
and built in.

Any help would be appreciated
