2.0.33p3: system freeze+oops

Karsten Weiss (karsten@addx.au.s.shuttle.de)
Mon, 15 Dec 1997 01:31:56 +0100 (MET)


Hi kernel hackers!

First of all my setup:

Genuine Intel 486DX4/100
ASUS SP3G
SoundBlaster 16
NE2000 (ISA)

Main memory size: 48 Mbytes
1 GenuineIntel 486 processor
2 16550A serial ports
1 vga+ graphics device
1 keyboard
SCSI devices:
IBM OEM 0662S12
IBM DPES-31080
SONY CD-ROM CDU-8003A
HP HP35480A
PCI bus devices:
VGA compatible device: S3 Inc. Vision 864-P (rev 0).
Non-VGA device: Intel 82378IB (rev 3).
Non-VGA device: NCR 53c810 (rev 1).
Non-VGA device: Intel 82424ZX Saturn (rev 4).

I'm currently using linux 2.0.33p3 (compiled with gcc-2.7.2.1),
libc 5.4.38 and XFree86-S3-3.3.1 on a RedHat 4.2 system (all update
RPMs applied). The machine has been rock-solid for *YEARS* now and
I'm using it several hours each day (see my comment about the RAM
configuration change at the end of this mail).

In the past few weeks, however, I had two full freezes in X
(using either 2.0.31 or a 2.0.31 prepatch - I can't remember exactly).
The freeze NEVER occurred with 2.0.32. Today, though, it happened for a
third time with 2.0.33p3. By "freeze" I mean a complete lock-up.
The system doesn't even reply to pings from my brother's computer. There
was no OOPS and no syslog entry. The only pattern I can see is that I
always had several Netscapes (3.10) running when the freeze happened.
Today it happened for the third time right after a configure run of
the latest gtk+-0.99.0 was finished (and using Netscape).

Right after the third freeze I pressed the reset button. After rebooting
I restarted the gtk+ configure run. This time I was working in the
console and guess what: The system freeze happened again - for the first
time in the console! Nothing else was running at this time. I don't
know if this is the same kind of freeze I had before but anyway here's
what I got:

checking whether build environment is sane... segment not present: 0103
CPU: 0
EIP: 0010:[<0010974c>]
EFLAGS: 00010246
eax: 00000002 ebx: 00008220 ecx: fffffc18 edx: 001b1f5c
esi: 001b1784 edi: 00000000 ebp: 00009000 esp: 001b1738
ds: 0018 es: 0018 fs: 002b gs: 0018 ss: 0018
Process swapper (pid: 0, process nr: 0, stackpage=001af7a8)
Stack: 001b1f5c 0010a845 00000100 00109410 0000001f 001b1784 00000000 00009000
ffffffda 00000018 00000018 00100018 00190018 00000070 001090b7 00000010
00000246 0010927d 00000000 7f6e6547 0009e200 00101ffe 00000000 001aeea8
CallTrace: [<0010a845>] [<00109410>] [<00190018>] [<0010927d>]
Code: 83 3d 94 f7 1a 00 00 74 02 31 db e8 24 88 00 00 eb aa 89 f6
kfree of non-kmalloced memory: 001b17f0, next= 00000000, order=0
kfree of non-kmalloced memory: 001b17e0, next= 00000000, order=0
kfree of non-kmalloced memory: 001b1cf4, next= 00000000, order=0
idle task may not sleep
idle task may not sleep
idle task may not sleep
idle task may not sleep
idle task may not sleep

(I wrote this on a piece of paper and hope that all numbers are correct!)

001096e0 <sys_idle>:
1096e0: 53 pushl %ebx
1096e1: 31 db xorl %ebx,%ebx
1096e3: a1 98 27 1d 00 movl 0x1d2798,%eax
1096e8: 83 78 6c 00 cmpl $0x0,0x6c(%eax)
1096ec: 74 12 je 109700 <sys_idle+20>
1096ee: b8 ff ff ff ff movl $0xffffffff,%eax
1096f3: 5b popl %ebx
1096f4: c3 ret
1096f5: 8d 74 26 00 leal 0x0(%esi,1),%esi
1096f9: 8d bc 27 00 00 leal 0x0(%edi,1),%edi
1096fe: 00 00
109700: c7 40 04 9c ff movl $0xffffff9c,0x4(%eax)
109705: ff ff
109707: 90 nop
109708: 85 db testl %ebx,%ebx
10970a: 75 06 jne 109712 <sys_idle+32>
10970c: 8b 1d 40 23 1b movl 0x1b2340,%ebx
109711: 00
109712: a1 40 23 1b 00 movl 0x1b2340,%eax
109717: 29 d8 subl %ebx,%eax
109719: 83 f8 21 cmpl $0x21,%eax
10971c: 76 12 jbe 109730 <sys_idle+50>
10971e: e8 7d ff ff ff call 1096a0 <hard_idle>
109723: eb 27 jmp 10974c <sys_idle+6c>
109725: 8d 74 26 00 leal 0x0(%esi,1),%esi
109729: 8d bc 27 00 00 leal 0x0(%edi,1),%edi
10972e: 00 00
109730: 80 3d a3 ee 1a cmpb $0x0,0x1aeea3
109735: 00 00
109737: 74 13 je 10974c <sys_idle+6c>
109739: 83 3d c0 e7 1a cmpl $0x0,0x1ae7c0
10973e: 00 00
109740: 75 0a jne 10974c <sys_idle+6c>
109742: 83 3d 94 f7 1a cmpl $0x0,0x1af794
109747: 00 00
109749: 75 0a jne 109755 <sys_idle+75>
10974b: f4 hlt
10974c: 83 3d 94 f7 1a cmpl $0x0,0x1af794
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
109751: 00 00
109753: 74 02 je 109757 <sys_idle+77>
109755: 31 db xorl %ebx,%ebx
109757: e8 24 88 00 00 call 111f80 <schedule>
10975c: eb aa jmp 109708 <sys_idle+28>
10975e: 89 f6 movl %esi,%esi

Here's the kernel source corresponding to the asm code above:

asmlinkage int sys_idle(void)
{
        unsigned long start_idle = 0;

        if (current->pid != 0)
                return -EPERM;
        /* endless idle loop with no priority at all */
        current->counter = -100;
        for (;;)
        {
                /*
                 * We are locked at this point. So we can safely call
                 * the APM bios knowing only one CPU at a time will do
                 * so.
                 */
                if (!start_idle)
                        start_idle = jiffies;
                if (jiffies - start_idle > HARD_IDLE_TIMEOUT)
                {
                        hard_idle();
                }
                else
                {
                        if (hlt_works_ok && !hlt_counter && !need_resched)
                                __asm__("hlt");
                }
!!!!!!!!!->     if (need_resched)
                        start_idle = 0;
                schedule();
        }
}

These are the functions that the CallTrace addresses resolve to:

CallTrace: [<0010a845>] [<00109410>] [<00190018>] [<0010927d>]

0010a845: system_call+0x55 (system_call = 0010a7f0)
00109410: init
00190018: calc_vol+0x68 (calc_vol = 0018ffb0)
0010927d: start_kernel+0x1ad (start_kernel = 001090d0)

I upgraded from 24 to 48 MB some time ago *BEFORE* the freezes happened
for the first time. Could bad SIMMs be the cause of this problem?
Actually, I fear this is the case as there doesn't seem to be an
obvious bug in the above code - at least not at the EIP address.
But why are there "kfree of non-kmalloced memory" messages?

Another observation: I just noticed that there are three remaining
files in /tmp from the configure run just before the freeze (I don't
know if it's from the first or the second configure run):

-rw-r--r-- 1 root root 208 Dec 14 22:36 cca04047.i
-rw-r--r-- 1 root root 1728 Dec 14 22:36 cca04047.s
-rw-r--r-- 1 root root 2108 Dec 14 22:36 cca040471.o

The funny thing is that those files don't contain any code but parts
of e-mails and news postings that I've read before the freeze!

Could this be an indication of buffer cache trashing? Or is this
just the result of the metadata having been written to disk while the
file data was not?

If you need more information feel free to mail me!

Good night,

Karsten Weiss UUCP: karsten@addx.au.s.shuttle.de
>ASK FOR PGP KEY< INTERNET: knweiss@trick.informatik.uni-stuttgart.de