Re: 2.0.33p3: system freeze+oops

Matthew Hyne (mjhyne@cleveland.com.au)
Mon, 15 Dec 1997 14:37:13 +1000


If you have power management (aka APM) turned on, turn it off and see how
you go... It looks like the APM code may be breaking your system.

Matt

At 01:31 15/12/97 +0100, Karsten Weiss wrote:
>Hi kernel hackers!
>
>First of all my setup:
>
>Genuine Intel 486DX4/100
>ASUS SP3G
>SoundBlaster 16
>NE2000 (ISA)
>
>Main memory size: 48 Mbytes
>1 GenuineIntel 486 processor
>2 16550A serial ports
>1 vga+ graphics device
>1 keyboard
>SCSI devices:
> IBM OEM 0662S12
> IBM DPES-31080
> SONY CD-ROM CDU-8003A
> HP HP35480A
>PCI bus devices:
> VGA compatible device: S3 Inc. Vision 864-P (rev 0).
> Non-VGA device: Intel 82378IB (rev 3).
> Non-VGA device: NCR 53c810 (rev 1).
> Non-VGA device: Intel 82424ZX Saturn (rev 4).
>
>I'm currently using linux 2.0.33p3 (compiled with gcc-2.7.2.1),
>libc 5.4.38 and XFree86-S3-3.3.1 on a RedHat 4.2 system (all update
>RPMs applied). The machine has been rock-solid for *YEARS* now and
>I'm using it several hours each day (see my comment about the RAM
>configuration change at the end of this mail).
>
>In the past few weeks, however, I had two full freezes in X
>(using either 2.0.31 or a 2.0.31 prepatch - I can't remember exactly).
>The freeze NEVER occurred with 2.0.32. Today, though, it happened for a
>third time with 2.0.33p3. By "freeze" I mean a complete lock-up.
>The system doesn't even reply to pings from my brother's computer. There
>was no OOPS and no syslog entry. The only pattern I can see is that I
>always had several Netscapes (3.10) running when the freeze happened.
>Today it happened for the third time right after a configure run of
>the latest gtk+-0.99.0 was finished (and using Netscape).
>
>Right after the third freeze I pressed the reset button. After rebooting
>I restarted the gtk+ configure run. This time I was working in the
>console and guess what: The system freeze happened again - for the first
>time in the console! Nothing else was running at this time. I don't
>know if this is the same kind of freeze I had before but anyway here's
>what I got:
>
>checking whether build environment is sane... segment not present: 0103
>CPU: 0
>EIP: 0010:[<0010974c>]
>EFLAGS: 00010246
>eax: 00000002 ebx: 00008220 ecx: fffffc18 edx: 001b1f5c
>esi: 001b1784 edi: 00000000 ebp: 00009000 esp: 001b1738
>ds: 0018 es: 0018 fs: 002b gs: 0018 ss: 0018
>Process swapper (pid: 0, process nr: 0, stackpage=001af7a8)
>Stack: 001b1f5c 0010a845 00000100 00109410 0000001f 001b1784 00000000 00009000
> ffffffda 00000018 00000018 00100018 00190018 00000070 001090b7 00000010
> 00000246 0010927d 00000000 7f6e6547 0009e200 00101ffe 00000000 001aeea8
>CallTrace: [<0010a845>] [<00109410>] [<00190018>] [<0010927d>]
>Code: 83 3d 94 f7 1a 00 00 74 02 31 db e8 24 88 00 00 eb aa 89 f6
>kfree of non-kmalloced memory: 001b17f0, next= 00000000, order=0
>kfree of non-kmalloced memory: 001b17e0, next= 00000000, order=0
>kfree of non-kmalloced memory: 001b1cf4, next= 00000000, order=0
>idle task may not sleep
>idle task may not sleep
>idle task may not sleep
>idle task may not sleep
>idle task may not sleep
>
>(I wrote this on a piece of paper and hope that all numbers are correct!)
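
The "0103" after "segment not present" is the error code the CPU pushes for
that exception. The sketch below decodes it using the standard x86
selector error code layout (bit 0 = external event, bit 1 = index refers to
the IDT, bit 2 = GDT/LDT, bits 3-15 = index). The names sel_error and
decode_sel_error are made up for illustration and are not kernel code.

```c
#include <stdio.h>

/* Fields of an x86 selector-style exception error code. */
struct sel_error {
    unsigned ext;    /* bit 0: event was external to the running code */
    unsigned idt;    /* bit 1: index refers to a gate in the IDT */
    unsigned ti;     /* bit 2: 0 = GDT, 1 = LDT (only meaningful if idt == 0) */
    unsigned index;  /* bits 3..15: index into the selected table */
};

/* Split a raw error code (e.g. 0x0103 from the oops) into its fields. */
static struct sel_error decode_sel_error(unsigned code)
{
    struct sel_error e;

    e.ext = code & 1u;
    e.idt = (code >> 1) & 1u;
    e.ti = (code >> 2) & 1u;
    e.index = (code >> 3) & 0x1fffu;
    return e;
}
```

For 0x0103 this gives ext=1, idt=1 and index=32 (0x20), i.e. the IDT gate
for vector 0x20 was seen as "not present" while an external interrupt was
being delivered. Assuming hardware IRQs start at vector 0x20 on this kernel,
that would be the timer interrupt, which fits an EIP pointing at the
instruction right after the hlt. Treat this as a reading of the numbers, not
a diagnosis: the values were copied by hand from the screen.
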
>
>001096e0 <sys_idle>:
> 1096e0: 53 pushl %ebx
> 1096e1: 31 db xorl %ebx,%ebx
> 1096e3: a1 98 27 1d 00 movl 0x1d2798,%eax
> 1096e8: 83 78 6c 00 cmpl $0x0,0x6c(%eax)
> 1096ec: 74 12 je 109700 <sys_idle+20>
> 1096ee: b8 ff ff ff ff movl $0xffffffff,%eax
> 1096f3: 5b popl %ebx
> 1096f4: c3 ret
> 1096f5: 8d 74 26 00 leal 0x0(%esi,1),%esi
> 1096f9: 8d bc 27 00 00 leal 0x0(%edi,1),%edi
> 1096fe: 00 00
> 109700: c7 40 04 9c ff movl $0xffffff9c,0x4(%eax)
> 109705: ff ff
> 109707: 90 nop
> 109708: 85 db testl %ebx,%ebx
> 10970a: 75 06 jne 109712 <sys_idle+32>
> 10970c: 8b 1d 40 23 1b movl 0x1b2340,%ebx
> 109711: 00
> 109712: a1 40 23 1b 00 movl 0x1b2340,%eax
> 109717: 29 d8 subl %ebx,%eax
> 109719: 83 f8 21 cmpl $0x21,%eax
> 10971c: 76 12 jbe 109730 <sys_idle+50>
> 10971e: e8 7d ff ff ff call 1096a0 <hard_idle>
> 109723: eb 27 jmp 10974c <sys_idle+6c>
> 109725: 8d 74 26 00 leal 0x0(%esi,1),%esi
> 109729: 8d bc 27 00 00 leal 0x0(%edi,1),%edi
> 10972e: 00 00
> 109730: 80 3d a3 ee 1a cmpb $0x0,0x1aeea3
> 109735: 00 00
> 109737: 74 13 je 10974c <sys_idle+6c>
> 109739: 83 3d c0 e7 1a cmpl $0x0,0x1ae7c0
> 10973e: 00 00
> 109740: 75 0a jne 10974c <sys_idle+6c>
> 109742: 83 3d 94 f7 1a cmpl $0x0,0x1af794
> 109747: 00 00
> 109749: 75 0a jne 109755 <sys_idle+75>
> 10974b: f4 hlt
> 10974c: 83 3d 94 f7 1a cmpl $0x0,0x1af794
>^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 109751: 00 00
> 109753: 74 02 je 109757 <sys_idle+77>
> 109755: 31 db xorl %ebx,%ebx
> 109757: e8 24 88 00 00 call 111f80 <schedule>
> 10975c: eb aa jmp 109708 <sys_idle+28>
> 10975e: 89 f6 movl %esi,%esi
>
>Here's the kernel source of the asm code:
>
>asmlinkage int sys_idle(void)
>{
>        unsigned long start_idle = 0;
>
>        if (current->pid != 0)
>                return -EPERM;
>        /* endless idle loop with no priority at all */
>        current->counter = -100;
>        for (;;)
>        {
>                /*
>                 * We are locked at this point. So we can safely call
>                 * the APM bios knowing only one CPU at a time will do
>                 * so.
>                 */
>                if (!start_idle)
>                        start_idle = jiffies;
>                if (jiffies - start_idle > HARD_IDLE_TIMEOUT)
>                {
>                        hard_idle();
>                }
>                else
>                {
>                        if (hlt_works_ok && !hlt_counter && !need_resched)
>                                __asm__("hlt");
>                }
>!!!!!!!!!->     if (need_resched)
>                        start_idle = 0;
>                schedule();
>        }
>}
>
>These are the functions of the CallTrace:
>
>CallTrace: [<0010a845>] [<00109410>] [<00190018>] [<0010927d>]
>
>0010a845: system_call+0x55 (system_call = 0010a7f0)
>00109410: init
>00190018: calc_vol+0x68 (calc_vol = 0018ffb0)
>0010927d: start_kernel+0x1ad (start_kernel = 001090d0)
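
The symbol+offset values above are simply "address minus the start of the
nearest preceding symbol". A minimal sketch of that lookup follows, using
only the symbol addresses quoted in this mail. The table is deliberately
partial, and the names ksym, ksym_lookup and ksym_format are hypothetical;
a real lookup would use the complete System.map of the running kernel.

```c
#include <stddef.h>
#include <stdio.h>

/* One symbol table entry, as in a System.map-style listing. */
struct ksym {
    unsigned long addr;
    const char *name;
};

/* Start addresses quoted in the mail, sorted by address (partial table). */
static const struct ksym ksyms[] = {
    { 0x001090d0UL, "start_kernel" },
    { 0x00109410UL, "init" },
    { 0x001096e0UL, "sys_idle" },
    { 0x0010a7f0UL, "system_call" },
    { 0x0018ffb0UL, "calc_vol" },
};

/* Return the last symbol whose start address is <= addr, or NULL. */
static const struct ksym *ksym_lookup(unsigned long addr)
{
    const struct ksym *best = NULL;
    size_t i;

    for (i = 0; i < sizeof(ksyms) / sizeof(ksyms[0]); i++) {
        if (ksyms[i].addr <= addr)
            best = &ksyms[i];
    }
    return best;
}

/* Write "name+0xoffset" (or the bare address) into buf and return buf. */
static char *ksym_format(unsigned long addr, char *buf, size_t len)
{
    const struct ksym *s = ksym_lookup(addr);

    if (s == NULL)
        snprintf(buf, len, "0x%08lx", addr);
    else
        snprintf(buf, len, "%s+0x%lx", s->name, addr - s->addr);
    return buf;
}
```

Calling ksym_format(0x0010974c, buf, sizeof buf) yields "sys_idle+0x6c",
matching the EIP marked in the disassembly, and 0x0010a845 yields
"system_call+0x55". With a partial table, an address that really belongs to
an unlisted function is attributed to the previous listed symbol, so any
large offset should be treated with suspicion.
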
>
>
>I upgraded from 24 to 48 MB some time ago *BEFORE* the freezes happened
>for the first time. Could bad SIMMs be the cause of this problem?
>Actually, I fear this is the case as there doesn't seem to be an
>obvious bug in the above code - at least not at the EIP address.
>But why are there "kfree of non-kmalloced memory" messages?
>
>Another observation: I just noticed that there are three remaining
>files in /tmp from the configure run just before the freeze (I don't
>know if it's from the first or the second configure run):
>
>-rw-r--r-- 1 root root 208 Dec 14 22:36 cca04047.i
>-rw-r--r-- 1 root root 1728 Dec 14 22:36 cca04047.s
>-rw-r--r-- 1 root root 2108 Dec 14 22:36 cca040471.o
>
>The funny thing is that those files don't contain any code but parts
>of e-mails and news postings that I've read before the freeze!
>
>Could this be an indication of buffer cache trashing? Or is this
>just the result of written meta data and not written data?
>
>If you need more information feel free to mail me!
>
>Good night,
>
>Karsten Weiss UUCP: karsten@addx.au.s.shuttle.de
>>ASK FOR PGP KEY< INTERNET: knweiss@trick.informatik.uni-stuttgart.de
>
>
>
>