This morning (5am here) I had an LDS on one of my servers, the first I've
ever had; the machine had 36 days of uptime. I managed to get some info
from it.
So, preliminaries. Machine details: Intel DX2/66, ASUS SP3G motherboard,
NCR53c810, 16 MB RAM, two Cyclades boards. Stock Linux kernel
1.2.13 (no patches, no nothing). When the LDS occurred the machine was
absolutely idle: it acts as a terminal server and nobody was
dialed in. I'm also fairly sure about the time the LDS occurred, to within
one minute at most, because I have a small program that prints the uptime
info on a console at one-minute intervals. I was watching that
console while having my coffee (eh, after a looong working night) and
something seemed strange when the next line didn't show up at the
expected moment. I tried to switch consoles; no luck. The system wasn't
responding to me at all. I waited five more minutes. Nothing.
Ctrl-Alt-Del. Nothing. So I think I had an LDS.
Using Alt+ScrollLock I managed to get this data:
EIP values (in the order I got them):
EIP: 0010:001214EA EFLAGS: 00010006
EIP: 0010:001214E6 EFLAGS: 00010017
EIP: 0010:001214E6 EFLAGS: 00010017
EIP: 0010:001214E6 EFLAGS: 00010017
EIP: 0010:001214E6 EFLAGS: 00010017
EIP: 0010:001214E6 EFLAGS: 00010017
EIP: 0010:001214E6 EFLAGS: 00010017
EIP: 0010:001214E6 EFLAGS: 00010017
EIP: 0010:0012FEE4 EFLAGS: 00010206
EIP: 0010:0012FEE4 EFLAGS: 00010206
EIP: 0010:001214E6 EFLAGS: 00010017
EIP: 0010:001214E6 EFLAGS: 00010017
EIP: 0010:001214E6 EFLAGS: 00010017
EIP: 0010:001214E6 EFLAGS: 00010017
EIP: 0010:001214E6 EFLAGS: 00010017
EIP: 0010:001214E6 EFLAGS: 00010017
EIP: 0010:001214E6 EFLAGS: 00010017
EIP: 0010:001214E6 EFLAGS: 00010017
...and constant from then on...
System.map lookup:
EIP: 001214EA
EIP: 001214E6
---------------------------------------
0012129c T _free_pages
>>>> 0012145c T ___get_free_pages
001215ec T ___get_dma_pages
EIP: 0012FEE4
---------------------------------------
0012fde8 T _permission
>>>> 0012fea8 T _get_write_access
0012ff18 T _put_write_access
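
For anyone who wants to repeat the lookup, here is a minimal sketch in Python of the same idea: take the symbol with the highest address that is still at or below the EIP, and report the offset into it. The function names (parse_system_map, lookup) are made up for illustration, and the table holds only the six System.map lines quoted above, not the full map, so it is only good for these three addresses.

```python
import bisect

# Only the System.map lines quoted in this message (an excerpt, not the full map).
SYSTEM_MAP = """\
0012129c T _free_pages
0012145c T ___get_free_pages
001215ec T ___get_dma_pages
0012fde8 T _permission
0012fea8 T _get_write_access
0012ff18 T _put_write_access
"""

def parse_system_map(text):
    """Return a sorted list of (address, name) pairs from 'addr type name' lines."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols

def lookup(symbols, eip):
    """Return (name, offset) for the last symbol starting at or below eip, or None."""
    addrs = [addr for addr, _ in symbols]
    i = bisect.bisect_right(addrs, eip) - 1
    if i < 0:
        return None
    addr, name = symbols[i]
    return name, eip - addr

symbols = parse_system_map(SYSTEM_MAP)
for eip in (0x001214EA, 0x001214E6, 0x0012FEE4):
    name, offset = lookup(symbols, eip)
    print(f"{eip:08X} -> {name}+0x{offset:x}")
```

With the excerpt above this gives ___get_free_pages+0x8e, ___get_free_pages+0x8a and _get_write_access+0x3c, which agrees with the lines marked >>>>.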
If more details are needed and somebody feels like digging into
this, please ask me. It looks to me like a memory-management problem; maybe
the experts here would like to have a word to say ...
I've applied a sane reboot to this machine and now I'm going home. Home...
Cristian Gafton
| Cristian Gafton, SysAdm gafton@cccis.sfos.ro
| -------------------------------------------------------------------
| Computers & Communications Center str. Moara de Foc nr. 35
| Phone: 40-32-252936, 252938 PO-BOX 2-549
| Fax: 40-32-252933 IASI 6600, ROMANIA
| ===================================================================
| Good code is hard to write, so it must be hard to understand.