This has happened fairly consistently over about a week now. One of
the machines was massively upgraded about 48 hours before the first
freeze. The other machine was untouched. The upgraded machine is
running 2.0.29 + pre-patch-30, the other a 2.0.27 kernel.
The machines have different motherboards, Ethernet cards (3c509 on
one, SMC Ultra on the other), and SCSI controllers (BusLogic on one,
ncr53c810 on the other).
There are 5 other machines in the same room, all unaffected. The
affected machines are PPro 200s; there is another PPro in the room
that's unaffected.
I've tried both off the UPS, both on the UPS, and one on, one off. No
difference. I've tried unplugging the keyboard and monitor from
both. Ditto. The room is air-conditioned, normally kept at around 19
degrees Celsius. The cases for these machines have 3 fans forcing air.
The machines are running different applications (one's doing bulk web
servers, the other is running squid).
Just to repeat: the 2nd machine that started freezing did so with NO
changes. No hardware, no software. It had been up for nearly 2 months
when the first freeze happened. Each machine has frozen 6 times since
the first one did it. They freeze at random times.
At this point, I'm fresh out of ideas, so I'd like to get some idea of
what the machine was doing when it froze. To that end, I'd like to have
the program counter (PC) updating on the screen regularly.
The obvious thing that springs to mind is to change the cli() macro to
poke the program counter straight into the screen memory when it's
called. This is almost nice and easy, something like:
#define cli() ({ extern unsigned char * __origin; \
        unsigned long pc = ???PC???; \
        *__origin = pc; \
        __asm__ __volatile__ ("cli": : :"memory"); \
})
Problem is, how do I easily stuff the program counter into a variable?
This is going to take an assembly statement as far as I know, but my
i386 assembly is non-existent. So, any suggestions from an assembler
wizard?
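
For what it's worth, here is a rough userspace sketch of the sort of
thing I have in mind, assuming GCC-style inline assembly and statement
expressions. It is untested in a real kernel. The names CURRENT_PC,
show_pc, fake_screen and TRACED_CLI are all made up for illustration,
and fake_screen is only a stand-in for real screen memory (I'm assuming
the usual text-mode layout of a character byte followed by an attribute
byte per cell). The actual "cli" instruction is left out because it is
privileged and can't run outside the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for text-mode screen memory: one row of 80
 * cells, each a character byte followed by an attribute byte. */
static unsigned char fake_screen[80 * 2];

/* Fetch (approximately) the address of the code being executed. */
#if defined(__i386__)
/* "call" pushes the address of label 1; "popl" retrieves it. */
#define CURRENT_PC() ({ unsigned long pc_; \
        __asm__ __volatile__ ("call 1f\n1:\tpopl %0" : "=r" (pc_)); \
        pc_; })
#elif defined(__x86_64__)
/* RIP-relative lea yields the address of the next instruction. */
#define CURRENT_PC() ({ unsigned long pc_; \
        __asm__ __volatile__ ("leaq 0(%%rip), %0" : "=r" (pc_)); \
        pc_; })
#else
/* Fallback using GCC's labels-as-values extension. */
#define CURRENT_PC() ({ __label__ here_; unsigned long pc_; \
        here_: pc_ = (unsigned long) &&here_; \
        pc_; })
#endif

/* Write the low 32 bits of pc as 8 hex digits into the first 8
 * character cells, leaving the attribute bytes untouched. */
static void show_pc(volatile unsigned char *screen, unsigned long pc)
{
        static const char hex[] = "0123456789abcdef";
        int i;

        assert(screen != NULL);
        for (i = 0; i < 8; i++) {
                /* Most significant nibble first; cell i's character
                 * byte lives at offset 2*i. */
                screen[2 * i] = hex[(pc >> (28 - 4 * i)) & 0xf];
        }
}

/* Hypothetical traced variant: record where we were called from.
 * In the kernel the real "cli" instruction would follow here. */
#define TRACED_CLI() do { \
        show_pc(fake_screen, CURRENT_PC()); \
} while (0)
```

Writing hex digits rather than the raw value at least makes the result
readable on the screen. Note that "call" and "popl" leave the
arithmetic flags alone, but the shifting and masking in show_pc do not.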
Also, does anything rely on the flag remaining unchanged over a call
to cli()?
Thanks,
Michael.