Re: good 2.1.x SMP kernel is?

John Kennedy (jk@csuchico.edu)
Sat, 19 Dec 1998 10:28:34 -0800 (PST)


I'll sort of summarize and elaborate on what I said before, by way
of replies...

[me]
> I just got my new system up that uses dual 450MHz PIIs on an
> ASUS P2B-DS. So far I've tried 2.1.130 (what I was using UP at the
> time and seemed stable enough), 2.1.131 (stock, -ac7 & -ac11) and now
> pre-2.1.132-2. All tend to croak unhelpfully, at least so far.

[John Allensworth]
> Dunno if it's related, but I posted a problem report earlier. Dual
> PII/300's on the Chaintech 6DBU board (equiv). I've had complete
> system locks like you mention below with UP 2.0.35 and 36, and SMP
> 2.1.131 stock, ac9, and ac13 SMP. My console apps are stable enough,
> but xemacs, netscape, and acroread all hose it. I was told to submit
> it as an X bug, but still haven't gotten any reports back :( I have
> a feeling my problem stems from some obscure incompatibility with
> my vid card though...

I haven't gotten a running Xconfig for the new video card in that
system yet (Matrox G200, AGP connector) so I've been running it in just
the regular old console mode. I have been using X *clients* in some
cases (xcpustate, xdaliclock) since it is pretty obvious when the system
freezes with something updating pretty frequently.

[Jens Axboe]
> I don't have any good ideas, but I'm running the same motherboard
> as you and have never had any problems with 2.1.1xx kernels. Maybe
> your problem is hw related? Faulty RAM, remarked CPU's, etc?

How to tell? Most of the RAM problems you hear about tend to hit
applications (core dumps from corruption, etc.) a good chunk of the
time, and the box has failed a fair number of times without my seeing
that happen. They're native 450MHz CPUs running at the right speed
(not overclocked) with the recommended cooling fans, etc.

Sitting idle, the box seems happy. In trying to syslog the problem, I
mounted all the writable ext2fs filesystems in sync mode. The box seems
to be doing the same tasks just fine, if one hell of a lot slower (what
you would expect for sync vs async).
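For a rough sense of why sync mode is so much slower, the sketch below
times the same writes to one file with and without O_SYNC. This is only
an analogy for a sync mount, not the same mechanism, and the function
name, block count, block size and file names are made up for
illustration.

```python
import os
import tempfile
import time


def time_writes(path, sync, blocks=64, block_size=4096):
    """Write `blocks` chunks to `path` and return the elapsed seconds.

    With sync=True the file is opened O_SYNC, so each write() waits for
    the data to reach the disk, roughly what a sync mount imposes.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if sync:
        flags |= os.O_SYNC
    data = b"x" * block_size
    start = time.perf_counter()
    fd = os.open(path, flags, 0o600)
    try:
        for _ in range(blocks):
            os.write(fd, data)
    finally:
        os.close(fd)
    return time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        buffered = time_writes(os.path.join(tmp, "async.dat"), sync=False)
        synced = time_writes(os.path.join(tmp, "sync.dat"), sync=True)
        print(f"buffered: {buffered:.4f}s  O_SYNC: {synced:.4f}s")
```

How large the gap is depends entirely on the disk and filesystem, so
the printed numbers are only meaningful relative to each other.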

It feels like a disk-related race to me: lots of interrupts causing
it to freak out, or something along those lines.

[Eric Lee Green]
> What exactly are you doing right before it crashes? I have a customer
> with a similar problem under 2.1.131 with the same setup (except with
> an ICP-Vortex GDT RAID controller instead of the Adaptec SCSI on the
> P2B-DS), ... 2.1.131 kernel, SMP, the driver in 2.1.131 is
> ICP-Vortex's latest, the stock token ring driver, 1gb of memory,
> 72gb RAID partition.

I am using the native Adaptec controller and have about 27GB total:
one 9GB wide drive and one 18GB LVD drive. I'm crunching up my homebrew
Linux distribution via my build script, so basically lots of compiling.

Because my build script does the same thing every time and the last
few times I've knocked it back to stage 0, I can tell that it is crashing
in different places compiling different packages. Once, when I didn't
restart it from stage 0, it looked like it was briefly in a tight loop
throwing out a bunch of `sync' commands into the background when it bailed
(so not much actual I/O at all). Since it looks like I'm going to have
to be on the console with the screen unblanked and watching to catch it,
that is going to be the first thing I chase. (:
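One way to pin down where each run dies without watching the console,
sketched under assumptions: record every build step and force the
record to disk before the step starts, so the last line survives a hard
lock. The build script itself is not shown in this thread, so
log_stage, last_record and the record format are hypothetical.

```python
import os


def log_stage(log_path, stage, package):
    """Append one progress record and force it to disk before returning."""
    record = f"stage={stage} package={package}\n"
    with open(log_path, "a") as log:
        log.write(record)
        log.flush()             # push Python's buffer to the kernel
        os.fsync(log.fileno())  # ask the kernel to push it to the disk


def last_record(log_path):
    """Return the last complete record, or None if the log is empty."""
    with open(log_path) as log:
        # Skip a torn final line that lacks its trailing newline.
        lines = [line.strip() for line in log if line.endswith("\n")]
    return lines[-1] if lines else None
```

After a reboot, last_record() names the step that was running when the
box went down. Comparing that across runs would show whether the
failures really are scattered.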

Again, it smells like a race to me. One compile isn't that different
from another, yet some will take the box out and some won't. After it
reboots and you do the same thing over again, it will crash in a
different place. It will just reboot or seize up -- CTRL-ALT-DELETE
won't fix it and the console won't wake up.

Running a full build on a synced fs is sort of an interesting test, but
I'm obviously not stressing the CPU at all and things certainly aren't
being very speedy. I'm probably either not exercising the problem or
otherwise avoiding the race.

Another test would be to pull out a CPU, run it with both UP & SMP
kernels and rebuild again, once per CPU, and see if it fails. That ought
to exercise most of the components. If I grind away that far without
finding anything, I'll probably pull out individual RAM modules (I have
2x128MB) and see if that makes a difference. It isn't misbehaving the
way most people describe when they have RAM problems, though.
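The isolation plan above can be written out as an explicit list of runs
so none gets skipped. This is a minimal sketch: the CPU and module
labels are hypothetical, and running the RAM tests with both CPUs under
an SMP kernel is an assumption (that being the configuration that
fails), not something stated above.

```python
from itertools import product

CPUS = ["cpu0", "cpu1"]           # hypothetical labels for the two CPUs
KERNELS = ["UP", "SMP"]
RAM_MODULES = ["dimm0", "dimm1"]  # hypothetical labels for the 2x128MB


def isolation_plan():
    """Return the runs in order: single-CPU runs, then single-module runs."""
    plan = []
    # Stage 1: one CPU installed at a time, each with UP and SMP kernels.
    for cpu, kernel in product(CPUS, KERNELS):
        plan.append({"cpus": [cpu], "kernel": kernel,
                     "ram": list(RAM_MODULES)})
    # Stage 2 (assumed setup): both CPUs, SMP kernel, one module at a time.
    for module in RAM_MODULES:
        plan.append({"cpus": list(CPUS), "kernel": "SMP", "ram": [module]})
    return plan


if __name__ == "__main__":
    for number, run in enumerate(isolation_plan(), 1):
        print(number, run)
```

That comes to six full rebuilds, four for the CPU stage and two for the
RAM stage.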

I have tested it for *really* long periods of time playing Half-Life (:
under Win98. That will only abuse one CPU, but I didn't see anything
that didn't look like a game-related problem (CTRL-ALT-DELETE gets you
the standard "nuke task" menu and you can recover).
--- john
