Re: Suggestions with hard lockup on 4 systems, have oops report

From: Adam Kropelin
Date: Fri Jul 16 2004 - 12:35:11 EST


On Fri, Jul 16, 2004 at 11:01:39AM -0400, Brian McEntire wrote:
> Thank you for taking time from your busy days to read this. You all
> (kernel maintainers) rock! :)
>
> I have four Linux hosts, with identical hardware and OSs, that exhibit a
> very tough to troubleshoot hang/freeze. About once every two weeks (and

<snip>

> The OS specifics:
> RH 7.2 with latest patches except running kernel 2.4.9-31enterprise for
> CM reasons (at one point, I tried the latest available RH 7.2 kernel but
> it did not improve stability so I went back.)
> bcm5700-7.1.22-1
> nvidia ?? (no RPM listed, didn't know where to find the version.)

You've really got to eliminate the binary bcm5700 and nvidia modules in
order to diagnose this. Based on the oops, bcm5700 looks suspect, but it
could just be the unlucky guy whose memory was stepped on by nvidia or
some other part of the kernel.

Switch to a NIC supported by an open-source driver, such as e1000,
temporarily (or better yet, permanently) and see if the lockup persists.
Do the same for nvidia by using an open-source video driver. If you can
reproduce the problem without ever having loaded either module
(unloading a module once it's loaded is not sufficient), post the new
oops and you'll have a solid foundation for debugging.
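
As a rough aid for the check described above, here is a minimal Python
sketch that reports whether either suspect module is currently loaded.
It assumes each line of /proc/modules starts with the module name; the
helper names (loaded_suspects, read_proc_modules) are made up for
illustration and are not from any existing tool.

```python
# Hypothetical helper for illustration; not part of any existing tool.
# Module names are the two binary modules mentioned in the report.
SUSPECT_MODULES = ("bcm5700", "nvidia")


def loaded_suspects(modules_text, suspects=SUSPECT_MODULES):
    """Return the suspect names found in /proc/modules-style text.

    Assumes the first whitespace-separated field of each line is the
    module name.
    """
    loaded = set()
    for line in modules_text.splitlines():
        fields = line.split()
        if fields:
            loaded.add(fields[0])
    return [name for name in suspects if name in loaded]


def read_proc_modules(path="/proc/modules"):
    """Read the kernel's list of currently loaded modules."""
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    found = loaded_suspects(read_proc_modules())
    if found:
        print("binary modules currently loaded: " + ", ".join(found))
    else:
        print("no suspect modules currently loaded")
```

Note that this only sees modules loaded at the moment it runs. It cannot
show that a module was never loaded since boot, so it is best run right
after boot and again just before trying to reproduce the lockup.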

--Adam
