Martin Bligh wrote:
We have to get to the bottom of this - there's a shadow over about 500
patches and we don't know which.
iirc I tried to reproduce this a couple of weeks back and failed.
Are you able to narrow it down to a particular LTP test? It was mtest01 or
something like that? Perhaps we can identify a particular command line
which triggers the fault in a standalone fashion?
The LTP output is here:
http://test.kernel.org/abat/33803/debug/test.log.1
The last test run was memset01
From a good test run (http://test.kernel.org/abat/33964/debug/test.log.1)
the one after memset01 is a second instance of the same.
Which is bad I suppose, in that it's likely an intermittent failure.
Perhaps you can try running memset01 in a loop? I don't have such a
box set up here right now, I'm afraid ... will see what I can do.
OTOH, it looks like this might be a different failure than the double
fault we saw in previous -mm's, which was consistently in mtest01, IIRC.
As a shot in the dark, I've seen problems on my Athlon 64 box with a program that does memset on a huge chunk of memory repeatedly which causes the machine to panic in various ways, lock up or reboot. Is this what that test is doing? I suspect my problem is caused by a AMD Athlon 64/Opteron CPU erratum 97 "128-Bit Streaming Stores May Cause Coherency Failure". The Newcastle CPU I have has this bug which can cause loss of coherency on non-temporal stores, which the glibc memset function uses. The BIOS is supposed to apply a workaround but I've no way of knowing if mine (Asus A8N-SLI Deluxe) is..
And no it's not a memory problem, the system passes memtest86 overnight without error. The problem usually shows up within a minute of starting the continuous-memset program..