From: Andy Whitcroft
Date: Mon Jun 11 2007 - 09:59:25 EST
Andy Whitcroft wrote:
> H. Peter Anvin wrote:
>> Andy Whitcroft wrote:
>>>> It definitely sounds like a memory clobber of some sort.
>>>> Usual suspects, in addition to the input/output buffers you already
>>>> looked at, would be the heap and the stack. Finding where the stack
>>>> pointer lives would be my first, instinctive guess.
>>> The stack seems to be where it should be and seems to stay pretty much
>>> in the same place as it should. Adding checks for the heap also seem to
>>> stay within bounds. I've tried making the stack and the heap 64k to no
>>> Moving the kernel to other places in memory seems to kill the decode
>>> completely during gunzip() which may be a hint I am not sure.
>>> This thing is trying to ruin my mind.
>> Yours and mine both. Seems like *something* is clobbering memory, but
>> what and why is a mystery. The fact that putting the kernel in a higher
>> point in memory is a good indication that this clobber is at a
>> relatively high address.
>> How much RAM does this machine have?
> This is as 12GB machine. 3 numa nodes.
> I checked out the location of the IDT and GDT and both seem sane, in the
> 9xxxx range below the kernel destination.
> I also note that on another machine of this type, one Node only in that
> case some of the "did work" cases do not work. Also when I applied some
> of my patches on the top "working" cases stopped working. So whatever
> it is is definatly related to the shape of the kernel to be loaded.
> Very confusing.
Ok, in fact when the kernel is moved elsewhere in the address space it
will decode properly. There was a check in there for not loading at the
right address which was catching me out ... as errors do not show up as
we have no serial support. Doh.
Once I had gotten this decoding at other addresses I simply tried moving
the base address for the kernel elsewhere. I am able to successfully
boot the kernel at 16MB and 256MB. This seems like something outside
the decoder scribbling.
I would not normally recommend moving the base address of the kernel.
However, this problem at least so far has only shown up on the NUMA-Q
platform which is at best described as a very small volume
sub-architecture. There are areas in which it differers from mainstream
BIOS and we are no longer able to get details of these differences.
We actually have no proof as yet this is or is not a NUMA-Q specific
problem. For instance these machines tend to run less modules and more
builtin stuff than the average due to an owner dislike of modules. So
we could have a lurking kernel size issue or similar.
I am therefore proposing change the base address for NUMA-Q only (patch
to follow this email). And that we remain aware of the issue and on the
lookout for similar breakage on mainstream x86 platforms. At least with
this patch we can get wider testing on the rest of the kernel.
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/