Re: Something very strange on x86_64 2.6.X kernels

From: Andrew Morton
Date: Thu Jan 20 2005 - 16:11:04 EST


Eric Dumazet <dada1@xxxxxxxxxxxxx> wrote:
>
> Hi Andi
>
> I have very strange coredumps happening on a big 64bits program.
>
> Some background :
> - This program is multi-threaded
> - Machine is a dual Opteron 248 machine, 12GB ram.
> - Kernel 2.6.6 (tried 2.6.10 too, with the same problems)
> - The program uses hugetlb pages.
> - The program uses prefetchnta
> - The program uses about 8GB of ram.
>
> After numerous different core dumps of this program, and gdb debugging
> I found :
>
> Every time the crash occurs when one thread is using some ram located at
> virtual address 0xffffe6xx

What does "using" mean? Is the program executing from that location?

> When examining the core image, the data saved on this page seems correct
> (i.e. contains coherent user data). But one register (%rbx) is usually
> corrupted and contains a small value (like 0x3c)
>
> The last instruction using this register is :
> prefetchnta 0x18(,%rbx,4)
>
>
> Examining Linux sources, I found that 0xffffe000 is 'special' (ia32
> vsyscall) and 0xffffe600 falls in the sigreturn subsection of this special area.
>
> Is it possible some vm trick just kicks in and corrupts my true 64-bit
> program ?
>
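
For reference, the address that `prefetchnta 0x18(,%rbx,4)` would touch with the corrupted %rbx can be worked out by hand. The helper below is only an illustrative sketch (the function name is hypothetical, not taken from any kernel or program source); it evaluates a base-less operand of the form disp(,index,scale).

```c
#include <stdint.h>

/*
 * Effective address of a base-less SIB operand, written in AT&T
 * syntax as disp(,index,scale). Hypothetical helper for illustration.
 */
static uint64_t sib_effective_address(uint64_t index, unsigned scale,
                                      int32_t disp)
{
    /* The 32-bit displacement is sign-extended before it is added. */
    return index * (uint64_t)scale + (uint64_t)(int64_t)disp;
}
```

With index = 0x3c, scale = 4 and disp = 0x18 this gives 0x3c * 4 + 0x18 = 0x108, i.e. a very low address that a 64-bit program would not normally have mapped.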

Interesting. IIRC, Opterons will very occasionally (and incorrectly) take
a fault when performing a prefetch against a dud pointer. The kernel will
fix that up. At a guess, I'd say that the fixup code isn't doing the right
thing when the faulting EIP is in the vsyscall page.
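
To make the kind of check involved concrete, here is a rough sketch of how fixup code might decide that the faulting instruction is a prefetch. It is not the kernel's actual code: the function names and the `long_mode` parameter are assumptions made for illustration. It skips legacy prefixes (and a REX prefix, which only exists in 64-bit mode) and then looks for the two-byte opcodes 0x0F 0x0D (3DNow! prefetch) or 0x0F 0x18 (SSE prefetch group, which includes prefetchnta).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Legacy prefixes that may precede the opcode bytes. */
static bool is_legacy_prefix(uint8_t b)
{
    switch (b) {
    case 0x26: case 0x2E: case 0x36: case 0x3E: /* ES, CS, SS, DS */
    case 0x64: case 0x65:                       /* FS, GS */
    case 0x66: case 0x67:                       /* operand/address size */
    case 0xF0: case 0xF2: case 0xF3:            /* LOCK, REPNE, REP */
        return true;
    default:
        return false;
    }
}

/*
 * Hypothetical check: do the bytes at the faulting instruction
 * pointer decode as a prefetch? `long_mode` says whether the code
 * was running as 64-bit; bytes 0x40-0x4F are REX prefixes only then.
 */
static bool looks_like_prefetch(const uint8_t *insn, size_t len,
                                bool long_mode)
{
    size_t i = 0;

    while (i < len && is_legacy_prefix(insn[i]))
        i++;
    if (long_mode && i < len && (insn[i] & 0xF0) == 0x40)
        i++;                        /* skip a single REX prefix */
    if (i + 1 >= len)
        return false;               /* need two opcode bytes */
    return insn[i] == 0x0F &&
           (insn[i + 1] == 0x0D || insn[i + 1] == 0x18);
}
```

The instruction from the report, `prefetchnta 0x18(,%rbx,4)`, encodes as 0F 18 04 9D 18 00 00 00, so a check along these lines would accept it. Note that the answer can depend on whether the bytes are decoded as 32-bit or 64-bit code, which is one place where an address inside the ia32 vsyscall range could plausibly confuse things.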