Re: NUMA allocator on Opteron systems does non-local allocation on node0

From: Oliver Weihe
Date: Mon Oct 13 2008 - 11:27:44 EST


Hello,

It seems that my reproducer is not very good. :(
It "works" much better when you start several processes at once:

for i in `seq 0 3`
do
numactl --cpunodebind=${i} ./app &
done
wait

"app" still allocates some memory (7GiB per process) and fills the array
with data.
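
In essence "app" does nothing more than this (a minimal sketch; the
real program differs in details such as the fill pattern):

/* Minimal sketch of the reproducer: allocate a ~7GiB array and
 * touch every page so physical pages actually get allocated. */
#include <stdlib.h>
#include <string.h>

int main(void)
{
        size_t size = 7UL << 30;        /* ~7 GiB */
        char *buf = malloc(size);

        if (buf == NULL)
                return 1;
        /* The writes are what trigger the page allocations and thus
         * the NUMA placement decisions. */
        memset(buf, 0x55, size);
        return 0;
}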


I've noticed this behaviour during some HPL (Linpack benchmark from/for
top500.org) runs. For small data sets there's no difference in speed
between the kernels, while for big data sets (almost the whole memory)
2.6.23 and newer kernels are slower than 2.6.22.
I'm using OpenMPI with the runtime option "--mca mpi_paffinity_alone 1"
to pin each process to a specific CPU.

The bad news is: I can crash almost every quad-core Opteron system
running kernels 2.6.21.x to 2.6.24.x with "parallel memory allocation
and filling the memory with data" (parallel means: there is one process
per core doing this). While it takes some time on dual-socket machines,
on quad-socket quad-cores it often takes less than 1 minute until the
system freezes.
Just in case it is some vendor-specific BIOS bug: we're using
Supermicro mainboards.

> [Another copy of the reply with linux-kernel added this time]
>
> > In my setup I'm allocating an array of ~7GiB memory size in a
> > single-threaded application.
> > Startup: numactl --cpunodebind=X ./app
> > For X=1,2,3 it works as expected: all memory is allocated on the
> > local node.
> > For X=0 I can see the memory being allocated on node0 as long as
> > ~3GiB are "free" on node0. At this point the kernel starts using
> > memory from node1 for the app!
>
> Hmm, that sounds like it doesn't want to use the 4GB DMA zone.
>
> Normally there should be no protection on it, but perhaps something
> broke.
>
> What does cat /proc/sys/vm/lowmem_reserve_ratio say?

2.6.22.x:
# cat /proc/sys/vm/lowmem_reserve_ratio
256 256

2.6.23.8 (and above):
# cat /proc/sys/vm/lowmem_reserve_ratio
256 256 32

(The extra third entry presumably appears because 2.6.23 introduced
ZONE_MOVABLE, so there is one more zone to protect.)
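
For what it's worth, page placement can also be checked from within the
process via move_pages(2) (a minimal sketch, assuming libnuma is
installed; the filename is arbitrary, build with "gcc check.c -lnuma"):

/* Query which NUMA node a few pages of a freshly filled buffer
 * landed on. With nodes == NULL, move_pages() moves nothing and
 * only reports each page's current node in status[]. */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        long pagesize = sysconf(_SC_PAGESIZE);
        size_t npages = 16;                     /* sample a few pages */
        char *buf = malloc(npages * pagesize);
        void *pages[16];
        int status[16];
        size_t i;

        if (buf == NULL)
                return 1;
        memset(buf, 1, npages * pagesize);      /* fault the pages in */
        for (i = 0; i < npages; i++)
                pages[i] = buf + i * pagesize;

        if (move_pages(0, npages, pages, NULL, status, 0) == 0)
                for (i = 0; i < npages; i++)
                        printf("page %zu is on node %d\n", i, status[i]);
        return 0;
}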


> > For parallel real-world apps I've seen a performance penalty of 30%
> > compared to older kernels!
>
> Compared to what older kernels? When did it start?

I've tested some kernel versions that I had lying around here...
working fine: 2.6.22.18-0.2-default (openSUSE) / 2.6.22.9 (kernel.org)
showing the described behaviour: 2.6.23.8; 2.6.24.4; 2.6.25.4; 2.6.26.5;
2.6.27


>
> -Andi
>
> --
> ak@xxxxxxxxxxxxxxx
>


--

Regards,
Oliver Weihe
