Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/webservers

From: Christoph Lameter
Date: Tue Sep 28 2010 - 08:35:25 EST


On Tue, 28 Sep 2010, Robert Mueller wrote:

> How would the ACPI information actually be changed?

Fix the BIOS SLIT distance tables.

> I ran numactl -H to get the hardware information, and that seems to
> include distances. As mentioned previously, this is a very standard
> Intel server motherboard.
>
> http://www.intel.com/Products/Server/Motherboards/S5520UR/S5520UR-specifications.htm
>
> Intel 5520 chipset with Intel I/O Controller Hub ICH10R
>
> $ numactl -H
> available: 2 nodes (0-1)
> node 0 cpus: 0 2 4 6 8 10 12 14
> node 0 size: 24517 MB
> node 0 free: 1523 MB
> node 1 cpus: 1 3 5 7 9 11 13 15
> node 1 size: 24576 MB
> node 1 free: 39 MB
> node distances:
> node 0 1
> 0: 10 21
> 1: 21 10

21 is larger than REMOTE_DISTANCE on x86 and therefore triggers
zone_reclaim; a distance of 19 would keep it off.
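
For reference, the current setting can be inspected and changed at runtime
via the vm.zone_reclaim_mode sysctl (a quick sketch, no distro specifics
assumed):

  # show whether zone reclaim is currently enabled (1) or off (0)
  cat /proc/sys/vm/zone_reclaim_mode

  # switch it off on the running system
  sysctl -w vm.zone_reclaim_mode=0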


> Since I'm not sure what the "distance" values mean, I have no idea if
> those values are large or not?

Distance values represent the additional latency of accessing remote memory
relative to local memory, which is normalized to 10. A distance of 21 thus
means a remote access is nominally about 2.1 times as expensive as a local
one.
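
The firmware-provided distances can also be read straight from sysfs, which
is handy for checking whether a BIOS/SLIT update actually took effect (node0
here is just an example):

  # distances from node 0 to each node, in SLIT units (local = 10)
  cat /sys/devices/system/node/node0/distance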

> > 4. Fix the application to be conscious of the effect of memory
> > allocations on a NUMA systems. Use the numa memory allocations API
> > to allocate anonymous memory locally for optimal access and set
> > interleave for the file backed pages.
>
> The problem we saw was purely with file caching. The application wasn't
> actually allocating much memory itself, but it was reading lots of files
> from disk (via mmap'ed memory mostly), and as most people would, we
> expected that data would be cached in memory to reduce future reads from
> disk. That was not happening.

Obviously, and you have stated that numerous times. The problem is that
using remote memory reduces read performance, so the OS (with
zone_reclaim_mode=1) defaults to using local memory and favors reclaiming
local memory over allocating from the remote node. This is fine if you have
multiple applications running on both nodes, because then each application
gets memory local to it and therefore runs faster. It does not work for a
single application that allocates only from one node.
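
One way to watch this happening is to look at the per-node allocation
counters while the workload runs (a sketch, assuming numastat from the
numactl package is available):

  # numa_hit rises while numa_miss/other_node stay near zero when
  # allocations are kept node local, which is what zone_reclaim_mode=1
  # enforces even while the local node is being reclaimed
  numastat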

A process's memory allocations across the various NUMA nodes can be
controlled via the numactl tool or the libnuma C API.

F.e.

numactl --interleave ... command

will address that issue for a specific command that needs to go beyond the
memory of a single node.
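
A concrete invocation could look like the following (the daemon name is
only a placeholder for whatever process serves the mmap'ed mail files):

  # interleave all of the process' allocations, including file-backed
  # pages faulted in on its behalf, round robin across the nodes
  numactl --interleave=all /usr/sbin/mail-daemon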