Re: ZONE_PADDING wastes 4 bytes of the new cacheline

From: Nick Piggin
Date: Thu Oct 21 2004 - 22:23:02 EST


Andrea Arcangeli wrote:
On Fri, Oct 22, 2004 at 10:34:13AM +1000, Nick Piggin wrote:

Andrea Arcangeli wrote:


looks reasonable. the only con is that this rejects on my tree ;), pages_*
and protection are gone in my tree, replaced by watermarks[] using the
very same optimal and proven algo of 2.4 (enabled by default of course).
I'll reevaluate the false sharing later on.


May I again ask what you think is wrong with ->protection[] apart from
it being turned off by default? (I don't think our previous conversation
ever reached a conclusion...)


the API is flawed, how can you set it up by default? if somebody tweaks
pages_min it gets screwed.


The setup code leaves a bit to be desired, but for the purpose of
kswapd and the page allocator they are fine.

plus it's worthless to have pages_min/low/high and protection[] when you
can combine the two things together. keeping pages_min/low/high and
protection side by side, when protection itself is calculated as a
function of pages_min/low/high, just creates confusion. I believe this
comment explains it well enough:


I don't agree, there are times when you need to know the bare pages_xxx
watermark, and times when you need to know the whole ->protection thing.

/*
* setup_per_zone_protection - called whenver min_free_kbytes or
* sysctl_lower_zone_protection changes. Ensures that each zone
* has a correct pages_protected value, so an adequate number of
* pages are left in the zone after a successful __alloc_pages().
*
* This algorithm is way confusing. I tries to keep the same behavior
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* as we had with the incremental min iterative algorithm.
*/

I have to agree with the comment: if I have to start doing math on top
of this algorithm, it's almost faster to replace it with the 2.4 one,
which has clear semantics.

Here is an example of the runtime behaviour of the very well understood
lowmem_reserve from 2.4; it's easier to give an example than to explain
it in words:

1G machine -> (16M dma, 800M-16M normal, 1G-800M high)
1G machine -> (16M dma, 784M normal, 224M high)

sysctl defaults are:

int sysctl_lower_zone_reserve_ratio[MAX_NR_ZONES-1] = { 256, 32 };

the results will be:

1) NORMAL allocation will leave 784M/256 of ram reserved in the ZONE_DMA

2) HIGHMEM allocation will leave 224M/32 of ram reserved in ZONE_NORMAL

3) HIGHMEM allocation will leave (224M+784M)/256 of ram reserved in ZONE_DMA

I invented this algorithm to scale on all machines and to make the
memory reservation not noticeable and only beneficial. With this API it
is trivial to tune and to understand what will happen; the default
values look good and they're proven by years of production in 2.4. Most
important: it's only a function of the sizes of the zones, the
pages_min levels have nothing to do with it.
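
The arithmetic in the example above can be sketched in a few lines of C.
This is only an illustration of the three results listed, not the 2.4
code: the function name compute_lowmem_reserve(), the reserve_kb[][]
table and the use of KiB units are assumptions made for the sketch. Only
the two default ratios { 256, 32 } and the zone sizes come from the
example.

```c
#include <assert.h>

#define NR_ZONES 3

enum { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM };

/* Default ratios from the example; index = lower zone being protected. */
static const unsigned long reserve_ratio[NR_ZONES - 1] = { 256, 32 };

/*
 * Hypothetical helper: fill reserve_kb[lower][alloc] with the amount of
 * memory (in KiB) left untouched in zone `lower` when an allocation that
 * targets zone `alloc` falls back to it. The reserve is the total size
 * of all zones above `lower`, up to and including `alloc`, divided by
 * the ratio of `lower`. It depends only on the zone sizes.
 */
void compute_lowmem_reserve(const unsigned long zone_kb[NR_ZONES],
                            unsigned long reserve_kb[NR_ZONES][NR_ZONES])
{
    for (int alloc = 0; alloc < NR_ZONES; alloc++) {
        unsigned long higher_kb = 0;

        /* A zone never reserves against allocations it cannot serve
         * as a fallback, nor against its own allocations. */
        for (int lower = 0; lower < NR_ZONES; lower++)
            reserve_kb[lower][alloc] = 0;

        /* Walk downwards, accumulating the size of the zones above. */
        for (int lower = alloc - 1; lower >= 0; lower--) {
            assert(reserve_ratio[lower] != 0);
            higher_kb += zone_kb[lower + 1];
            reserve_kb[lower][alloc] = higher_kb / reserve_ratio[lower];
        }
    }
}
```

For the 1G machine above (16M dma, 784M normal, 224M high) this gives
784M/256 = 3136K kept in ZONE_DMA against NORMAL allocations,
224M/32 = 7168K kept in ZONE_NORMAL against HIGHMEM allocations, and
(224M+784M)/256 = 4032K kept in ZONE_DMA against HIGHMEM allocations.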


OK, I don't disagree that your setup calculations are much nicer, and
the current ones are pretty broken...

I'm still unsure if the 2.6 lower_zone_protection completely mimics the
2.4 lowmem_zone_reserve algorithm if tuned by reversing the pages_min
settings accordingly. But I believe it's easier to drop it and replace
it with a clear, understandable API that also drops the pages_min levels
(which have no reason to exist anymore) than to leave it in its current
state and start reversing the pages_min algorithm to tune it from
userspace. The latter relies on the hope that nobody ever tweaks the
pages_min calculation in the kernel, since that would break userspace,
which would need kernel-internal knowledge to have a chance to tune
lowmem_protection from a rc.d script.


But please no wholesale replacement of the ->pages_xxx / ->protection
thing unless you really show it is needed (which I'm pretty sure it
isn't). alloc_pages is very nice right now ;)