Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

From: Linus Torvalds
Date: Fri Nov 04 2005 - 16:23:26 EST




On Fri, 4 Nov 2005, Andy Nelson wrote:
>
> I am not enough of a kernel level person or sysadmin to know for certain,
> but I have still big worries about consecutive jobs that run on the
> same resources, but want extremely different page behavior. If what
> you are suggesting can cause all previous history on those resources
> to be forgotten and then reset to whatever it is that I want when I
> start my run, then yes.

That would largely be the behaviour.

When you use the hugetlb zone for big pages, nothing else would be there.

And when you don't use it, we'd be able to use those zones for at least
page cache and user private pages - both of which are fairly easy to evict
if required.

So the downside is that when the admin requests such a zone at boot-time,
that will mean that the kernel will never be able to use it for its
"normal" allocations. Not for inodes, not for directory name caching, not
for page tables and not for process and file descriptors. Only a very
certain class of allocations that we know how to evict easily could use
them.

Now, for many loads, that's fine. User virtual pages and page cache pages
are often a big part (in fact, often a huge majority) of the memory use.

Not always, though. Some loads really want lots of metadata caching, and
if you make too much of memory be in the largepage zones, performance
would suffer badly on such loads.

But the point is that this is easy(ish) to do, and would likely work
wonderfully well for almost all loads. It does put a small onus on the
maintainer of the machine to give a hint, but it's possible that normal
loads won't mind the limitation and that we could even have a few hugepage
zones by default (limit things to 25% of total memory or something). In
fact, we would almost have to do so initially just to get better test
coverage.

Now, if you want _most_ of memory to be available for hugepages, you
really will always require a special boot option, and a friendly machine
maintainer. Limiting things like inodes, process descriptors etc to a
smallish percentage of memory would not be acceptable in general.

Something like 25% "big page zones" probably is fine even in normal use,
and 50% might be an acceptable compromise even for machines that see a
mixture of pretty regular use and some specialized use. But a machine that
only cares about certain loads might boot up with 75% set aside in the
large-page zones, and that almost certainly would _not_ be a good setup
for random other usage.

IOW, we want a hit up-front about how important huge pages would be.
Because it's practically impossible to free pages later, because they
_will_ become fragmented with stuff that we definitely do not want to
teach the VM how to handle.

But the hint can be pretty friendly. Especially if it's an option to just
load a lot of memory into the boxes, and none of the loads are expected to
want to really be excessively close to memory limits (ie you could just
buy an extra 16GB to allow for "slop").

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/