> 99.99% of the time we want a single page.
Yep, but...there are those occassions when more than one page is
needed. These are the hardest requests to handle (or prepare for).
I was suggesting 'priming' the free memory pool for these cases (without
introducing a large overhead - less restrictions on selecting a page to
reap == more chance of getting the page 'type' you need).
> > But they are not freed immediately. Instead, update the necessary pte's
> > (and anon structs), place the pages on a reclaim list, and tell the
> > paging out daemon to start.
> That is more or less what we have. Pages are kept in the page cache
> or buffer pool as long as possible even when unused. Kswapd keeps
> the actual free lists topped up to minimise overhead on allocation
> requests plus requests fall back to calling try_to_free_page
> themselves if necessary.
Think you might have missed my point (or I didn't make it v. well), or
I'm mis-understanding your reply. What you say above is correct. My point
was extending the 'swap-cache', to make it more general (into a
"relcaim-list").
The current swap-cache is used to prevent a thrash case on the swap
device after an anonymous page has been unmapped from an address space and
the process references it v. soon after. (Remember, Linux has asynchronous
writes to swap).
> > OK, I've simplified the above quite a bit. But do you get the idea?
> > If you think you're seen the above before, you're correct. It's what
> > some other UNIX OSes do, and for good reasons.
>
> Linux is not entirely dissimilar but it is optimised more for the
> "sufficient memory" case - there is no way to go from a page to
> the page tables it is used in.
Where the lack of physical-page to pte(s) mapping causes the most
problems (for me anyway) is in a "make -j zImage" on the kernel
source.
When Linux identifies a named page to be reaped, it _only_ removes it
from the address space under the process that it found the page. There
maybe other processes that still have a ref to the page. Hence, all the
searching work has been done for zero gain - OK, not totally zero. At
least when it stumbles over the other mappings for the page it will be
released.
I'm being a bit harsh here. It's not very often that a single text
page has many mappings, and needs to be unmapped. Plus the text page
usage patterns in my example probably means releasing such a page is not
good - it would soon be faulted in again.
Also note the line "if (page_map->count != 1) return 0;" in
try_to_swap_out(). This prevents the paging out of shared anonymous
pages. Only effects some pages, but I'd like to get rid of it (hence
the need for anon structs).
> > and improves the performance of swap (when reading
> > back, we read several pages back - locality of
> > reference on a none page granularity).
> Linux also does page ahead.
Unless I've missed something, there is no read-ahead on a swap device!
(No point as pages are not guaranteed any order on swap, as I was
proposing).
Clustering of swap pages, yes, but that's something different. Have a look
at scan_swap_map() for the clustering (been sometime since I looked at
this func, but I seem to remember the idea is to prevent a disk head from
moving all over a swap device writing single pages in the holes left by
fault-ins from the device).
> > iii) Identify pages that allow coalescing in the free
> > memory pool.
>
> I reclaim unused pages from the page cache when coalescing and it
> seems worthwhile. It needs work to mutex things so we can do it
> at interrupt time as well though.
Good. I though it would be messey (is it?). I was thinking about using
the age of pages in the cache when deciding what would coalesce with
pages in the free memory pool. Perhaps using a random selection policy
for a page is good enough (is this what you're doing?). By random I
mean pseudo-random, ie. the first page we find, from the previous
starting point, which allows a coalesce.
> > See my slab allocator on http://www.nextd.demon.co.uk/
> Yes, I've been looking at it. There are some interesting ideas. But it
> is relatively slow by my current standards :-). PCs have all too
> little L1 cache, all but the oldest processors have enough pipelining
> to be able to init a structure pretty quickly and conditional branches
> are a killer - even with the Cyrix branch prediction. It may be
> possible to optimise slab though...
Slow? Slow! Maybe in the v. _worst_ case. The common, critical, path is
short, with few branches (haven't checked the assembly - I should do! -
just relied on how I believe gcc produces code. eg. "if() goto ???" when
the 'if' is rarely taken, and "if () {common-case}" when the 'if' will
normally be taken. But I should check what it produces...).
If some of the handlers to allow different types of packing are removed,
then the paths can be made even short. A compromise is always needed;
using memory efficiently, against the instruction count. Yes, it needs a
bit of tuning (already being done), but not much.
As for the L1 cache? A structure is usually referenced many times before
it is freed, so helping whatever L1 cache is available is well worthwhile.
(Note, my critical paths have no extra instructions to allow for cache
alignment). Init-ing a struct inline may be quick, but it will
still place instructions into both types of caches (instruction and
data). If a constructed state can be maintained for a structure (and
hence, some code out of the 'hot' path), then it should be.
> (It still does the silly power-of-two sizing though, doesn't it?)
Most definitely not, walk the code (or ask me questions)!
In my patch the 'general purpose caches' are 2^n because I was rushed in
getting it out and couldn't be bothered to tune the sizes (not a good
recommendation on my part...). The only power of twos are; the num of
pages per slab (due to get_free_page()), bytes per page (h/w constraint),
and the size of cache lines (because that's the way natural intended them).
BTW, lets not knock power-of-two. They are nice numbers (often your're
best friend when efficiency is concerned). Be carefully of using
Fibonacci sequences, or some other sets, for allocator sizes. Their
internal fragmentation tends to become high over time (not good for an
OS). None-coalescing, splitting allocators tend to show the same
behaviour. If you know the usage patterns, it is possible to design
efficient allocators for specific applications. Difficult to do in a
kernel.
Regards,
markhe