Re: [pagevec] resize pagevec to O(lg(NR_CPUS))

From: William Lee Irwin III
Date: Mon Sep 13 2004 - 21:38:06 EST


William Lee Irwin III <wli@xxxxxxxxxxxxxx> wrote:
>> A large stream of faults to map in a file will blow L1 caches of the
>> sizes you've mentioned at every kernel/user context switch. 256 distinct
>> cachelines will very easily be referenced between faults. MAP_POPULATE
>> and mlock() don't implement batching for either ->page_table_lock or
>> ->tree_lock, so the pagevec point is moot in pagetable instantiation
>> codepaths (though it probably shouldn't be).

On Sun, Sep 12, 2004 at 12:42:56AM -0700, Andrew Morton wrote:
> Instantiation via normal fault-in becomes lock-intensive once you have
> enough CPUs. At low CPU count the page zeroing probably preponderates.

But that's mm->page_table_lock, for which pagevecs aren't used, and
faulting doesn't present a batch of work to amortize anyway, so other
methods are needed there, e.g. more fine-grained pte locking or lockless
pagetable radix trees. I'd favor going fully lockless, but I haven't put
any code down for it myself.
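
To illustrate the finer-grained pte locking idea, here's a minimal
userspace model: one lock per pagetable page instead of one mm-wide
page_table_lock. This is a sketch only; the names (pte_page,
PTRS_PER_PTE) merely mimic kernel conventions, a mutex stands in for a
spinlock, and none of it is actual kernel code.

/* Sketch: one lock per pagetable page rather than one per mm.
 * Illustrative userspace model, not kernel code. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define PTRS_PER_PTE 512		/* one 4K pagetable page of 8-byte ptes */

struct pte_page {
	pthread_mutex_t lock;		/* per-pagetable-page lock */
	uint64_t pte[PTRS_PER_PTE];
};

/* Faults touching different pagetable pages no longer serialize on a
 * single mm-wide lock; contention is confined to one pte page. */
static void set_pte(struct pte_page *pt, unsigned idx, uint64_t val)
{
	pthread_mutex_lock(&pt->lock);
	pt->pte[idx] = val;
	pthread_mutex_unlock(&pt->lock);
}

int main(void)
{
	struct pte_page pt = { .lock = PTHREAD_MUTEX_INITIALIZER };

	set_pte(&pt, 0, 0x1000 | 1);	/* pfn | present, say */
	printf("pte[0] = %#llx\n", (unsigned long long)pt.pte[0]);
	return 0;
}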


William Lee Irwin III <wli@xxxxxxxxxxxxxx> wrote:
>> O_DIRECT writes and msync(..., ..., MS_SYNC) will use pagevecs on
>> ->tree_lock in a rapid-fire process-triggerable manner. Almost all
>> uses of pagevecs for ->lru_lock outside the scanner that I'm aware
>> of are not rapid-fire in nature (though there probably should be some).

On Sun, Sep 12, 2004 at 12:42:56AM -0700, Andrew Morton wrote:
> pagetable teardown (munmap, mremap, exit) is the place where pagevecs help
> ->lru_lock. And truncate.

truncate is rare enough that I didn't bother, but it's probably hurt
the worst. I'd expect pte teardown to affect mostly anonymous pages,
since file-backed pages keep a reference from the mapping holding them
until either the mapping is shot down or they're evicted via page
replacement.
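
For reference, here is the batching pattern pagevecs implement, as a
self-contained userspace sketch. The struct and helpers are loosely
modeled on 2.6-era include/linux/pagevec.h, with a mutex standing in
for ->lru_lock; the exact layout and sizes are illustrative, not the
kernel's definitions.

#include <stdio.h>
#include <pthread.h>

#define PAGEVEC_SIZE 16			/* the 2.6-era default batch size */

struct page { int id; };		/* stand-in for struct page */

struct pagevec {
	unsigned nr;
	struct page *pages[PAGEVEC_SIZE];
};

static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns the slots remaining, so callers know when to drain. */
static unsigned pagevec_add(struct pagevec *pvec, struct page *page)
{
	pvec->pages[pvec->nr++] = page;
	return PAGEVEC_SIZE - pvec->nr;
}

/* One lock round trip covers the whole batch. */
static void pagevec_drain(struct pagevec *pvec)
{
	unsigned i;

	pthread_mutex_lock(&lru_lock);
	for (i = 0; i < pvec->nr; i++)
		printf("releasing page %d under one lock hold\n",
		       pvec->pages[i]->id);
	pthread_mutex_unlock(&lru_lock);
	pvec->nr = 0;
}

int main(void)
{
	struct page pages[40];
	struct pagevec pvec = { .nr = 0 };
	int i;

	for (i = 0; i < 40; i++) {
		pages[i].id = i;
		if (!pagevec_add(&pvec, &pages[i]))
			pagevec_drain(&pvec);	/* 3 acquisitions, not 40 */
	}
	pagevec_drain(&pvec);			/* flush the partial tail */
	return 0;
}

Teardown of N pages then costs roughly N/PAGEVEC_SIZE trips through
the contended lock instead of N, which is where the batching pays off
on the ->lru_lock paths above.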


William Lee Irwin III <wli@xxxxxxxxxxxxxx> wrote:
>> IMHO pagevecs are somewhat underutilized.

On Sun, Sep 12, 2004 at 12:42:56AM -0700, Andrew Morton wrote:
> Possibly. I wouldn't bother converting anything unless a profiler tells
> you to though.

mlock() is the case I have in hand, though I've only heard of it being
problematic on vendor kernels. MAP_POPULATE has seen little userspace
uptake thus far, so I've not heard anything about it, good or bad.
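
For what it's worth, exercising MAP_POPULATE from userspace is a
one-flag change to mmap(). A minimal example follows, using MAP_SHARED,
which is where the flag applied on kernels of this vintage; the file
path is arbitrary.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/etc/passwd";
	struct stat st;
	char *p;
	int fd = open(path, O_RDONLY);

	if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0) {
		perror(path);
		return 1;
	}

	/* MAP_POPULATE asks the kernel to instantiate the ptes now,
	 * in one batch, rather than one minor fault at a time. */
	p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED | MAP_POPULATE,
		 fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	printf("first byte: %c\n", p[0]);
	munmap(p, st.st_size);
	close(fd);
	return 0;
}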


William Lee Irwin III <wli@xxxxxxxxxxxxxx> wrote:
>> Sorry, 4*lg(NR_CPUS) is 64 when lg(NR_CPUS) = 16, or 65536 cpus. 512x
>> Altixen would have 4*lg(512) = 4*9 = 36. The 4*lg(NR_CPUS) sizing was
>> rather conservative on behalf of users of stack-allocated pagevecs.

On Sun, Sep 12, 2004 at 12:42:56AM -0700, Andrew Morton wrote:
> It's pretty simple to diddle PAGEVEC_SIZE, run a few benchmarks. If that
> makes no difference then the discussion is moot. If it makes a significant
> difference then more investigation is warranted.

Larger pagevecs only really help when either the lock transfer time is
high, as on NUMA architectures with high remote-access latencies to
sufficiently distant nodes, or the arrival rate is high, as in unusual
workloads; the common cases (and even some of the unusual ones) have
already been addressed. Marcelo's original bits about cache alignment
are likely more pressing. My real hope was to take a guess at the
pagevec size based on NR_CPUS, align the structure upward to a
cacheline boundary, and subtract the size of the header information,
but since I found diminishing returns, and there are cache size
boundaries to be concerned about, I'd favor advancing Marcelo's cache
alignment work before exploring the benefits of larger batches.
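
To make that sizing heuristic concrete, here's a standalone sketch of
the arithmetic just described: start from 4*lg(NR_CPUS) slots, round
the whole struct up to a cacheline boundary, then hand the slack back
to the array. NR_CPUS, L1_CACHE_BYTES, and the header layout are
illustrative values, not kernel constants.

#include <stdio.h>

#define NR_CPUS		512
#define L1_CACHE_BYTES	128			/* e.g. a big NUMA box */

static unsigned lg(unsigned n)			/* floor(log2(n)) */
{
	unsigned r = 0;

	while (n >>= 1)
		r++;
	return r;
}

int main(void)
{
	unsigned header = 2 * sizeof(unsigned);	/* nr + cold fields, say */
	unsigned want   = 4 * lg(NR_CPUS);	/* 4*lg(512) = 36 */
	unsigned bytes  = header + want * sizeof(void *);
	/* round the struct up to a cacheline boundary... */
	unsigned rounded = (bytes + L1_CACHE_BYTES - 1) & ~(L1_CACHE_BYTES - 1);
	/* ...and convert the slack back into pages[] slots */
	unsigned slots = (rounded - header) / sizeof(void *);

	printf("4*lg(%d) = %u -> %u slots after cacheline rounding\n",
	       NR_CPUS, want, slots);
	return 0;
}

On a 64-bit machine with the values above that works out to 36 slots
requested and 47 after rounding, which shows how quickly the cacheline
geometry, not the CPU count, comes to dominate the choice.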


-- wli