Re: [PATCH] 2.6.0-test4 -- add context switch counters

From: William Lee Irwin III
Date: Wed Aug 27 2003 - 12:57:56 EST


On Wed, Aug 27, 2003 at 09:09:39AM -0700, Larry McVoy wrote:
> This is the classic response that I get whenever I raise this sort of
> concern. I got it at Sun, I got it at SGI. Everyone says "my change
> made no difference". And they are right from one point of view: you
> run some micro benchmark and you can't see any difference.
> Of course you can't see any difference, in the microbenchmark everything
> is in the cache. But you did increase the amount of cache usage.
> Consider a real world case where the application and the kernel now
> just exactly fit in the caches for the critical loop. Adding one
> extra cache line will hurt that application but would never be seen in
> a microbenchmark.

I used a macrobenchmark for this measurement with instruction-level
profiling on cache misses, TLB misses, and cpu cycles.

An unusual result of this was that with respect to cpu cycles, the
most costly operation in the entire kernel after bitblitting userspace
memory was rounding the stack pointer to find current_thread_info();
that is, it was #3, behind only copy_to_user_ll()/copy_from_user_ll().
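
For readers unfamiliar with the operation: on i386, thread_info sits at
the base of the kernel stack, so current_thread_info() amounts to masking
the low bits off the stack pointer. A minimal userspace sketch, assuming
8 KB power-of-two-aligned stacks (THREAD_SIZE_SKETCH and
thread_info_base() are illustrative names, not the kernel's code):

```c
#include <stdint.h>

/* Assumption: 8 KB kernel stacks aligned to their own size. */
#define THREAD_SIZE_SKETCH ((uintptr_t)8192)

/*
 * Round a stack pointer down to the base of its stack, where
 * (by assumption) the thread_info structure lives.
 */
static uintptr_t thread_info_base(uintptr_t sp)
{
    return sp & ~(THREAD_SIZE_SKETCH - 1);
}
```

The arithmetic is a single AND, so a high cycle count attributed to it
in a profile is surprising on its face.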


On Wed, Aug 27, 2003 at 09:09:39AM -0700, Larry McVoy wrote:
> The only way to really measure this is with real work loads and a cache
> miss counter. And even that won't always show up because if the work load
> you choose happened to only use 1/2 of the data cache (for instance) you
> need to add enough more than 1/2 of the cache lines to show up in the
> results.

I already used the cache miss counter. Seeing mm->rss take numerous
cache misses in the loop of copy_page_range() (in mainline!) seemed
unusual. A vaguely plausible explanation (guesswork is required without
an ITP/ICE or a sufficiently useful simulator) is that the pagetable
bitblitting evicted it from the cache, despite my _very_ intense
efforts to reduce the amount of pagetable bitblitting via cacheing.
An alternative explanation is that the off-node access to slab memory
took such large remote access penalties when it did have cache misses
that even a low miss rate elevated it to the top of the profile.


On Wed, Aug 27, 2003 at 09:09:39AM -0700, Larry McVoy wrote:
> Think of it this way: we can add N extra cache lines and see no
> difference. Then we add the Nth+1 and all of a sudden things get slow.
> Is that the fault of the Nth+1 guy? Nope. It's the fault of all N,
> the Nth+1 guy just had bad timing, he should have gotten his change
> in earlier.
> I realize that I'm being extreme here but if I can get this point across
> that's a good thing. I'm convinced that it was a lack of understanding
> of this point that led to the bloated commercial operating systems.
> Linux needs to stay fast. Processors have cycle times of a third of a
> nanosecond yet memory is still ~130ns away.
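
The threshold effect described in the quote can be put in rough numbers.
The sketch below is a deliberately crude model, assuming the quoted
figures (about 3 cycles per nanosecond, about 130 ns to memory) and
ignoring associativity, replacement policy, and prefetching; the function
names are hypothetical:

```c
/* Assumptions taken from the figures quoted above. */
#define CYCLES_PER_NS   3UL     /* cycle time of a third of a nanosecond */
#define MEM_LATENCY_NS  130UL   /* approximate distance to memory */

/* Lines of the working set that cannot all stay resident at once. */
static unsigned long lines_over_capacity(unsigned long working_set_lines,
                                         unsigned long cache_lines)
{
    return working_set_lines > cache_lines
        ? working_set_lines - cache_lines : 0;
}

/* Rough lower bound on stall cycles for one pass over the working set. */
static unsigned long stall_cycles_per_pass(unsigned long working_set_lines,
                                           unsigned long cache_lines)
{
    return lines_over_capacity(working_set_lines, cache_lines)
        * MEM_LATENCY_NS * CYCLES_PER_NS;
}
```

Under this model a working set that exactly fills the cache costs
nothing extra, while one additional line costs at least 390 cycles on
every pass through the loop.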

This is not lost on me (and I'm in fact pushing other cache preservation
code very hard; cf. the pagetable caching discussions and the
soon-to-be-sent bottom-level pagetable caching code in -wli). The fact of the
matter is that we lose a cacheline at a time, and if we've already lost
one to mm->rss, we should utilize the rest of it for whatever other
counters are prudent instead of wasting the rest of it.
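
The packing argument can be illustrated with a layout check. This is a
sketch only: the struct below is hypothetical (it is not the kernel's
mm_struct), the field names other than rss are placeholders, and the
64-byte line size is an assumption that varies by CPU:

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64  /* assumption; varies by CPU */

/* Hypothetical layout: counters placed alongside rss on one line. */
struct mm_counters_sketch {
    _Alignas(CACHE_LINE_BYTES) unsigned long rss;
    unsigned long nvcsw;     /* voluntary context switches */
    unsigned long nivcsw;    /* involuntary context switches */
    unsigned long nswaps;
    unsigned long nsignals;
};

/*
 * Nonzero if two byte offsets into a line-aligned structure fall
 * on the same cache line.
 */
static int same_cache_line(size_t a, size_t b)
{
    return a / CACHE_LINE_BYTES == b / CACHE_LINE_BYTES;
}
```

With this layout, a miss taken on rss brings the other four counters
into the cache along with it, so they add no further lines to the
footprint.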

A number of these counters are very infrequently updated; IMHO such
things as nswaps (whenever we get load control, which we seem to be
getting various complaints about lacking) and signal counts are updated
rarely enough that their effects can be ignored.


-- wli