Re: [PATCH 4/3] Replace dynamic percpu implementation

From: Ravikiran G Thirumalai (kiran@in.ibm.com)
Date: Thu May 22 2003 - 03:14:23 EST

Next message: Grover, Andrew: "RE: must-fix list, v5"
Previous message: David S. Miller: "Re: [CHECKER] 12 potential leaks in kernel 2.5.69"
In reply to: Dipankar Sarma: "Re: [PATCH 4/3] Replace dynamic percpu implementation"
Next in thread: Rusty Russell: "Re: [PATCH 4/3] Replace dynamic percpu implementation"
Reply: Rusty Russell: "Re: [PATCH 4/3] Replace dynamic percpu implementation"

On Wed, May 21, 2003 at 04:01:56PM +0530, Dipankar Sarma wrote:
>...
> We will do some measurements with this but based on a large number
> of measurements that Kiran had done earlier, we can see a couple of things -
>
> 1. Even though a percpu scheme using pointer arithmetic has one less memory
> reference, the globally shared offset table is often in the cache
> and therefore pointer arithmetic offers no added advantage.
>
> 2. Increased sharing of cachelines helps by reducing associativity misses.
> We see this by comparing an interlaced allocator where only same
> sized objects share blocks vs. the current static allocator. Sharing of
> blocks by differently sized objects also allows cache lines to be
> kept warm as more subsystems in the kernel access them.
>

Here is the summary of my experiments with different per-cpu allocator methods.

The following methods were compared:
1. Static per-cpu areas
2. kmalloc_percpu with NR_CPUS pointers and one extra dereference -- the
   current implementation (no interlace) (kmalloc_percpu_current)
3. kmalloc_percpu with pointer arithmetic, but no interlace
   (kmalloc_percpu_new)
4. alloc_percpu using Rusty's block allocator and the shared offset table
   (alloc_percpu_block)
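
To make the difference between methods 2 and 3 concrete, here is a minimal
user-space sketch. It is not the kernel code from either patch: the struct
and function names are made up for illustration, NR_CPUS is fixed at 4 to
match the test box, and calloc stands in for the kernel allocators. Scheme A
keeps one pointer per CPU and pays an extra load on each access; scheme B
computes the address from a base and a fixed stride.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_CPUS 4	/* assumption: 4-way box, as in the test setup */

/* Scheme A (hypothetical names): NR_CPUS pointers, one extra dereference. */
struct percpu_ptrs {
	void *ptrs[NR_CPUS];
};

static inline int percpu_ptrs_alloc(struct percpu_ptrs *p, size_t size)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		p->ptrs[cpu] = calloc(1, size);
		if (!p->ptrs[cpu]) {
			/* Undo the copies allocated so far. */
			while (cpu--)
				free(p->ptrs[cpu]);
			return -1;
		}
	}
	return 0;
}

static inline void *percpu_ptrs_get(struct percpu_ptrs *p, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return p->ptrs[cpu];	/* load the pointer, then use it */
}

/* Scheme B (hypothetical names): one area, address = base + cpu * stride. */
struct percpu_area {
	char *base;
	size_t stride;
};

static inline int percpu_area_alloc(struct percpu_area *a, size_t size)
{
	a->stride = size;	/* a real allocator would pad to a cacheline */
	a->base = calloc(NR_CPUS, size);
	return a->base ? 0 : -1;
}

static inline void *percpu_area_get(struct percpu_area *a, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return a->base + (size_t)cpu * a->stride;	/* arithmetic only */
}
```

Neither scheme here is interlaced: each object gets its own storage, so
unrelated per-cpu objects do not end up sharing cachelines.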

The application was speeding up vm_enough_memory using per-cpu counters
and reducing atomic operations. The benchmark was kernbench. Profile
ticks on vm_enough_memory were used to compare the allocator methods
(vm_acct_memory was made inline). This was on a 4-processor PIII Xeon.
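
For readers who have not seen the counter trick, the general idea can be
sketched in user space as below. This is an illustration under assumptions,
not the code that was benchmarked: the names acct_memory, committed_local
and committed_total are hypothetical, the threshold of 16 is picked
arbitrarily, and the CPU number is passed in by hand instead of coming from
the scheduler. Each CPU accumulates into its own counter and touches the
shared atomic only when the local delta grows past the threshold.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_CPUS		4
#define ACCT_THRESHOLD	16	/* hypothetical batch size */

static atomic_long committed_total;		/* shared, updated rarely */
static long committed_local[NR_CPUS];		/* one private delta per CPU */
static long atomic_ops;	/* instrumentation for single-threaded demos only */

/* Account 'pages' on behalf of 'cpu'; fold into the total when large. */
static inline void acct_memory(int cpu, long pages)
{
	long *local;

	assert(cpu >= 0 && cpu < NR_CPUS);
	local = &committed_local[cpu];
	*local += pages;
	if (*local > ACCT_THRESHOLD || *local < -ACCT_THRESHOLD) {
		atomic_fetch_add(&committed_total, *local);
		*local = 0;
		atomic_ops++;
	}
}

/* Approximate total: deltas still held in per-cpu counters are not seen. */
static inline long committed_estimate(void)
{
	return atomic_load(&committed_total);
}
```

With this threshold, 100 single-page calls on one CPU touch the shared
counter 5 times instead of 100, at the cost of the global value lagging by
up to ACCT_THRESHOLD pages per CPU.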

To summarise:
1. Static per-cpu areas were 6.5% better than kmalloc_percpu_current.
2. kmalloc_percpu_new and static per-cpu areas had similar results.
3. alloc_percpu (method 4) results were similar to static per-cpu areas and
   kmalloc_percpu_new.
4. Extra dereferences in alloc_percpu were not significant, but alloc_percpu
   was interlaced and kmalloc_percpu_new wasn't. The insn profile seemed to
   indicate that the extra cost of memory dereferencing in alloc_percpu was
   offset by the interlacing (objects sharing the same cacheline).
   But then insn profiles are only indicative, not accurate.
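
What "interlaced" means in point 4 can be sketched as follows. This is a toy
user-space bump allocator, not Rusty's block allocator: all names, the 4096
byte block size and the 32 byte cacheline (typical for PIII-class parts) are
assumptions made for the example. Every allocation returns one offset that
is valid in each CPU's block, so small objects from unrelated subsystems end
up next to each other in the same per-cpu cacheline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define NR_CPUS		4
#define BLOCK_SIZE	4096	/* hypothetical per-cpu block size */
#define CACHE_LINE	32	/* assumption: PIII-class line size */

struct percpu_block {
	char *base;	/* NR_CPUS consecutive blocks, one per CPU */
	size_t used;	/* bump pointer, shared by all CPUs' blocks */
};

static inline int block_init(struct percpu_block *b)
{
	b->used = 0;
	b->base = aligned_alloc(BLOCK_SIZE, (size_t)NR_CPUS * BLOCK_SIZE);
	if (!b->base)
		return -1;
	memset(b->base, 0, (size_t)NR_CPUS * BLOCK_SIZE);
	return 0;
}

/* Return an offset usable on every CPU, or (size_t)-1 if the block is full.
 * 'align' must be a power of two. */
static inline size_t block_alloc(struct percpu_block *b, size_t size,
				 size_t align)
{
	size_t off = (b->used + align - 1) & ~(align - 1);

	if (off + size > BLOCK_SIZE)
		return (size_t)-1;
	b->used = off + size;
	return off;
}

/* Address of this CPU's copy: block base plus the shared offset. */
static inline void *block_ptr(struct percpu_block *b, size_t off, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return b->base + (size_t)cpu * BLOCK_SIZE + off;
}

/* Index of the cacheline (within a block) that an offset falls in. */
static inline size_t line_of(size_t off)
{
	return off / CACHE_LINE;
}
```

An 8 byte counter followed by a 4 byte counter get offsets 0 and 8 here, so
on any given CPU both live in line 0 of that CPU's block, while the copies
for different CPUs stay BLOCK_SIZE bytes apart.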

todo:
I have to see how an interlaced kmalloc_percpu with pointer arithmetic
fares in these tests (once I have it working), and then the performance side
of the percpu allocators will hopefully be clear.

Thanks,
Kiran

This archive was generated by hypermail 2b29 : Fri May 23 2003 - 22:00:48 EST