Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1

From: Joonsoo Kim
Date: Mon Dec 15 2014 - 02:55:29 EST


On Wed, Dec 10, 2014 at 10:30:17AM -0600, Christoph Lameter wrote:
> We had to insert a preempt enable/disable in the fastpath a while ago. This
> was mainly due to a lot of state that is kept to be allocating from the per
> cpu freelist. In particular the page field is not covered by
> this_cpu_cmpxchg used in the fastpath to do the necessary atomic state
> change for fast path allocation and freeing.
>
> This patch removes the need for the page field to describe the state of the
> per cpu list. The freelist pointer can be used to determine the page struct
> address if necessary.
>
> However, currently this does not work for the termination value of a list
> which is NULL and the same for all slab pages. If we use a valid pointer
> into the page as well as set the last bit then all freelist pointers can
> always be used to determine the address of the page struct and we will not
> need the page field anymore in the per cpu are for a slab. Testing for the
> end of the list is a test if the first bit is set.
>
> So the first patch changes the termination pointer for freelists to do just
> that. The second removes the page field and then third can then remove the
> preempt enable/disable.
>
> Removing the ->page field reduces the cache footprint of the fastpath so hopefully overall
> allocator effectiveness will increase further. Also RT uses full preemption which means
> that currently pretty expensive code has to be inserted into the fastpath. This approach
> allows the removal of that code and a corresponding performance increase.
>
> For V1 a number of changes were made to avoid the overhead of virt_to_page
> and page_address from the RFC.
>
> Slab Benchmarks on a kernel with CONFIG_PREEMPT show an improvement of
> 20%-50% of fastpath latency:
>
> Before:
>
> Single thread testing
> 1. Kmalloc: Repeatedly allocate then free test
> 10000 times kmalloc(8) -> 68 cycles kfree -> 107 cycles
> 10000 times kmalloc(16) -> 69 cycles kfree -> 108 cycles
> 10000 times kmalloc(32) -> 78 cycles kfree -> 112 cycles
> 10000 times kmalloc(64) -> 97 cycles kfree -> 112 cycles
> 10000 times kmalloc(128) -> 111 cycles kfree -> 119 cycles
> 10000 times kmalloc(256) -> 114 cycles kfree -> 139 cycles
> 10000 times kmalloc(512) -> 110 cycles kfree -> 142 cycles
> 10000 times kmalloc(1024) -> 114 cycles kfree -> 156 cycles
> 10000 times kmalloc(2048) -> 155 cycles kfree -> 174 cycles
> 10000 times kmalloc(4096) -> 203 cycles kfree -> 209 cycles
> 10000 times kmalloc(8192) -> 361 cycles kfree -> 265 cycles
> 10000 times kmalloc(16384) -> 597 cycles kfree -> 286 cycles
>
> 2. Kmalloc: alloc/free test
> 10000 times kmalloc(8)/kfree -> 114 cycles
> 10000 times kmalloc(16)/kfree -> 115 cycles
> 10000 times kmalloc(32)/kfree -> 117 cycles
> 10000 times kmalloc(64)/kfree -> 115 cycles
> 10000 times kmalloc(128)/kfree -> 111 cycles
> 10000 times kmalloc(256)/kfree -> 116 cycles
> 10000 times kmalloc(512)/kfree -> 110 cycles
> 10000 times kmalloc(1024)/kfree -> 114 cycles
> 10000 times kmalloc(2048)/kfree -> 110 cycles
> 10000 times kmalloc(4096)/kfree -> 107 cycles
> 10000 times kmalloc(8192)/kfree -> 108 cycles
> 10000 times kmalloc(16384)/kfree -> 706 cycles
>
>
> After:
>
>
> Single thread testing
> 1. Kmalloc: Repeatedly allocate then free test
> 10000 times kmalloc(8) -> 41 cycles kfree -> 81 cycles
> 10000 times kmalloc(16) -> 47 cycles kfree -> 88 cycles
> 10000 times kmalloc(32) -> 48 cycles kfree -> 93 cycles
> 10000 times kmalloc(64) -> 58 cycles kfree -> 89 cycles
> 10000 times kmalloc(128) -> 84 cycles kfree -> 104 cycles
> 10000 times kmalloc(256) -> 92 cycles kfree -> 125 cycles
> 10000 times kmalloc(512) -> 86 cycles kfree -> 129 cycles
> 10000 times kmalloc(1024) -> 88 cycles kfree -> 125 cycles
> 10000 times kmalloc(2048) -> 120 cycles kfree -> 159 cycles
> 10000 times kmalloc(4096) -> 176 cycles kfree -> 183 cycles
> 10000 times kmalloc(8192) -> 294 cycles kfree -> 233 cycles
> 10000 times kmalloc(16384) -> 585 cycles kfree -> 291 cycles
>
> 2. Kmalloc: alloc/free test
> 10000 times kmalloc(8)/kfree -> 100 cycles
> 10000 times kmalloc(16)/kfree -> 108 cycles
> 10000 times kmalloc(32)/kfree -> 101 cycles
> 10000 times kmalloc(64)/kfree -> 109 cycles
> 10000 times kmalloc(128)/kfree -> 125 cycles
> 10000 times kmalloc(256)/kfree -> 60 cycles
> 10000 times kmalloc(512)/kfree -> 60 cycles
> 10000 times kmalloc(1024)/kfree -> 67 cycles
> 10000 times kmalloc(2048)/kfree -> 60 cycles
> 10000 times kmalloc(4096)/kfree -> 65 cycles
> 10000 times kmalloc(8192)/kfree -> 60 cycles

Hello, Christoph.

I don't review in detail, but, at a glance, overall patchset looks good.
But, above result looks odd. Improvement is beyond what we can expect.
Do you have any idea why allocating object more than 256 bytes is so
fast?

Thanks.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/