Re: [PATCH] Fast csum_partial_copy_generic and more

From: kumon@flab.fujitsu.co.jp
Date: Sun May 21 2000 - 20:50:43 EST

Next message: Tigran Aivazian: "Re: A few questions."
Previous message: Tigran Aivazian: "Re: Kernel BUG in loopback fs in -pre8"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

ying@almaden.ibm.com writes:
> In your profiling, do you know if the cache misses are caused by false
> sharing among CPUs or by small cache size, since both cases would cause
> cache misses? One of your previous emails seems to suggest that the SRC
> misses tend to be caused by the latter, since you saw a lot of I's, but
> the DST misses could be both, since you saw both I's and S's. But I'm
> not exactly sure. So, I'd like to confirm with you.

Our Xeon has a 2MB L2 cache, so I expect it is large enough to hold the
data. But no direct measurement of the SRC area has been made, so some
data is still needed to confirm this.

DST misses are rather easy to explain. The DST area is not anchored to
a specific CPU or NIC; it is dynamically allocated by kmalloc and
freed by kfree on demand. These functions do not take cache locality
across CPUs into account.

It is also known that most kmalloc/kfree calls come from alloc_skb,
and that the most frequently used size is the 2048B block. This is
consistent with the fact that csum..() is frequently called with a
length of 1460B.
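
The arithmetic behind that consistency can be sketched as follows. This
is only an illustration under assumptions that are not stated in this
mail: a standard 1500B Ethernet MTU, IPv4 and TCP headers without
options, and an allocator whose size classes are powers of two.
size_class(), tcp_payload() and frame_len() are hypothetical helpers,
not kernel functions, and the extra bookkeeping space that alloc_skb
may add is not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes for TCP over IPv4 on Ethernet, no header options. */
enum {
    ETH_MTU     = 1500, /* assumed link MTU */
    ETH_HDR_LEN = 14,   /* Ethernet header */
    IP_HDR_LEN  = 20,   /* IPv4 header without options */
    TCP_HDR_LEN = 20    /* TCP header without options */
};

/* Hypothetical helper: round a request up to the next power-of-two
 * size class, starting from 32 bytes. */
size_t size_class(size_t request)
{
    size_t cls = 32;

    assert(request > 0);
    while (cls < request)
        cls <<= 1;
    return cls;
}

/* Largest TCP payload that fits in one frame under the assumptions. */
size_t tcp_payload(void)
{
    return ETH_MTU - IP_HDR_LEN - TCP_HDR_LEN;
}

/* Bytes needed to hold one full frame including the Ethernet header. */
size_t frame_len(void)
{
    return ETH_HDR_LEN + ETH_MTU;
}
```

Under these assumptions tcp_payload() gives 1460 and a full frame needs
1514 bytes, which is more than 1024, so size_class(frame_len()) lands in
the 2048B class.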

Combining this circumstantial evidence, I conclude that the data is
rotated among CPUs. But no experiment that directly confirms this has
been done yet.
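
To make the suspected effect concrete, here is a toy model. It is not a
measurement and it says nothing certain about the real kmalloc: it
assumes a single shared LIFO free list and CPUs that take strict turns
doing one allocate/write/free cycle each. struct pool, pool_init() and
count_migrations() are made-up names for this sketch.

```c
#include <assert.h>

#define NBUF 8

/* Toy allocator state: one shared LIFO free list (an assumption). */
struct pool {
    int free_stack[NBUF]; /* indices of the free buffers */
    int top;              /* number of free buffers */
    int last_cpu[NBUF];   /* CPU that last wrote each buffer, -1 if none */
};

void pool_init(struct pool *p)
{
    for (int i = 0; i < NBUF; i++) {
        p->free_stack[i] = i;
        p->last_cpu[i] = -1;
    }
    p->top = NBUF;
}

/* Count the allocations that hand a CPU a buffer which was last
 * written by a different CPU. CPUs take turns in round-robin order. */
int count_migrations(int ncpu, int rounds)
{
    struct pool p;
    int migrations = 0;

    assert(ncpu > 0 && rounds >= 0);
    pool_init(&p);
    for (int r = 0; r < rounds; r++) {
        int cpu = r % ncpu;
        int buf = p.free_stack[--p.top];   /* allocate: pop */

        if (p.last_cpu[buf] != -1 && p.last_cpu[buf] != cpu)
            migrations++;
        p.last_cpu[buf] = cpu;             /* this CPU writes the buffer */
        p.free_stack[p.top++] = buf;       /* free: push */
    }
    return migrations;
}
```

In this model one CPU never sees a buffer written by another CPU, while
with 4 CPUs every allocation after the first one does. Real traffic is
far less regular, so this only shows the direction of the effect.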

> It seems that if the reason for misses were the former (wrong CPU
> sharing), some sort of process scheduling change may be helpful, but I
> really don't know how. If it were the latter (small cache size), it
> probably means that the code path in between send/write and
> csum_partial_copy_generic is relatively long and touched more data than
> what the cache can hold. I guess if you also see a lot of cache misses
> from even a uniprocessor test (on the same machine that you used for SMP
> tests), that probably means that the misses are coming from the small
> cache problem.

No PMC measurement on 1 CPU has been done yet. But data exists showing
that the execution time of csum_..() for a single web transaction
becomes longer (at most +80%) when the number of CPUs is increased
from 1 to 4.

Currently I have NIC trouble, which has to be solved first.

> kumon@flab.fujitsu.co.jp@vger.rutgers.edu on 05/19/2000 08:50:38 PM
>
> Please respond to kumon@flab.fujitsu.co.jp
>
> Sent by: owner-linux-kernel@vger.rutgers.edu

???

-- Computer Systems Laboratory, Fujitsu Labs. kumon@flab.fujitsu.co.jp



This archive was generated by hypermail 2b29 : Tue May 23 2000 - 21:00:20 EST