Hi,
Here is a patch that speeds up csum_partial_copy_generic() on i686 SMP
by up to 50%, together with some analysis of further optimization from
the viewpoint of SMP cache behavior.
[Patch Summary]
The attached patch optimizes csum_partial_copy_generic().
Measured with WEB-BENCH, the time consumed in the function is reduced
by 33%, making it 1.5 times faster than the original.
[Background of csum_partial_copy_generic]
In the i686 version of the function, the long-word transfers are
unrolled 16 times, so one loop iteration copies exactly 64 bytes, apart
from the handling of the leading and trailing fractions.
Observation of csum_partial_copy_generic shows that it produces a lot
of cache misses, and this is the main reason for its slowness.
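Conceptually, one iteration of the unrolled loop does the following.
This is only a rough C sketch; the real code in
arch/i386/lib/checksum.S is hand-written assembly, and the function
name here is illustrative only.

	/* Rough C sketch of one 64-byte round of the copy-and-checksum loop. */
	static unsigned int csum_copy_64(const unsigned int *src,
	                                 unsigned int *dst,
	                                 unsigned int sum)
	{
	        int i;

	        for (i = 0; i < 16; i++) {      /* 16 long words = 64 bytes */
	                unsigned int w = src[i];

	                sum += w;
	                if (sum < w)            /* fold the carry back in, like adcl */
	                        sum++;
	                dst[i] = w;
	        }
	        return sum;
	}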
[How]
To accelerate the function, a dummy read is used as a prefetch. Of
course, prefetching must only be done within the accessible area. The
top word of a cache block should be the one prefetched, because on a
cache miss the CPU receives the requested word first, and the first
word is always needed earlier than the other words in the block.
To keep track of the block's top word, a new pointer is added which
always points to the first word of the cache block containing the 63rd
byte of the current 64-byte source block. This pointer never points
outside the source region, so it is safe to read. The patch makes no
sense unless the CPU has superscalar execution.
Strictly speaking, this prefetch may read at most 3 bytes past the end
of the source region, but this never causes trouble, because the excess
bytes and the last transferred byte reside in the same cache block.
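To illustrate the pointer arithmetic, here is a simplified C sketch of
the idea, assuming a 32-byte cache line as on the PentiumII/PPro. It is
not a line-by-line translation of the patch, and all names are
illustrative only.

	static void csum_copy_with_prefetch(const unsigned char *src,
	                                    unsigned char *dst,
	                                    unsigned int len)
	{
	        /* first byte of the cache line that holds byte 63 of the
	           current 64-byte source block; never outside the source */
	        const unsigned char *pf =
	                (const unsigned char *)(((unsigned long)(src + 63)) & ~31UL);
	        volatile unsigned char dummy;

	        while (len >= 64) {
	                /* dummy reads start the cache fills early; on a
	                   superscalar CPU they overlap with the copy below */
	                dummy = pf[-32];
	                dummy = pf[0];

	                /* ... the 16 copy-and-checksum rounds for this block ... */

	                src += 64;
	                dst += 64;
	                pf  += 64;
	                len -= 64;
	        }
	}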
[Benchmarking Result]
We used WEB-BENCH as the workload and measured three versions of
csum_partial_copy_generic.
Version synopsis:
2.3.99-pre8-base: the original kernel, but with SLAB_POISONING disabled
	so that performance can be compared with older kernels (the
	older data is not attached).
2.3.99-pre8-AS: with the patch from Artur Skawina <skawina@geocities.com>
	applied. This patch is attached at the end of this mail.
2.3.99-pre8-Pf: with my prefetch patch applied.
	This patch is also included in this mail.
We obtained the following profile.
Each number is the average time consumed (in us) to process one
web transaction.
The machine is a 4-way SMP Xeon, 450MHz, 2MB cache, with 2GB of memory
but without HIGHMEM, so only 1GB is actually recognized.
(Columns: Pf = 2.3.99-pre8-Pf, AS = 2.3.99-pre8-AS, base = 2.3.99-pre8-base)

    Pf      AS    base
 990.3  1023.8  1019.3  TOTAL(OS)
  64.9    58.2    54.4  default_idle
  24.8    21.8    20.8  cpu_idle
  63.4    94.8    98.3  csum_partial_copy_generic
  84.2    82.8    82.2  stext_lock
  76.0    76.8    77.7  boomerang_interrupt
  36.6    36.5    36.9  boomerang_rx
  33.4    33.6    34.3  boomerang_start_xmit
  28.6    29.0    29.4  schedule
  22.4    24.2    24.1  kmalloc
  21.5    23.5    22.0  kfree
  19.5    20.2    19.9  tcp_v4_rcv
  18.0    18.4    18.3  wait_for_completion
  17.9    18.3    17.9  __kfree_skb
  16.7    16.5    15.7  __wake_up
  11.0    11.3    10.8  do_IRQ
 (rest dropped)
csum_partial_copy_generic goes from 98.3us to 63.4us with the Pf patch,
so the patch gives roughly a 1.5x speedup.
Unfortunately, the AS version does not show a significant gain. When
the cache hits it may show some advantage, but in the current execution
environment that patch cannot hide the cache-miss latency.
In a user-land benchmark, the new patch also reduces the time even when
the source operand is aligned on a cache boundary.
With the patch applied, stext_lock reappears at the top of the
time-consumption race. Last time, stext_lock lost that position thanks
to the poll() kernel-lock avoidance.
**
The following shows the breakdown of stext_lock in the prefetched
version. do_close() is currently the worst spin-lock waiter; it holds
the lock for a relatively long time and thus tightens the bottleneck by
itself.
   us    from where                lockvar
 63.4    TOTAL
 --------------------------------------------------------
 12.1    do_close+144              0xc025b080 <- kernel_lock
  8.71   sys_fcntl+142             0xc025b080
  7.65   schedule+1684             0xc025b080
  6.27   sys_open+72               0xc025b080
  6.07   _fput+27                  0xc025b080
  5.97   old_mmap+301              0xc025b080
  5.53   sys_newstat+30            0xc025b080
  3.42   tcp_v4_rcv+610            0x2c(%ebx)
  2.95   boomerang_start_xmit+239  0x188(%ebx)
  1.09   tcp_accept+38             0x2c(%esi)
  0.65   wait_for_connect+621      0x2c(%ebp)
At this point, there are several different approaches to reducing the
kernel overhead:
1. Shorten the kernel-lock region in do_close() and elsewhere.
2. Reduce cache misses in csum_partial_copy_generic().
3. Use a different kind of NIC, because boomerang_interrupt() does too
   many in/out operations, which add heavy overhead. When we measured
   the eepro100, speedo_interrupt and the related functions needed only
   half the execution time of boomerang_interrupt and friends.
I think the second point, miss reduction, is very important.
A user program (apache) provides the network data first; if
csum_partial_copy_generic then runs on the same CPU, the cache should
hit. Yet another measurement shows that massive 2nd-level-cache misses
occur in csum_partial_copy_generic on data loads. Why? The time
reduction obtained by prefetching also supports the occurrence of these
misses.
I assume the data provider (the caller of write() or send()) and the
data consumer (csum_partial_copy_generic()) are not running on the same
CPU. If that is true, we could reduce the overhead by forcing both onto
the same CPU, but I have no idea how.
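As a side note, one rough way to test this assumption would be to
compare the CPU that ran the sender with the CPU that runs the
checksum. The sketch below is hypothetical: last_sender_cpu,
note_sender_cpu() and check_consumer_cpu() are invented names, and the
hooks would have to be placed by hand in the send path and next to
csum_partial_copy_generic(); only smp_processor_id() and printk() are
real kernel helpers.

	#include <linux/kernel.h>       /* printk */
	#include <linux/smp.h>          /* smp_processor_id */

	static int last_sender_cpu = -1;        /* invented, illustration only */

	/* hook placed in the write()/send() path */
	static inline void note_sender_cpu(void)
	{
	        last_sender_cpu = smp_processor_id();
	}

	/* hook placed next to the checksum/copy */
	static inline void check_consumer_cpu(void)
	{
	        if (last_sender_cpu != smp_processor_id())
	                printk("csum on CPU%d, data written on CPU%d\n",
	                       smp_processor_id(), last_sender_cpu);
	}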
The rest is PATCH.
------------------
Prefetch version. (Pf)
------------------
diff -rc linux-2.3.99-pre8/arch/i386/lib/checksum.S linux-2.3.99-pre8-Pf/arch/i386/lib/checksum.S
*** linux-2.3.99-pre8/arch/i386/lib/checksum.S Thu Mar 23 07:23:54 2000
--- linux-2.3.99-pre8-Pf/arch/i386/lib/checksum.S Mon May 15 22:45:41 2000
***************
*** 394,419 ****
movl ARGBASE+8(%esp),%edi #dst
movl ARGBASE+12(%esp),%ecx #len
movl ARGBASE+16(%esp),%eax #sum
! movl %ecx, %edx
movl %ecx, %ebx
shrl $6, %ecx
andl $0x3c, %ebx
negl %ebx
subl %ebx, %esi
subl %ebx, %edi
lea 3f(%ebx,%ebx), %ebx
testl %esi, %esi
jmp *%ebx
1: addl $64,%esi
addl $64,%edi
ROUND1(-64) ROUND(-60) ROUND(-56) ROUND(-52)
ROUND (-48) ROUND(-44) ROUND(-40) ROUND(-36)
ROUND (-32) ROUND(-28) ROUND(-24) ROUND(-20)
ROUND (-16) ROUND(-12) ROUND(-8) ROUND(-4)
3: adcl $0,%eax
dec %ecx
jge 1b
! 4: andl $3, %edx
jz 7f
cmpl $2, %edx
jb 5f
--- 394,425 ----
movl ARGBASE+8(%esp),%edi #dst
movl ARGBASE+12(%esp),%ecx #len
movl ARGBASE+16(%esp),%eax #sum
! # movl %ecx, %edx
movl %ecx, %ebx
+ movl %esi, %edx
shrl $6, %ecx
andl $0x3c, %ebx
negl %ebx
subl %ebx, %esi
subl %ebx, %edi
+ lea -1(%esi),%edx
+ andl $-32,%edx
lea 3f(%ebx,%ebx), %ebx
testl %esi, %esi
jmp *%ebx
1: addl $64,%esi
addl $64,%edi
+ SRC(movb -32(%edx),%bl) ; SRC(movb (%edx),%bl)
ROUND1(-64) ROUND(-60) ROUND(-56) ROUND(-52)
ROUND (-48) ROUND(-44) ROUND(-40) ROUND(-36)
ROUND (-32) ROUND(-28) ROUND(-24) ROUND(-20)
ROUND (-16) ROUND(-12) ROUND(-8) ROUND(-4)
3: adcl $0,%eax
+ addl $64, %edx
dec %ecx
jge 1b
! 4: movl ARGBASE+12(%esp),%edx #len
! andl $3, %edx
jz 7f
cmpl $2, %edx
jb 5f
------------------
Artur Skawina patch. (AS)
------------------
diff -urNp /img/linux-2.3.99pre6pre5/arch/i386/lib/checksum.S linux-2.3.99pre6pre5as/arch/i386/lib/checksum.S
--- /img/linux-2.3.99pre6pre5/arch/i386/lib/checksum.S Wed Mar 29 20:53:25 2000
+++ linux-2.3.99pre6pre5as/arch/i386/lib/checksum.S Sat Apr 22 10:43:28 2000
@@ -374,81 +373,119 @@ DST( movb %cl, (%edi) )
/* Version for PentiumII/PPro */
+/*
+ This is
+ o 70% slower when the source is not 32 bit aligned [ie (long)src&3]
+ o 190% slower when the destination is not 32 bit aligned
+ o 260% slower when both source and destination are not 32 bit aligned
+ o 175% slower when destination is not 64 bit aligned and source _is_ [ie (long)dst&4]
+ o whether source is 64 bit aligned or not does not seem to make much difference
+ */
+
#define ROUND1(x) \
- SRC(movl x(%esi), %ebx ) ; \
- addl %ebx, %eax ; \
- DST(movl %ebx, x(%edi) ) ;
+ SRC(movl x(%esi), %edx ) ;\
+ addl %edx, %eax ;\
+ SRC(movl x+4(%esi), %ebx ) ;\
+ DST(movl %edx, x(%edi) ) ;\
+ adcl %ebx, %eax ;\
+ DST(movl %ebx, x+4(%edi) ) ;\
#define ROUND(x) \
- SRC(movl x(%esi), %ebx ) ; \
- adcl %ebx, %eax ; \
- DST(movl %ebx, x(%edi) ) ;
+ SRC(movl x(%esi), %edx ) ;\
+ adcl %edx, %eax ;\
+ SRC(movl x+4(%esi), %ebx ) ;\
+ DST(movl %edx, x(%edi) ) ;\
+ adcl %ebx, %eax ;\
+ DST(movl %ebx, x+4(%edi) ) ;\
+
+#define ROUNDL(x) \
+ SRC(movl x(%esi), %edx ) ;\
+ adcl %edx, %eax ;\
+ SRC(movl x+4(%esi), %ebx ) ;\
+ adcl %ebx, %eax ;\
+ DST(movl %edx, x(%edi) ) ;\
+ DST(movl %ebx, x+4(%edi) ) ;\
#define ARGBASE 12
csum_partial_copy_generic:
pushl %ebx
- pushl %edi
+ movl ARGBASE+12-4*2(%esp),%ebx #len
pushl %esi
- movl ARGBASE+4(%esp),%esi #src
- movl ARGBASE+8(%esp),%edi #dst
- movl ARGBASE+12(%esp),%ecx #len
- movl ARGBASE+16(%esp),%eax #sum
- movl %ecx, %edx
- movl %ecx, %ebx
- shrl $6, %ecx
- andl $0x3c, %ebx
+ movl ARGBASE+4-4*1(%esp),%esi #src
+ movl %ebx, %ecx
+ pushl %edi
+ movl ARGBASE+8-4*0(%esp),%edi #dst
+ andl $0x38, %ebx
+ addl %ebx, %esi
+ shrl $6, %ecx # len /= 64 (number of longwords per iteration)
+ addl %ebx, %edi
negl %ebx
- subl %ebx, %esi
- subl %ebx, %edi
+ movl ARGBASE+16-4*0(%esp),%eax #sum
lea 3f(%ebx,%ebx), %ebx
- testl %esi, %esi
+ testl %eax,%eax # CF=0
jmp *%ebx
-1: addl $64,%esi
+1:
+ addl $64,%esi
addl $64,%edi
- ROUND1(-64) ROUND(-60) ROUND(-56) ROUND(-52)
- ROUND (-48) ROUND(-44) ROUND(-40) ROUND(-36)
- ROUND (-32) ROUND(-28) ROUND(-24) ROUND(-20)
- ROUND (-16) ROUND(-12) ROUND(-8) ROUND(-4)
-3: adcl $0,%eax
+ ROUND1(-64) ROUND (-56)
+ ROUND (-48) ROUND (-40)
+ ROUND (-32) ROUND (-24)
+ ROUND (-16) ROUNDL(-8)
+3:
+ adcl $0,%eax
dec %ecx
jge 1b
-4: andl $3, %edx
+
+ movl ARGBASE+12(%esp),%edx #len
+
+ testl $4,%edx
+ jz 4f
+ SRC(movl (%esi), %ebx )
+ addl %ebx, %eax
+ DST(movl %ebx, (%edi) )
+ leal 4(%esi), %esi
+ leal 4(%edi), %edi
+ adcl $0, %eax
+4:
+ andl $3, %edx
jz 7f
cmpl $2, %edx
jb 5f
SRC( movw (%esi), %dx )
- leal 2(%esi), %esi
DST( movw %dx, (%edi) )
- leal 2(%edi), %edi
je 6f
+ leal 2(%esi), %esi
shll $16,%edx
+ leal 2(%edi), %edi
5:
SRC( movb (%esi), %dl )
DST( movb %dl, (%edi) )
-6: addl %edx, %eax
+6:
+ addl %edx, %eax
adcl $0, %eax
7:
.section .fixup, "ax"
-6001: movl ARGBASE+20(%esp), %ebx # src_err_ptr
- movl $-EFAULT, (%ebx)
+6001:
# zero the complete destination (computing the rest is too much work)
movl ARGBASE+8(%esp),%edi # dst
movl ARGBASE+12(%esp),%ecx # len
+ movl ARGBASE+20(%esp), %ebx # src_err_ptr
xorl %eax,%eax
+ movl $-EFAULT, (%ebx)
rep; stosb
- jmp 7b
+ jmp 7b
6002: movl ARGBASE+24(%esp), %ebx # dst_err_ptr
movl $-EFAULT, (%ebx)
jmp 7b
.previous
- popl %esi
popl %edi
+ popl %esi
popl %ebx
ret
#undef ROUND
#undef ROUND1
-
+
#endif
-- Computer Systems Laboratory, Fujitsu Labs. kumon@flab.fujitsu.co.jp