Re: [PATCH] x86: Run checksumming in parallel across multiple ALUs

From: Eric Dumazet
Date: Mon Oct 14 2013 - 18:18:57 EST


On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
>
> > So, early testing results today. I wrote a test module that allocated a 4k
> > buffer, initialized it with random data, and called csum_partial on it 100000
> > times, recording the time at the start and end of that loop. Results on a 2.4
> > GHz Intel Xeon processor:
> >
> > Without patch: Average execute time for csum_partial was 808 ns
> > With patch: Average execute time for csum_partial was 438 ns
>
> Impressive, but could you try again with data out of cache?
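(For reference, the kind of test module Neil describes above would look
roughly like the sketch below. This is a reconstruction for illustration,
not his actual code; the module name and the ktime-based timing/reporting
details are assumptions.)

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/random.h>
#include <linux/ktime.h>
#include <asm/checksum.h>

static int __init csum_bench_init(void)
{
	void *buf = kmalloc(4096, GFP_KERNEL);
	ktime_t start, stop;
	__wsum sum = 0;
	int i;

	if (!buf)
		return -ENOMEM;

	/* fill the 4k buffer with random data */
	get_random_bytes(buf, 4096);

	/* time 100000 back-to-back csum_partial() calls */
	start = ktime_get();
	for (i = 0; i < 100000; i++)
		sum = csum_partial(buf, 4096, sum);
	stop = ktime_get();

	pr_info("csum_partial: avg %lld ns per call (sum=%x)\n",
		ktime_to_ns(ktime_sub(stop, start)) / 100000, (u32)sum);

	kfree(buf);
	return 0;
}

static void __exit csum_bench_exit(void)
{
}

module_init(csum_bench_init);
module_exit(csum_bench_exit);
MODULE_LICENSE("GPL");

Note that with the buffer reused on every iteration the data stays hot in
L1, which is exactly why the out-of-cache question below matters.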

So I tried your patch on a GRE tunnel and got the following results on a
single TCP flow (short version: no visible difference).


Using a prefetch 5*64(%[src]) helps more (see the patch at the end).

cpus : model name : Intel(R) Xeon(R) CPU X5660 @ 2.80GHz


Before patch :

lpq83:~# ./netperf -H 7.7.8.84 -l 20 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    20.00    7651.61     2.51     5.45     0.645   1.399


After patch :

lpq83:~# ./netperf -H 7.7.8.84 -l 20 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    20.00    7239.78     2.09     5.19     0.569   1.408

Profile on receiver

PerfTop: 1358 irqs/sec kernel:98.5% exact: 0.0% [1000Hz cycles], (all, 24 CPUs)
------------------------------------------------------------------------------------------------------------------------------------------------------------

    19.99%  [kernel]  [k] csum_partial
     7.04%  [kernel]  [k] copy_user_generic_string
     4.92%  [bnx2x]   [k] bnx2x_rx_int
     3.50%  [kernel]  [k] ipt_do_table
     2.86%  [kernel]  [k] __netif_receive_skb_core
     2.35%  [kernel]  [k] fib_table_lookup
     2.19%  [kernel]  [k] netif_receive_skb
     1.87%  [kernel]  [k] intel_idle
     1.65%  [kernel]  [k] kmem_cache_alloc
     1.64%  [kernel]  [k] ip_rcv
     1.51%  [kernel]  [k] kmem_cache_free


And the attached patch brings much better results:

lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00    8043.82     2.32     5.34     0.566   1.304

diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 9845371..f0e10fc 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
 			zero = 0;
 			count64 = count >> 3;
 			while (count64) {
-				asm("addq 0*8(%[src]),%[res]\n\t"
+				asm("prefetch 5*64(%[src])\n\t"
+				    "addq 0*8(%[src]),%[res]\n\t"
 				    "adcq 1*8(%[src]),%[res]\n\t"
 				    "adcq 2*8(%[src]),%[res]\n\t"
 				    "adcq 3*8(%[src]),%[res]\n\t"

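For context, here is roughly what the patched inner loop of do_csum() looks
like as a whole. This is reconstructed from the kernel source of that era
plus the hunk above, so treat it as a sketch rather than the exact file
contents:

	/* main loop using 64byte blocks */
	zero = 0;
	count64 = count >> 3;
	while (count64) {
		asm("prefetch 5*64(%[src])\n\t"	/* pull in data ~5 cache lines (320B) ahead */
		    "addq 0*8(%[src]),%[res]\n\t"
		    "adcq 1*8(%[src]),%[res]\n\t"
		    "adcq 2*8(%[src]),%[res]\n\t"
		    "adcq 3*8(%[src]),%[res]\n\t"
		    "adcq 4*8(%[src]),%[res]\n\t"
		    "adcq 5*8(%[src]),%[res]\n\t"
		    "adcq 6*8(%[src]),%[res]\n\t"
		    "adcq 7*8(%[src]),%[res]\n\t"
		    "adcq %[zero],%[res]"	/* fold in the final carry */
		    : [res] "=r" (result)
		    : [src] "r" (buff), [zero] "r" (zero),
		      "[res]" (result));
		buff += 64;
		count64--;
	}

The adcq chain is fully serialized by the carry flag, so on a cold buffer
the loop ends up bound by memory latency rather than ALU throughput; the
prefetch simply requests the cache line the loop will need a few iterations
from now so the loads hit in cache by the time the adds reach them.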
