Re: [PATCH] x86: Run checksumming in parallel across multiple ALUs

From: Ingo Molnar
Date: Wed Oct 16 2013 - 02:26:11 EST



* Joe Perches <joe@xxxxxxxxxxx> wrote:

> On Tue, 2013-10-15 at 09:41 +0200, Ingo Molnar wrote:
> > * Joe Perches <joe@xxxxxxxxxxx> wrote:
> >
> > > On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> > > > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > > > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > > > > > attached patch brings much better results
> > > > > >
> > > > > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > > > > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> > > > > > Recv   Send    Send                          Utilization       Service Demand
> > > > > > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> > > > > > Size   Size    Size     Time     Throughput  local    remote   local   remote
> > > > > > bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> > > > > >
> > > > > >  87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304
> > > > > >
> > > > > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > > > > []
> > > > > > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
> > > > > >  		zero = 0;
> > > > > >  		count64 = count >> 3;
> > > > > >  		while (count64) {
> > > > > > -		asm("addq 0*8(%[src]),%[res]\n\t"
> > > > > > +		asm("prefetch 5*64(%[src])\n\t"
> > > > >
> > > > > Might the prefetch size be too big here?
> > > >
> > > > To be effective, you need to prefetch well ahead of time.
> > >
> > > No doubt.
> >
> > So why did you ask then?
> >
> > > > 5*64 seems common practice (check arch/x86/lib/copy_page_64.S)
> > >
> > > 5 cachelines for some processors seems like a lot.
> >
> > What processors would that be?
>
> The ones where conservatism in L1 cache use is good because there are
> multiple threads running concurrently.

What specific processor models would that be?

> > Most processors have hundreds of cachelines even in their L1 cache.
>
> And sometimes that many executable processes too.

Nonsense, this is an unrolled loop that runs in softirq context most of the
time and does not get preempted.

> > Thousands in the L2 cache, up to hundreds of thousands.
>
> Irrelevant because prefetch doesn't apply there.

What planet are you living on? Prefetch pulls data from the L2 into the L1
cache just as much as it moves cachelines from memory into the L2 cache.
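
For illustration (userspace SSE intrinsics rather than the kernel's inline
asm, and exactly which level each hint targets is model specific), the
prefetch hint selects how far up the hierarchy a line is pulled:

  #include <xmmintrin.h>

  static void warm_lines(const char *p)
  {
          _mm_prefetch(p,       _MM_HINT_T0);  /* into every level, including L1 */
          _mm_prefetch(p + 64,  _MM_HINT_T1);  /* into L2 and below, skipping L1 */
          _mm_prefetch(p + 128, _MM_HINT_NTA); /* non-temporal, minimal cache pollution */
  }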

Especially in the use case cited here there will be a second use of the
data (when it is finally copied over into user-space), so the L2 cache size
very much matters.

The prefetches here matter mostly to the packet being processed: the ideal
size of the look-ahead window in csum_partial() is dictated by typical
memory latencies and bandwidth. The amount of parallelism is limited by
the number of carry bits we can maintain independently.
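
As a rough userspace sketch of those two constraints (this is not Eric's
patch nor the kernel's csum_partial; it assumes GCC/Clang extensions such
as __int128 and __builtin_prefetch, and it ignores alignment and the
sub-64-byte tail), two independent accumulators keep two carry chains in
flight on separate ALUs while the prefetch runs a few cachelines ahead of
the adds:

  #include <stddef.h>
  #include <stdint.h>

  /* Sum 64 bytes per iteration; 'a' and 'b' accumulate independently. */
  static uint64_t csum64_sketch(const uint64_t *p, size_t count64)
  {
          unsigned __int128 a = 0, b = 0;         /* two independent carry chains */

          while (count64 >= 8) {
                  __builtin_prefetch(p + 5 * 8);  /* look ahead ~5 cachelines (320 bytes) */
                  a += p[0]; a += p[1]; a += p[2]; a += p[3];
                  b += p[4]; b += p[5]; b += p[6]; b += p[7];
                  p += 8;
                  count64 -= 8;
          }
          a += b;
          /* fold the 128-bit sum back to 64 bits with an end-around carry */
          uint64_t lo = (uint64_t)a, hi = (uint64_t)(a >> 64);
          lo += hi;
          if (lo < hi)
                  lo++;
          return lo;
  }

Making the look-ahead window wider buys nothing once the loop is memory
bound, and adding more accumulators buys nothing once the carry chains are
no longer the bottleneck.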

> Ingo, Eric _showed_ that the prefetch is good here. How about a little
> optimization pass to find the minimal prefetch distance that still gives
> that level of performance?

Joe, instead of using a condescending tone in matters you clearly have
very little clue about, you might want to start doing some real kernel
hacking in more serious kernel areas, beyond trivial matters such as
printk strings, to gain a bit of experience and respect ...

Every word you uttered in this thread made it more likely for me to
redirect you to /dev/null, permanently.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/