Re: [PATCH] Fast csum_partial_copy_generic and more

From: kumon@flab.fujitsu.co.jp
Date: Fri May 19 2000 - 19:31:02 EST


Artur Skawina writes:
> kumon@flab.fujitsu.co.jp wrote:
> quite possible. it seems, assuming your numbers are accurate, i gave
> up investigating the prefetching too early. it was pretty obvious
> that on a p3 the prefetch instructions would give a speedup, but
> i wasn't sure the dummy read overhead would be worth it on p2.

Before I put the prefetch into the kernel, I was quite confident that
the slowness came from massive cache misses. So I wrote a test program
to reproduce that situation. During the test, I made the source area
take the following cache-state combinations.

 CPU-A is the test CPU to run the csum..() function.

         CPU-A  CPU-B
  case0    I      I
  case1    I      M
  case2    I      E
  case3    E      I

 M: modified, E: exclusive, S: shared, I: invalid

In the csum..() case the source area is, as already known, not always
cache-line aligned, so I misaligned it by x bytes (0 <= x <= 31) to
simulate this. Also, observing kernel activity shows that the
destination is always cache aligned when data are moved from user
space to the kernel during web transfers.

So my benchmark set up the situations above and I measured them.
I found that prefetching gives a very big advantage.
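
For illustration, a minimal user-space sketch of reproducing case1
(SRC lines Invalid on the test CPU, Modified on the other CPU) might
look like the following. The thread pinning, buffer size and the
simplified byte-wise sum are my own assumptions for this sketch, not
the actual test program:

/* Hypothetical sketch of case1: CPU-B dirties the source buffer so its
 * cache lines become Modified there (and Invalid on CPU-A), then CPU-A
 * sums the buffer with the source misaligned by x bytes. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

#define BUFSZ (64 * 1024)
static unsigned char src[BUFSZ + 64];

static void pin_to_cpu(int cpu)
{
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* CPU-B writes the buffer: its lines end up Modified in CPU-B's cache. */
static void *dirty_on_cpu_b(void *arg)
{
	(void)arg;
	pin_to_cpu(1);
	memset(src, 0x5a, sizeof(src));
	return NULL;
}

/* Stand-in for csum..(): a simple byte-wise sum, just to touch the data. */
static unsigned int touch_sum(const unsigned char *p, int len)
{
	unsigned int sum = 0;
	while (len--)
		sum += *p++;
	return sum;
}

int main(void)
{
	pthread_t t;
	unsigned int x, sum = 0;

	pin_to_cpu(0);				/* main thread is "CPU-A" */
	for (x = 0; x <= 31; x++) {
		/* Re-establish case1 before every run. */
		pthread_create(&t, NULL, dirty_on_cpu_b, NULL);
		pthread_join(&t, NULL);		/* lines Modified on CPU 1 */
		/* The measured run: source misaligned by x bytes. */
		sum += touch_sum(src + x, BUFSZ);
	}
	printf("%x\n", sum);
	return 0;
}

Case2 and case3 would be set up analogously, by having CPU-B only read
the buffer, or by having CPU-A itself touch it before the run.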

> > Strictly speaking, this prefetch may read just past the source region,
> > at most 3 bytes. But it never causes trouble, because this excessive area
>
> what you could do is to not use SRC(), but have a dummy exception
> handler. (yeah, this would solve Andrea's "buffer overflow" too ;)

This excess read never causes a false report of an illegal region,
because if the prefetch faults, the real transfer must fault as well.
Your suggestion is still valid, though, since the exception detection
on the prefetch is redundant anyway. But Andrea's check adds no extra
cost.
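
To make the point about the excess read concrete, here is a rough C
sketch of the dummy-read prefetch. The real code is x86 assembly in
csum_partial_copy_generic(); the 32-byte line size, the prefetch
distance and the simplified copy/sum loop below are illustrative
assumptions only:

#include <stdint.h>
#include <string.h>

#define CACHE_LINE	32		/* P6-family line size */
#define PREFETCH_AHEAD	(4 * CACHE_LINE)

unsigned int csum_copy_sketch(const unsigned char *src,
			      unsigned char *dst, int len)
{
	unsigned int sum = 0;
	uint32_t w;
	int i;

	for (i = 0; i + 4 <= len; i += 4) {
		if ((i & (CACHE_LINE - 1)) == 0 && i + PREFETCH_AHEAD < len)
			/* Dummy read acting as a software prefetch.  Since
			 * it loads a whole 32-bit word, the last such read
			 * may touch up to 3 bytes past the end of the
			 * source region. */
			(void)*(volatile const uint32_t *)
				(src + i + PREFETCH_AHEAD);
		memcpy(&w, src + i, 4);			/* copy ... */
		memcpy(dst + i, &w, 4);
		sum += (w & 0xffff) + (w >> 16);	/* ... and sum */
	}
	for (; i < len; i++) {				/* trailing bytes */
		dst[i] = src[i];
		sum += src[i];
	}
	while (sum >> 16)				/* fold to 16 bits */
		sum = (sum & 0xffff) + (sum >> 16);
	return sum;
}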

> I'll play with the patch, try to reproduce your numbers, and see
> if merging both patches would be a win.
> It won't likely happen until after the weekend however.

I've done several experiments, including merging your patch and
prefetching the DST area. DST prefetching showed bad performance. The
gain from your patch is difficult to judge, because the difference was
so small. In the experiments, prefetching was very effective in case1:
it gives a 35% reduction in execution time, almost the same value as
with the real data.

So I suspect the producer and the consumer run on different CPUs.
Another measurement using the PMC (performance counters) also shows
that the SRC area hits a lot of I-state cache lines, and the DST area
hits both I-state and S-state lines. These are very bad symptoms in an
SMP environment, i.e. the buffers are shared between the wrong CPUs.
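
The PMC setup itself was specific to the P6 counters. As a rough
modern stand-in, a measurement of the same flavour can be taken from
user space with perf_event_open(), counting last-level-cache read
misses around the code under test; the event choice and the measured
loop here are assumptions for illustration:

/* Count LLC read misses around a piece of code, as a user-space
 * approximation of the PMC measurement mentioned above. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

static int open_llc_read_misses(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_HW_CACHE;
	attr.size = sizeof(attr);
	attr.config = PERF_COUNT_HW_CACHE_LL |
		      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
		      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
	attr.disabled = 1;
	attr.exclude_kernel = 1;

	/* Count for this thread, on any CPU. */
	return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
	static unsigned char buf[64 * 1024];
	unsigned int i, sum = 0;
	long long misses;
	int fd = open_llc_read_misses();

	if (fd < 0)
		return 1;

	ioctl(fd, PERF_EVENT_IOC_RESET, 0);
	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

	for (i = 0; i < sizeof(buf); i++)	/* the code under test */
		sum += buf[i];

	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
	read(fd, &misses, sizeof(misses));
	printf("sum=%u, LLC read misses=%lld\n", sum, misses);
	return 0;
}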

The DST area has an intrinsically shared nature. Based on my
understanding, the DST area holds transmit data passed to the NICs,
which means data sharing between the CPU and the NICs is inevitable.
P6 CPUs have a write buffer which can absorb some of that overhead.

But the SRC area should not be shared.

I hope my explanation helps you squeeze out more performance.

--
Computer Systems Laboratory, Fujitsu Labs.
kumon@flab.fujitsu.co.jp



