Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

From: Linus Torvalds
Date: Thu Feb 04 2016 - 17:57:33 EST

Next message: Jason Low: "Re: [PATCH] locking/mutex: Avoid spinner vs waiter starvation"
Previous message: Alexandre Belloni: "Re: [rtc-linux] Re: [PATCH v4 00/10] rtc: max77686: Extend driver and add max77802 support"
In reply to: Tom Herbert: "Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64"
Next in thread: Ingo Molnar: "Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, Feb 4, 2016 at 2:43 PM, Tom Herbert <tom@xxxxxxxxxxxxxxx> wrote:
>
> The reason I did this in assembly is precisely about the your point of
> having to close the carry chains with adcq $0. I do have a first
> implementation in C which using switch() to handle alignment, excess
> length less than 8 bytes, and the odd number of quads to sum in the
> main loop. gcc turns these switch statements into jump tables (not
> function tables which is what Ingo's example code was using). The
> problem I hit was that for each case I needed to close the carry chain
> in the inline asm so fall through wouldn't have much value and each
> case is expanded. The C version using switch gave a nice performance
> gain, moving to all assembly was somewhat better.

Yeah,. so I _think_ that if my approach works, the code generated for
the 0-8 byte case looks just something like

movq %rsi, %rax
andl $1, %eax
subq %rax, %rdi
movq %rdx, %rax
movq (%rdi), %rcx
andq mask(,%rsi,8), %rcx
addq %rcx,%rax
adcq $0,%rax

which is pretty close to optimal (that's with my hopefully fixed version).

That's not with the final narrowing from 64-bit to 32-bit, but it
looks like it should perform well on most x86 cores, and then the only
remaining branches end up being the actual size checking ones.

And yes, you could avoid a few "adc $0" instructions in asm, but quite
frankly, I think that's a tradeoff that is worth it.

My guess is that you can get to within a percent of optimal with the C
approach, and I really think it makes it easier to try different
things (ie things like the above that avoids the switch table
entirely)

> There is also question of alignment. I f we really don't need to worry
> about alignment at all on x86, then we should be able to eliminate the
> complexity of dealing with it.

So most x86 alignment issues are about crossing the cache bank size,
which is usually 16 or 32 bytes. Unaligned accesses *within* one of
those banks should be basically free (there's a whoppign big byte lane
shifter, so there's lots of hardware support for that).

Also, even when you do cross a cache bank, it's usually pretty cheap.
It extra cycles, but it's generally *less* extra cycles than it would
be to try to align things in software and doing two accesses.

The rule of thumb is that you should never worry about _individual_
unaligned accesses. It's only really worth it aligning things before
biggish loops. So aligning to an 8-byte boundary before you then do a
lot of "adcq" instructions makes sense, but worrying about unaligned
accesses for the beginning/end does generally not.

>>> For example, for the actual "8 bytes or shorter" case, I think
>> something like this might just work fine: [ snip ]
>
> I will look at doing that.

So I already found and fixed one bug in that approach, but I still
think it's a viable model.

But you will almost certainly have to fix a few more of my bugs before
it really works ;)

Linus

Next message: Jason Low: "Re: [PATCH] locking/mutex: Avoid spinner vs waiter starvation"
Previous message: Alexandre Belloni: "Re: [rtc-linux] Re: [PATCH v4 00/10] rtc: max77686: Extend driver and add max77802 support"
In reply to: Tom Herbert: "Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64"
Next in thread: Ingo Molnar: "Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]