Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+

From: Benjamin Gilbert
Date: Sat Jun 09 2007 - 20:34:21 EST


Jeff Garzik wrote:
Matt Mackall wrote:
Have you benchmarked this against lib/sha1.c? Please post the results.
Until then, I'm frankly skeptical that your unrolled version is faster
because when I introduced lib/sha1.c the rolled version therein won by
a significant margin and had 1/10th the cache footprint.

See the benchmark tables in patch 0 at the head of this thread. Performance improved by at least 25% in every test, and 40-60% was more common for the 32-bit version (on a Pentium IV).

It's not just the loop unrolling; it's the register allocation and spilling. For comparison, I built SHATransform() from the drivers/char/random.c in 2.6.11, using gcc 3.3.5 with -O2 and SHA_CODE_SIZE == 3 (i.e., fully unrolled); I'm guessing this is pretty close to what you tested back then. The resulting code is 49% MOV instructions, and 80% of *those* involve memory. gcc4 is somewhat better, but it still spills a whole lot, both for the 2.6.11 unrolled code and for the current lib/sha1.c.

In contrast, the assembly implementation in this patch only has to go to memory for data and workspace (with one small exception in the F3 rounds), and the workspace has a fifth of the cache footprint of the default implementation.

Yes. And it also depends on the CPU as well. Testing on a server-class x86 CPU (often with bigger L2, and perhaps even L1, cache) will produce different result than from popular but less-capable "value" CPUs.

Good point. I benchmarked the 32-bit assembly code on a couple more boxes:

=== AMD Duron, average of 5 trials ===
Test# Bytes/ Bytes/ Cyc/B Cyc/B Change
block update (C) (asm)
0 16 16 104 72 31%
1 64 16 52 36 31%
2 64 64 45 29 36%
3 256 16 33 23 30%
4 256 64 27 17 37%
5 256 256 24 14 42%
6 1024 16 29 20 31%
7 1024 256 20 11 45%
8 1024 1024 19 11 42%
9 2048 16 28 20 29%
10 2048 256 19 11 42%
11 2048 1024 18 10 44%
12 2048 2048 18 10 44%
13 4096 16 28 19 32%
14 4096 256 18 10 44%
15 4096 1024 18 10 44%
16 4096 4096 18 10 44%
17 8192 16 27 19 30%
18 8192 256 18 10 44%
19 8192 1024 18 10 44%
20 8192 4096 17 10 41%
21 8192 8192 17 10 41%

=== Classic Pentium, average of 5 trials ===
Test# Bytes/ Bytes/ Cyc/B Cyc/B Change
block update (C) (asm)
0 16 16 145 144 1%
1 64 16 72 61 15%
2 64 64 65 52 20%
3 256 16 46 39 15%
4 256 64 39 32 18%
5 256 256 36 29 19%
6 1024 16 40 33 18%
7 1024 256 30 23 23%
8 1024 1024 29 23 21%
9 2048 16 39 32 18%
10 2048 256 29 22 24%
11 2048 1024 28 22 21%
12 2048 2048 28 22 21%
13 4096 16 38 32 16%
14 4096 256 28 22 21%
15 4096 1024 28 21 25%
16 4096 4096 27 21 22%
17 8192 16 38 32 16%
18 8192 256 28 22 21%
19 8192 1024 28 21 25%
20 8192 4096 27 21 22%
21 8192 8192 27 21 22%

The improvement isn't as good, but it's still noticeable.

--Benjamin Gilbert

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/