Re: [PATCH v3 1/3] lib: find_*_bit reimplementation

From: Rasmus Villemoes
Date: Thu Feb 12 2015 - 04:59:01 EST


On Thu, Feb 12 2015, "George Spelvin" <linux@xxxxxxxxxxx> wrote:

>> Rasmus, your version has ANDing by mask, and resetting the mask at each iteration
>> of main loop. I think we can avoid it. What do you think on next?
>
> Yes, that's basically what I proposed (modulo checking for zero size and
> my buggy LAST_WORD_MASK).
>
> But two unconditional instructions in the loop are awfully minor; it's
> loads and conditional branches that cost.
>
> The reset of the mask can be done in parallel with other operations; it's
> only the AND that actually takes a cycle.
>
> I can definitely see the argument that, for code that's not used often
> enough to stay resident in the L1 cache, any speedup has to win by at
> least one L2 cache access to be worth taking another cache line.

That, and if the compiler was smart enough, the AND should actually be
free (at least on x86), since it can replace a test instruction. [1]

In any case, my code compiles to fewer bytes (though not an entire cache
line), and I think it is somewhat clearer - I'm obviously biased on the
latter point.

Rasmus


[1] Unfortunately, the compiler doesn't seem to be smart enough :-( When I
compile my code with gcc 5.0, I get

0000000000000000 <find_last_bit_rv>:
0: 53 push %rbx
1: 89 f1 mov %esi,%ecx
3: 48 8d 5e 3f lea 0x3f(%rsi),%rbx
7: f7 d9 neg %ecx
9: 48 c7 c2 ff ff ff ff mov $0xffffffffffffffff,%rdx
10: 48 83 e3 c0 and $0xffffffffffffffc0,%rbx
14: 48 d3 ea shr %cl,%rdx
17: eb 1a jmp 33 <find_last_bit_rv+0x33>
19: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
20: 48 89 d1 mov %rdx,%rcx
23: 48 23 0c df and (%rdi,%rbx,8),%rcx
27: 48 c7 c2 ff ff ff ff mov $0xffffffffffffffff,%rdx
2e: 48 85 c9 test %rcx,%rcx
31: 75 15 jne 48 <find_last_bit_rv+0x48>
33: 48 83 eb 01 sub $0x1,%rbx
37: 48 83 fb ff cmp $0xffffffffffffffff,%rbx
3b: 75 e3 jne 20 <find_last_bit_rv+0x20>
3d: 48 89 f0 mov %rsi,%rax
40: 5b pop %rbx
41: c3 retq
42: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
48: 48 89 cf mov %rcx,%rdi
4b: 48 c1 e3 06 shl $0x6,%rbx
4f: e8 00 00 00 00 callq 54 <find_last_bit_rv+0x54>
54: 48 98 cltq
56: 48 01 d8 add %rbx,%rax
59: 5b pop %rbx
5a: c3 retq
5b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)

the main loop is 20--3b. The test instruction at 2e seems to be
redundant. The same at 37: the sub instruction already sets plenty of
flags that could be used, so explicitly comparing %rbx to -1 seems
redundant.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/