Re: [PATCH v2 1/3] lib/find_bit: introduce FIND_FIRST_BIT() macro

From: Yury Norov
Date: Wed Aug 24 2022 - 09:19:29 EST


On Wed, Aug 24, 2022 at 12:10:02PM +0300, Andy Shevchenko wrote:
> On Wed, Aug 24, 2022 at 4:51 AM Yury Norov <yury.norov@xxxxxxxxx> wrote:
> >
> > Now that we have many flavors of find_first_bit(), and expect even more,
> > it's better to have one macro that generates optimal code for all and makes
> > maintaining of slightly different functions simpler.
> >
> > The logic common to all versions is moved to the new macro, and all the
> > flavors are generated by providing an EXPRESSION macro-parameter, like
> > in this example:
> >
> > #define FIND_FIRST_BIT(EXPRESSION, size) ...
> >
> > find_first_ornot_and_bit(addr1, addr2, addr3, size)
> > {
> > return FIND_NEXT_BIT(addr1[idx] | ~addr2[idx] & addr3[idx], size);
> > }
> >
> > The EXPRESSION may be of any complexity, as soon as it only refers
> > the bitmap(s) and an iterator idx.
> >
> > The 'word_op' is here to allow the macro to generate code for _le
> > versions on big-endian machines; used in the following patches.
>
> ...
>
> > +#ifndef word_op
> > +#define word_op
> > +#endif
>
> Not sure about the naming without namespace. Perhaps __ffs_word_op?
>
> > +#define FIND_FIRST_BIT(EXPRESSION, size) \
> > +({ \
> > + unsigned long idx, val, sz = (size); \
> > + \
> > + for (idx = 0; idx * BITS_PER_LONG < sz; idx++) { \
>
> I think we can do slightly better:
>
> for (unsigned long idx = 0; idx < sz; idx += BITS_PER_LONG) {
> unsigned long val;

This will blow up the EXPRESSION. We can mitigate it on user side:
find_first_bit(addr, size)
{
return FIND_FIRST_BIT(addr[idx/BITS_PER_LONG], size);
}

But to me it's a wtf++.

And generated code looks almost the same, except that
on x86_64 your version is bigger. Compare before:
0000000000000000 <_find_first_bit>:
0: mov %rsi,%rax
3: test %rsi,%rsi
6: je 35 <_find_first_bit+0x35>
8: xor %edx,%edx
a: jmp 19 <_find_first_bit+0x19>
c: add $0x40,%rdx // Track bits and
10: add $0x8,%rdi // index separately
14: cmp %rax,%rdx
17: jae 35 <_find_first_bit+0x35>
19: mov (%rdi),%rcx
1c: test %rcx,%rcx
1f: je c <_find_first_bit+0xc>
21: tzcnt %rcx,%rcx
26: add %rdx,%rcx
29: cmp %rcx,%rax
2c: cmova %rcx,%rax
30: jmp 35 <_find_first_bit+0x35>
35: jmp 3a <_find_first_bit+0x3a>
3a: nopw 0x0(%rax,%rax,1)

And after:
0000000000000000 <_find_first_bit>:
0: mov %rsi,%rax
3: test %rsi,%rsi
6: je 39 <_find_first_bit+0x39>
8: xor %edx,%edx
a: jmp 15 <_find_first_bit+0x15>
c: add $0x40,%rdx // Track bits only
10: cmp %rdx,%rax
13: jbe 39 <_find_first_bit+0x39>
15: mov %rdx,%rcx
18: shr $0x6,%rcx // But divide here
1c: mov (%rdi,%rcx,8),%rcx
20: test %rcx,%rcx
23: je c <_find_first_bit+0xc>
25: tzcnt %rcx,%rcx
2a: add %rcx,%rdx
2d: cmp %rdx,%rax
30: cmova %rdx,%rax
34: jmp 39 <_find_first_bit+0x39>
39: jmp 3e <_find_first_bit+0x3e>
3e: xchg %ax,%ax // Which adds 4 bytes to .text

Thanks,
Yury

> > + val = (EXPRESSION); \
> > + if (val) { \
> > + sz = min(idx * BITS_PER_LONG + __ffs(word_op(val)), sz);\
>
> sz = min(idx + __ffs(...));
>
> > + break; \
> > + } \
> > + } \
> > + \
> > + sz; \
> > +})
>
>
> --
> With Best Regards,
> Andy Shevchenko