Re: [PATCH 2/2] bitops: rotate: Add riscv implementation using Zbb extension

From: Yury Norov
Date: Thu Jul 03 2025 - 12:59:20 EST


On Wed, Jul 02, 2025 at 11:11:35AM +0100, David Laight wrote:
> On Tue, 1 Jul 2025 14:32:14 -0400
> Yury Norov <yury.norov@xxxxxxxxx> wrote:
>
> I'd not worry about rotates of 8 bits or more (for ror8).
> They can be treated as 'undefined behaviour' under the assumption they don't happen.

Good for you. But generic implementation is safe against overflowing
the shift, so the arch must be safe as well.

> The 'generic' version needs them to get gcc to generate a 'rorb' on x86.
> The negated shift needs masking so that clang doesn't throw the code away when
> the value is constant.

...

> > > I compared the performance of ror8 (zbb optimized) and generic_ror8 on the XUANTIE C908
> > > by looping them. ror8 is better, and the advantage of ror8 becomes more obvious as the
> > > number of iterations increases. The test code is as follows:
> > > ```
> > > u8 word = 0x5a;
> > > u32 shift = 9;
> > > u32 i, loop = 100;
> > > u8 ret1, ret2;
> > >
> > > u64 t1 = ktime_get_ns();
> > > for (i = 0; i < loop; i++) {
> > > ret2 = generic_ror8(word, shift);
> > > }
> > > u64 t2 = ktime_get_ns();
> > > for (i = 0; i < loop; i++) {
> > > ret1 = ror8(word, shift);
> > > }
> > > u64 t3 = ktime_get_ns();
> > >
> > > pr_info("t2-t1=%lld t3-t2=%lld\n", t2 - t1, t3 - t2);
> > > ```
> >
> > Please do the following:
> >
> > 1. Drop the generic_ror8() and keep only ror/l8()
> > 2. Add ror/l16, 34 and 64 tests.
> > 3. Adjust the 'loop' so that each subtest will take 1-10 ms on your hw.
>
> That is far too many iterations.
> You'll get interrupts dominating the tests.

That's interesting observation. Can you show numbers for your
hardware?

> The best thing is to do 'just enough' iterations to get a meaningful result,
> and then repeat a few times and report the fastest (or average excluding
> any large outliers).
>
> You also need to ensure the compiler doesn't (or isn't allowed to) pull
> the contents of the inlined function outside the loop - and then throw
> the loop away,

Not me - Chen Pei needs. I wrote __always_used for it. It should
help.

> The other question is whether any of it is worth the effort.
> How many ror8() and ror16() calls are there?
> I suspect not many.

I'm not a RISC-V engineer, and I can't judge how they want to use the
functions. This doesn't bring significant extra burden on generic
side, so I don't object against arch ror8() on RISCs.

> Improving the generic ones might be worth while.
> Perhaps moving the current versions to x86 only.
> (I suspect the only other cpu with byte/short rotates is m68k)
>
> David