Re: [PATCH 3/4] bitops: squeeze even more out of fns()

From: Yury Norov
Date: Fri May 03 2024 - 12:15:04 EST


On Fri, May 03, 2024 at 10:19:10AM +0800, Kuan-Wei Chiu wrote:
> +Cc Chin-Chun Chen & Ching-Chun (Jim) Huang
>
> On Thu, May 02, 2024 at 04:32:03PM -0700, Yury Norov wrote:
> > The function clears N-1 first set bits to find the N'th one with:
> >
> > while (word && n--)
> > word &= word - 1;
> >
> > In the worst case, it would take 63 iterations.
> >
> > Instead of linear walk through the set bits, we can do a binary search
> > by using hweight(). This would work even better on platforms supporting
> > hardware-assisted hweight() - pretty much every modern arch.
> >
> Chin-Chun once proposed a method similar to binary search combined with
> hamming weight and discussed it privately with me and Jim. However,
> Chin-Chun found that binary search would actually impair performance
> when n is small. Since we are unsure about the typical range of n in
> our actual workload, we have not yet proposed any relevant patches. If
> considering only the overall benchmark results, this patch looks good
> to me.

fns() is used only as a helper to find_nth_bit().

In the kernel, find_nth_bit() is used in
 - bitmap_bitremap(),
 - bitmap_remap(), and
 - cpumask_local_spread() via sched_numa_find_nth_cpu()

with the bit to search calculated as n = n % cpumask_weight(). This
effectively implies a uniformly distributed random n and word, just
like in test_fns().
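
For illustration only, here is a userspace sketch of the two approaches
under discussion. It is not the code from the patch: the names
fns_linear(), fns_bsearch() and fns_selfcheck() are made up for this
example, the word is fixed at 64 bits, and the GCC/Clang builtins
__builtin_popcountll() and __builtin_ctzll() stand in for hweight64()
and __ffs(). As in fns(), n is 0-based and 64 means "no such bit".

```c
#include <assert.h>
#include <stdint.h>

/* Linear walk: clear the n lowest set bits, then take the lowest one left. */
static unsigned int fns_linear(uint64_t word, unsigned int n)
{
	while (word && n--)
		word &= word - 1;
	return word ? (unsigned int)__builtin_ctzll(word) : 64;
}

/* Binary search: halve the window, using popcount to pick the half. */
static unsigned int fns_bsearch(uint64_t word, unsigned int n)
{
	unsigned int pos = 0, width, w;

	if (n >= (unsigned int)__builtin_popcountll(word))
		return 64;

	for (width = 32; width; width >>= 1) {
		/* Count set bits in the lower half of the current window. */
		w = (unsigned int)__builtin_popcountll((word >> pos) &
						       ((1ULL << width) - 1));
		if (n >= w) {
			/* Target is in the upper half: skip the lower one. */
			n -= w;
			pos += width;
		}
	}
	return pos;
}

/* Compare both versions on pseudo-random words and every n in 0..64. */
static void fns_selfcheck(void)
{
	uint64_t x = 0x9E3779B97F4A7C15ULL;
	unsigned int i, n;

	for (i = 0; i < 1000; i++) {
		/* xorshift64 step, a stand-in for a random word */
		x ^= x << 13;
		x ^= x >> 7;
		x ^= x << 17;
		for (n = 0; n <= 64; n++)
			assert(fns_linear(x, n) == fns_bsearch(x, n));
	}
}
```

The linear version costs up to n iterations, while the binary search
always takes six steps for a 64-bit word. That is the trade-off
mentioned above: the search only loses when n is very small.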

In rebalance_wq_table() in drivers/crypto/intel/iaa/iaa_crypto_main.c
it's used like:

	for (cpu = 0; cpu < nr_cpus_per_node; cpu++) {
		int node_cpu = cpumask_nth(cpu, node_cpus);
		...
	}

This is an API abuse, and should be rewritten with for_each_cpu().
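
To show why, here is a rough userspace analogy on a single 64-bit mask.
It is not kernel code: nth_bit(), visit_by_nth() and visit_by_iter() are
hypothetical names, with nth_bit() standing in for cpumask_nth() and the
second loop standing in for for_each_cpu().

```c
#include <stdint.h>

/* Stand-in for cpumask_nth(): index of the n-th (0-based) set bit, 64 if none. */
static unsigned int nth_bit(uint64_t mask, unsigned int n)
{
	while (mask && n--)
		mask &= mask - 1;
	return mask ? (unsigned int)__builtin_ctzll(mask) : 64;
}

/* The pattern above: every index restarts the search from bit 0. */
static unsigned int visit_by_nth(uint64_t mask, unsigned int *out)
{
	unsigned int i, cnt = (unsigned int)__builtin_popcountll(mask);

	for (i = 0; i < cnt; i++)
		out[i] = nth_bit(mask, i);
	return cnt;
}

/* Iterator pattern: each set bit is visited exactly once. */
static unsigned int visit_by_iter(uint64_t mask, unsigned int *out)
{
	unsigned int cnt = 0;

	for (; mask; mask &= mask - 1)
		out[cnt++] = (unsigned int)__builtin_ctzll(mask);
	return cnt;
}
```

Both functions fill out[] with the same indices in the same order, but
the first one repeats the search from scratch on every iteration.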

In cpumask_any_housekeeping() at arch/x86/kernel/cpu/resctrl/internal.h
it's used like:

	hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
	if (hk_cpu == exclude_cpu)
		hk_cpu = cpumask_nth_andnot(1, mask, tick_nohz_full_mask);

	if (hk_cpu < nr_cpu_ids)
		cpu = hk_cpu;

And this is another example of API abuse. We need to introduce a new
helper, cpumask_andnot_any_but(), and use it like:

	hk_cpu = cpumask_andnot_any_but(exclude_cpu, mask, tick_nohz_full_mask);
	if (hk_cpu < nr_cpu_ids)
		cpu = hk_cpu;
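
The semantics I have in mind can be sketched in userspace on a single
64-bit word. This is only an illustration of the idea, not the kernel
helper: andnot_any_but() and NBITS are hypothetical names, and NBITS
plays the role of nr_cpu_ids as the "nothing found" return value.

```c
#include <stdint.h>

#define NBITS 64u

/*
 * Return any bit that is set in mask, clear in excl_mask, and different
 * from 'exclude'; NBITS if there is no such bit.
 */
static unsigned int andnot_any_but(unsigned int exclude, uint64_t mask,
				   uint64_t excl_mask)
{
	uint64_t cand = mask & ~excl_mask;

	/* Drop the excluded bit, guarding against an out-of-range shift. */
	if (exclude < NBITS)
		cand &= ~(1ULL << exclude);

	return cand ? (unsigned int)__builtin_ctzll(cand) : NBITS;
}
```

This is a single masked lookup, with no need to ask for the 0th and then
the 1st bit.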

So, where the use of find_nth_bit() is legitimate, the parameters are
distributed like in the test, and I would expect the real-life
performance impact to be similar to the test.

Optimizing the helper for non-legitimate cases isn't worth the
effort.

Thanks,
Yury