Re: [PATCH 1/2] lib/find: Make functions safe on changing bitmaps

From: Mirsad Todorovac
Date: Fri Oct 27 2023 - 11:51:14 EST


On 10/27/2023 5:51 AM, Yury Norov wrote:
> On Wed, Oct 25, 2023 at 10:18:00AM +0200, Rasmus Villemoes wrote:
>> On 25/10/2023 09.18, kernel test robot wrote:


>>> Hello,
>>>
>>> kernel test robot noticed a 3.7% improvement of will-it-scale.per_thread_ops on:

>> So with that, can we please just finally say "yeah, let's make the
>> generic bitmap library functions correct

> They are all correct already.

>> and usable in more cases"

> See below.

>> instead of worrying about random micro-benchmarks that just show
>> you-win-some-you-lose-some.

> That I agree with. I don't worry about either the +2% or the -3%
> benchmark, and don't think that they alone can justify such a radical
> change as making all find_bit functions volatile and shutting down a
> newborn KCSAN.
>
> Keeping that in mind, my best guess is that Jan's and Mirsad's test
> that shows +2% was against stable bitmaps, and what the robot measured
> is most likely against heavily concurrent access to some bitmap in
> the kernel.
>
> I didn't look at either test's sources, but that at least makes some
> sense, because if GCC optimizes code against properly described
> memory correctly, this is exactly what we can expect.

>> Yes, users will have to treat results from the find routines carefully
>> if their bitmap may be concurrently modified. They do. Nobody wins if
>> those users are forced to implement their own bitmap routines for their
>> lockless algorithms.

> Again, I agree with this point, and I'm trying to address exactly this.
>
> I'm working on a series that introduces lockless find_bit functions
> based on the existing FIND_BIT() engine. It's not ready yet, but I hope
> I'll submit it in the next merge window.
>
> https://github.com/norov/linux/commits/find_and_bit
>
> Now that we've got a test that presumably works faster if find_bit()
> functions are all switched to be volatile, it would be great if we
> could get into the details and understand:
> - what find_bit function or functions give that gain in performance;
> - on what bitmap(s);

I am positive that your per-bit test_and_clear_bit() loop wouldn't improve performance. No offense, but for each bit it has to:

- LOCK the bus
- READ the byte/word/longword/quadword from memory
- TEST the bit and remember the result in FLAGS
- CLEAR the bit
- WRITE back the entire byte/word/longword/quadword

So, instead of speeding things up, you'd end up doing 64 of these locked read-modify-write cycles in the worst case, where only the last bit in the quadword is set.
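
For illustration, a minimal sketch of the kind of per-bit loop I mean (the helper and callback names here are made up, just to show the shape of it):

#include <linux/bitops.h>

/* Hypothetical helper: drain a bitmap one bit at a time.
 * Every test_and_clear_bit() is a LOCKed read-modify-write,
 * so a fully set 64-bit word costs 64 bus-locked cycles. */
static void drain_bitmap_per_bit(unsigned long *bitmap, unsigned int nbits,
				 void (*handle)(unsigned int bit))
{
	unsigned int bit;

	for (bit = 0; bit < nbits; bit++)
		if (test_and_clear_bit(bit, bitmap))	/* LOCK BTR on x86 */
			handle(bit);
}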

What we need is an ffs() that expands to the assembly instruction BSF
(bit scan forward)

[1] https://www.felixcloutier.com/x86/bsf

followed by a BTR (bit test and reset)

[2] https://www.felixcloutier.com/x86/btr
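
A sketch of how I picture combining the two without holding the bus in between (my own illustration, not Yury's series; on x86, find_first_bit() boils down to BSF-style scanning and test_and_clear_bit() to LOCK BTR):

#include <linux/bitops.h>

/* Hypothetical: scan for a set bit, then try to claim it atomically.
 * If another CPU cleared it between the scan and the BTR, rescan. */
static unsigned long claim_first_bit(unsigned long *bitmap, unsigned long nbits)
{
	unsigned long bit;

	do {
		bit = find_first_bit(bitmap, nbits);	/* BSF-style scan */
		if (bit >= nbits)
			return nbits;			/* bitmap is empty */
	} while (!test_and_clear_bit(bit, bitmap));	/* LOCK BTR, may lose the race */

	return bit;
}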

Ideally, we'd have a single instruction that does both atomically, or a way to hold the bus LOCK across two instructions. But bad guys could use the latter to stall all cores indefinitely:

DON'T DO THIS!
-----------------------------
LOCK
loop:
BSF r16, m16/m32/m64
BTR m16/m32/m64, r16
JMP loop
UNLOCK
-----------------------------

This would work better with a hardware-assisted CAM locking device than by stopping all cores from reading and writing on each memory barrier.

But this is a long story.
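
The usual lockless workaround is a compare-and-swap retry loop rather than holding the bus across two instructions. A rough single-word sketch of that idea (my assumption of how such a helper could look, not the actual find_and_bit implementation):

#include <linux/atomic.h>
#include <linux/bitops.h>

/* Hypothetical: atomically find and clear the lowest set bit of one word.
 * The LOCK is held only for the single CMPXCHG; on contention we retry. */
static unsigned long word_find_and_clear(unsigned long *word)
{
	unsigned long old, new;

	do {
		old = READ_ONCE(*word);
		if (!old)
			return BITS_PER_LONG;	/* nothing to claim */
		new = old & (old - 1);		/* clear the lowest set bit */
	} while (cmpxchg(word, old, new) != old);

	return __ffs(old);			/* index of the bit we cleared */
}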

> - is the reason in concurrent memory access (guess yes), and if so,
> - can we refactor the code to use the lockless find_and_bit() functions
>   mentioned above;
> - if not, how else can we address this.
>
> If you or someone else has an extra time slot to get deeper into
> that, I'll be really thankful.

I don't know whether the Linux kernel takes any advantage of a transactional memory device if it finds one.

The idea of each mutex() or even user-space futex() stalling all cores with an "asm LOCK CMPXCHG m8/m16/m32/m64, r8/r16/r32/r64" simply doesn't scale well with, say, 128 cores that have to wait for one long read-modify-write cycle that bypasses the cache and talks to very slow memory.

I see some progress with per-CPU variables.

[3] https://stackoverflow.com/questions/58664496/locking-on-per-cpu-variable/67997961#67997961

On a multiprocessor system, to protect per-CPU variables we can just disable preemption and do local_irq_save(). This way we avoid taking a spinlock. A spinlock requires atomicity across multiple CPUs; with per-CPU variables that is not required.

local_irq_save(flags);
preempt_disable();
-- Modify the percpu variable
preempt_enable();
local_irq_restore(flags);

That compiles roughly to (on x86):

unsigned long flags;

/* local_irq_save(flags) */
asm volatile("pushf ; pop %0" : "=rm" (flags) : : "memory");
asm volatile("cli" : : : "memory");

/* preempt_disable() */
preempt_count_inc();
__asm__ __volatile__("" : : : "memory"); /* barrier() */

--- your stuff here ---

/* preempt_enable() */
__asm__ __volatile__("" : : : "memory"); /* barrier() */
preempt_count_dec();

/* local_irq_restore(flags) */
if (flags & X86_EFLAGS_IF)
        asm volatile("sti" : : : "memory");

With this code the other CPUs can work in parallel; none of them spins.

If we take a spinlock instead, we modify an atomic variable, and any other CPU that comes along has to wait/spin for as long as the lock is held.
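
For concreteness, a minimal sketch of that pattern with a made-up per-CPU counter (the variable and function names here are mine, purely for illustration):

#include <linux/percpu.h>
#include <linux/preempt.h>
#include <linux/irqflags.h>

/* Hypothetical per-CPU counter bumped without any spinlock. */
static DEFINE_PER_CPU(unsigned long, hit_count);

static void bump_hit_count(void)
{
	unsigned long flags;

	local_irq_save(flags);		/* keep interrupt handlers away */
	preempt_disable();		/* stay on this CPU */
	__this_cpu_inc(hit_count);	/* plain local increment, no LOCK prefix */
	preempt_enable();
	local_irq_restore(flags);
}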

Best regards,
Mirsad


> Thanks,
> Yury