Re: [ltt-dev] cli/sti vs local_cmpxchg and local_add_return

From: Mathieu Desnoyers
Date: Wed Mar 18 2009 - 11:10:45 EST


* Nick Piggin (nickpiggin@xxxxxxxxxxxx) wrote:
> On Wednesday 18 March 2009 02:14:37 Mathieu Desnoyers wrote:
> > * Nick Piggin (nickpiggin@xxxxxxxxxxxx) wrote:
> > > On Tuesday 17 March 2009 12:32:20 Mathieu Desnoyers wrote:
> > > > Hi,
> > > >
> > > > I am trying to get access to some non-x86 hardware to run some atomic
> > > > primitive benchmarks for a paper on LTTng I am preparing. That should
> > > > be useful to argue about performance benefit of per-cpu atomic
> > > > operations vs interrupt disabling. I would like to run the following
> > > > benchmark module on CONFIG_SMP :
> > > >
> > > > - PowerPC
> > > > - MIPS
> > > > - ia64
> > > > - alpha
> > > >
> > > > usage :
> > > > make
> > > > insmod test-cmpxchg-nolock.ko
> > > > insmod: error inserting 'test-cmpxchg-nolock.ko': -1 Resource
> > > > temporarily unavailable
> > > > dmesg (see dmesg output)
> > > >
> > > > If some of you would be kind enough to run my test module provided
> > > > below and provide the results of these tests on a recent kernel
> > > > (2.6.26~2.6.29 should be good) along with their cpuinfo, I would
> > > > greatly appreciate.
> > > >
> > > > Here are the CAS results for various Intel-based architectures :
> > > >
> > > > Architecture        | Speedup                     | CAS          | Interrupts
> > > >                     | (cli + sti) / local cmpxchg | local | sync | Enable (sti) | Disable (cli)
> > > > ---------------------------------------------------------------------------------------------
> > > > Intel Pentium 4     | 5.24                        | 25    | 81   | 70           | 61
> > > > AMD Athlon(tm)64 X2 | 4.57                        | 7     | 17   | 17           | 15
> > > > Intel Core2         | 6.33                        | 6     | 30   | 20           | 18
> > > > Intel Xeon E5405    | 5.25                        | 8     | 24   | 20           | 22
> > > >
> > > > The benefit expected on PowerPC, ia64 and alpha should principally come
> > > > from removed memory barriers in the local primitives.
> > >
> > > Benefit versus what? I think all of those architectures can do SMP
> > > atomic compare exchange sequences without barriers, can't they?
> >
> > Hi Nick,
> >
> > I want to compare if it is faster to use SMP cas without barriers to
> > perform synchronization of the tracing hot path wrt interrupts or if it
> > is faster to disable interrupts. These decisions will depend on the
> > benchmark I propose, because it is comparing the time it takes to
> > perform both.
> >
> > Overall, the benchmarks will allow to choose between those two
> > simplified hotpath pseudo-codes (offset is global to the buffer,
> > commit_count is per-subbuffer).
> >
> >
> > * lockless :
> >
> > do {
> > 	old_offset = local_read(&offset);
> > 	get_cycles();
> > 	compute needed size.
> > 	new_offset = old_offset + size;
> > } while (local_cmpxchg(&offset, old_offset, new_offset) != old_offset);
> >
> > /*
> > * note : writing to buffer is done out-of-order wrt buffer slot
> > * physical order.
> > */
> > write_to_buffer(offset);
> >
> > /*
> > * Make sure the data is written in the buffer before commit count is
> > * incremented.
> > */
> > smp_wmb();
> >
> > /* note : incrementing the commit count is also done out-of-order */
> > count = local_add_return(size, &commit_count[subbuf_index]);
> > if (count is filling a subbuffer)
> > 	allow to wake up readers
>
> Ah OK, so you just mean the benefit of using local atomics is avoiding
> the barriers that you get with atomic_t.
>
> I'd thought you were referring to some benefit over irq disable pattern.
>

On powerpc and mips, for instance, yes, the gain is just the removed
barriers. On x86 it becomes more interesting because we can also drop the
lock; prefix, which gives a good speedup. All I want to do here is
figure out which of barrier-less local_t ops vs. disabling interrupts is
faster (and by how much) on various architectures.
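
As a rough user-space illustration of the lockless path quoted above (this
is not the LTTng code), the reserve/commit sequence can be sketched with C11
atomics. Relaxed atomics stand in for the barrier-less local_t ops, and a
release fence stands in for smp_wmb(). The names struct trace_buf,
reserve_slot() and write_and_commit(), the buffer sizes, and the assumption
that a record never straddles a subbuffer boundary are all hypothetical,
chosen only to keep the sketch short.

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

#define BUF_SIZE    4096UL
#define SUBBUF_SIZE 1024UL
#define NR_SUBBUF   (BUF_SIZE / SUBBUF_SIZE)

/* Hypothetical buffer layout: one global offset, one commit count per subbuffer. */
struct trace_buf {
	atomic_ulong offset;
	atomic_ulong commit_count[NR_SUBBUF];
	char data[BUF_SIZE];
};

/* Reserve 'size' bytes with a CAS retry loop; returns the start of the slot. */
static unsigned long reserve_slot(struct trace_buf *b, unsigned long size)
{
	unsigned long old_offset, new_offset;

	old_offset = atomic_load_explicit(&b->offset, memory_order_relaxed);
	do {
		new_offset = old_offset + size;
		/* On failure, old_offset is refreshed with the current value. */
	} while (!atomic_compare_exchange_weak_explicit(&b->offset,
			&old_offset, new_offset,
			memory_order_relaxed, memory_order_relaxed));
	return old_offset;
}

/* Copy the record, then publish it; returns 1 once the subbuffer is full. */
static int write_and_commit(struct trace_buf *b, unsigned long slot,
			    const void *rec, unsigned long size)
{
	unsigned long pos = slot % BUF_SIZE;
	unsigned long idx = pos / SUBBUF_SIZE;
	unsigned long count;

	/* Simplifying assumption: a record stays inside one subbuffer. */
	assert(pos % SUBBUF_SIZE + size <= SUBBUF_SIZE);

	memcpy(&b->data[pos], rec, size);
	/* Order the data stores before the commit count update. */
	atomic_thread_fence(memory_order_release);
	count = atomic_fetch_add_explicit(&b->commit_count[idx], size,
					  memory_order_relaxed) + size;
	return count % SUBBUF_SIZE == 0;
}
```

Two writers that reserve 512 bytes each get offsets 0 and 512, and they may
commit in either order; whichever commit brings the count to SUBBUF_SIZE is
the one that would wake up readers.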

For instance, on an architecture like powerpc64 (test results provided by
Paul McKenney), there is a difference of less than 4 cycles between irq
off/irq on (14-16 cycles, and this is without doing the data access) and
doing both local_cmpxchg and local_add_return (18 cycles). So given that we
might have tracepoints called from NMI context, the tiny performance
cost of the local_t ops does not outweigh the benefit of having a
lockless, NMI-safe trace buffer management algorithm.
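
For comparison with the x86 numbers quoted earlier, the "Speedup" column
there is simply (cli + sti) / local cmpxchg, in cycles. A small sketch of
that arithmetic follows; the struct and function names are made up for the
example, and the cycle counts are the ones from the quoted table.

```c
#include <stdio.h>

/* One row of the quoted x86 table (cycle counts). */
struct sample {
	const char *arch;
	unsigned int local_cas;	/* local cmpxchg */
	unsigned int sti;	/* interrupt enable */
	unsigned int cli;	/* interrupt disable */
};

static const struct sample samples[] = {
	{ "Intel Pentium 4",     25, 70, 61 },
	{ "AMD Athlon(tm)64 X2",  7, 17, 15 },
	{ "Intel Core2",          6, 20, 18 },
	{ "Intel Xeon E5405",     8, 20, 22 },
};

/* Cost of an irq off/on pair relative to one local cmpxchg. */
static double irq_vs_local_speedup(const struct sample *s)
{
	return (double)(s->cli + s->sti) / s->local_cas;
}

/* Print one line per architecture with the recomputed ratio. */
static void print_speedups(void)
{
	size_t i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		printf("%-20s %.2f\n", samples[i].arch,
		       irq_vs_local_speedup(&samples[i]));
}
```

For the Pentium 4 row this is (61 + 70) / 25 = 5.24, matching the table.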

Thanks,

Mathieu

>
> > * irq off :
> >
> > (note : offset and commit count would each be written to atomically
> > (type unsigned long))
> >
> > local_irq_save(flags);
> >
> > get_cycles();
> > compute needed size;
> > offset += size;
> >
> > write_to_buffer(offset);
> >
> > /*
> > * Make sure the data is written in the buffer before commit count is
> > * incremented.
> > */
> > smp_wmb();
> >
> > commit_count[subbuf_index] += size;
> > if (count is filling a subbuffer)
> > 	allow to wake up readers
> >
> > local_irq_restore(flags);
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68