Re: Memory corruption due to word sharing

From: Linus Torvalds
Date: Fri Feb 03 2012 - 15:00:24 EST

Next message: Corey Minyard: "Re: [PATCH 3/6] ipmi: use a tasklet for handling received messages"
Previous message: Stephen Warren: "RE: [PATCH 0/3] arm: tegra: dts: USB Updates for Tegra legacy mode/ OTG"
In reply to: Andrew MacLeod: "Re: Memory corruption due to word sharing"
Next in thread: Paul E. McKenney: "Re: Memory corruption due to word sharing"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Fri, Feb 3, 2012 at 11:16 AM, Andrew MacLeod <amacleod@xxxxxxxxxx> wrote:
>> The special cases are because older x86 cannot do the generic
>> "add_return" efficiently - it needs xadd - but can do atomic versions
>> that test the end result and give zero or sign information.
>
> Since these are older x86 only, could you use add_return() always and then
> have the compiler use new peephole optimizations to detect those usage
> patterns and change the instruction sequence for x86 when required? would
> that be acceptable? Or maybe you don't trust the compiler :-) Or maybe
> I can innocently ask if the performance impact on older x86 matters enough
> any more? :-)

Since xadd is often slow even when it does exist, the peepholes would
definitely be required. A "dec_and_test" does need to be

decl %m (or "subl $1" for x86-64)
je

rather than

xadd %r,%m
cmp $1,%r
je

simply because the latter is not just larger, but often quite a bit
slower (newer CPU's seem to be better at xadd).

But with the peepholes, by the time we could depend on gcc doing the
intrinsics, we'd almost certainly be ready to let the old pre-xadd
machines just go.

We're already almost there, and by the time we can trust the compiler
to do it for us, we're almost certainly going to be at a point where
we really don't care any more about pre-xadd (but we will still care
about older machines where xadd is slow).

>> - atomic_add_unless() - basically an optimized cmpxchg.
>
> is this the reverse of a compare_exchange and add? Ie, add if the value
> ISN'T expected? or some form of compare_exchange_and_add? This might
> require a new atomic builltin.
>
> What exactly does it do?

It's basically

do {
load-link %r,%m
if (r == value)
return 0;
add
} while (store-conditional %r,%m)
return 1;

and it is used to implement two *very* common (and critical)
reference-counting use cases:

- decrement ref-count locklessly, and if it hits free, take a lock
atomically (ie "it would be a bug for anybody to ever see it decrement
to zero while holding the lock")

- increment ref-count locklessly, but if we race with the final free,
don't touch it, because it's gone (ie "if somebody decremented it to
zero while holding the lock, we absolutely cannot increment it again")

They may sound odd, but those two operations are absolutely critical
for most lock-less refcount management. And taking locks for reference
counting is absolutely death to performance, and is often impossible
anyway (the lock may be a local lock that is *inside* the structure
that is being reference-counted, so if the refcount is zero, then you
cannot even look at the lock!)

In the above example, the load-locked -> store-conditional would
obviously be a cmpxchg loop instead on architectures that do cmpxchg
instead of ll/sc.

Of course, it you expose some intrinsic for the whole "ll/sc" model
(and you then turn it into cmpxchg on demand), we could literally
open-code it.

That is likely the most *flexible* approach for a compiler. I think
pretty much everything the kernel needs (except for cmpxchg_double)
can be very naturally written as a "ll/sc" sequence, and if the
compiler then just does the right thing with peephole optimizations,
that's fine.

IOW, we don't really need "atomic_add()" or anything like that. If we can do

do {
val = __load_linked(mem);
val++;
} while (__store_conditional(val, mem));

and the compiler just automagically turns that into "lock add" on x86,
that's perfectly fine.

It might not be too hard, because you really don't need to recognize
very many patterns, and any pattern you don't recognize could be
turned into a cmpxchg loop.

NOTE NOTE NOTE! The "turned into a cmpxchg loop" is not the true
correct translation of load-linked/store-conditional, since it allows
the memory to be modified as long as it's modified *back* before the
store-conditional, and that actually matters for a few algorithms. But
if you document the fact that it's not a "true" ll/sc (and maybe have
some compile-time way to detect when it isn't), it would be very
flexible.

Of course, the syntax could be something completely different. Maybe
you'd want to do it as

__builtin_ll_sc(&var, update-expression, return-expression,
failure-expression)

rather than an explicit loop.

But it doesn't sound like the internal gcc model is based on some
generic ll/sc model.

>> - atomic bit array operations (bit set, clear, set-and-test,
>> clear-and-test). We do them on "unsigned long" exclusively, and in
>> fact we do them on arrays of unsigned long, ie we have the whole "bts
>> reg,mem" semantics. I'm not sure we really care about the atomic
>> versions for the arrays, so it's possible we only really care about a
>> single long.
>>
>> The only complication with the bit setting is that we have a
>> concept of "set/clear bit with memory barrier before or after the bit"
>> (for locking). We don't do the whole release/acquire thing, though.
>
> Are these functions wrappers to a tight load, mask, cmpxchg loop? or
> something else? These could also require new built-ins if they can't be
> constructed from the existing operations...

Not usually, no. Those things tend to be used for locking or setting
state, so we'd commonly want them to result in

lock orl $128,(%mem)

or for the testing case they would become

lock btsl $0,(%mem)
jne .. failed ..

and a cmpxchg loop would generally be much too expensive.

I realize that people have bad memories of the x86 bit instructions,
but especially in their locked form, the fact that they take a few
extra cycles or decode in only one pipeline etc is *not* relevant.
They are small and "fast", because the true cost tends to be not the
instruction cost, but the locking overhead and the cache effects.

And a "cmpxchg" loop tends to be *very* expensive for two reasons:

- mispredicted backwards jump (most cpu's think it's a loop, so
without prediction information they do the wrong thing)

- load+cmpxchg does the wrong thing for a cache miss (the load will
result in the cacheline being fetched for read)

both of which tend to be totally invisible if you do the obvious
microbenchmark, making people sometimes mistakenly believe that
"cmpxchg" is a good way to implement things.

>> - compare_xchg_double
>>
>> We also do byte/word atomic increments and decrements, but that' sin
>> the x86 spinlock implementation, so it's not a generic need.
>
> The existing __atomic builtins will work on 1,2,4,8 or 16 byte values
> regardless of type, as long as the hardware supports those sizes. so x86-64
> can do a 16 byte cmpxchg.

Ok.

> In theory, the add_fetch and sub_fetch are suppose to use INC/DEC if the
> operand is 1/-1 and the result isn't used. If it isnt doing this right now,
> I will fix it.

Yeah, we definitely want that fixed to inc/dec or regular add/sub.
xadd really isn't good, even outside the "some CPU's don't support
it".

> It may be possible to add modifier extensions to the memory model component
> for such a thing. ie
>
> v = __atomic_add_fetch (&v, __ATOMIC_RELAXED | __ATOMIC_CPU_LOCAL)

Ok, that sounds clean.

> All of the __atomic operations are currently optimization barriers in both
> directions, the optimizers tend to treat them like function calls. I hope
> to enable some sorts of optimizations eventually, especially based on memory
> model... but for now we play it safe.

That's fine. I really don't worry about compiler barriers, and the use
of __ATOMIC_ACQUIRE and __ATOMIC_RELEASE will actually allow us to do
better than we do now, if the compiler just implements them correctly
on all architectures.

Right now we have special (and very ugly) extra conditional
architecture-specific full memory barriers ("smp_mb__after_bitop()"
etc) that are complete hacks, but generate "good enough" code (which
on x86 means "generate nothing", since the atomic bitops are already
barriers on their own).

Having access to __ATOMIC_ACQUIRE would actually be an improvement -
it's just that the architectures that really care about things like
that simply don't matter enough for us to really worry about them
(ia64 being the prime example), and everybody else either doesn't need
extra barriers at all (x86), or might as well use a full barrier (most
everything else).

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Corey Minyard: "Re: [PATCH 3/6] ipmi: use a tasklet for handling received messages"
Previous message: Stephen Warren: "RE: [PATCH 0/3] arm: tegra: dts: USB Updates for Tegra legacy mode/ OTG"
In reply to: Andrew MacLeod: "Re: Memory corruption due to word sharing"
Next in thread: Paul E. McKenney: "Re: Memory corruption due to word sharing"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]