Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

From: Linus Torvalds (torvalds@transmeta.com)
Date: Fri Mar 08 2002 - 21:12:14 EST


On Fri, 8 Mar 2002, Hubertus Franke wrote:
> >
> > The point being that the difference between a "decl" and a "lock ; decl"
> > is about 1:12 or so in performance.
>
> I am no expert in architecture, but if its done through the cache coherency
> mechanism, the overhead shouldn't be 12:1. You simply mark the cache line as
> part of you instruction to avoid a cache line transfer. How can that be 12
> times slower. .. Ready to be educated....

A lock in a SMP system also needs to synchronize the instruction stream,
and not let stores move "out" from the locked region.

On a UP system, this all happens automatically (well, getting it to happen
right is obviously one of the big issues in an out-of-order CPU core, but
it's a very fundamental part of the core, so it's "free" in the sense that
if it isn't done, the CPU simply doesn't work).

On SMP, it's a memory barrier. This is why a "lock ; decl" is more
expensive than a "decl" - it's the implied memory ordering constraints (on
other architectures they are explicit). On an intel CPU, this basically
means that the pipeline is drained, so a locked instruction takes roughly
12 cycles on a PPro core (AMD's K7 core seems to be rather more graceful
about this one). I haven't timed a P4 lately, I think it's worse.

Other architectures do the memory ordering explicitly, and some are
better, some are worse. But it always costs you _something_.

                Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Fri Mar 15 2002 - 22:00:11 EST