Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

From: Linus Torvalds
Date: Wed Feb 13 2013 - 13:30:27 EST


On Wed, Feb 13, 2013 at 8:20 AM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> Adding an external function call is *horrible*, and you might almost
> as well just uninline the spinlock entirely if you do this. It means
> that all the small callers now have their registers trashed, whether
> the unlikely function call is taken or not, and now leaf functions
> aren't leaves any more.

Btw, we've had things like this before, and I wonder if we could
perhaps introduce the notion of a "light-weight call" for fastpath
code that calls unlikely slow-path code..

In particular, see the out-of-line code used by the rwlocks etc (see
"arch_read_lock()" for an example in arch/x86/include/asm/spinlock.h
and arch/x86/lib/rwlock.S), where we end up calling things from inline
asm, with one big reason being exactly the fact that a "normal" C call
has such horribly detrimental effects on the caller.

Sadly, gcc doesn't seem to allow specifying which registers a call
clobbers in any easy way, which means that the caller and the callee
*both* tend to need to have some asm interface. So we bothered to do
this for __read_lock_failed, but we have *not* bothered to do the same
for the otherwise very similar __mutex_fastpath_lock() case, for
example.

So for rwlocks, we actually get very nice code generation with small
leaf functions not necessarily needing stack frames, but for mutexes
we mark a lot of registers "unnecessarily" clobbered in the caller,
exactly because we do *not* do that asm interface for the callee. So
we have to clobber all the standard callee-clobbered registers, which
is really sad, and callers almost always need a stack frame, because
if they have any data live at all across the mutex, they have to save
it in some register that is callee-saved - which basically means that
the function has to have that stack frame in order to save its *own*
callee-saved registers.
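
To make the register-pressure argument above concrete, here is a minimal
user-space sketch in C. It is not the kernel's mutex code: toy_mutex,
toy_mutex_init(), toy_lock(), toy_lock_slowpath(), toy_unlock() and
add_locked() are hypothetical names made up for illustration, and the slow
path is a stand-in that simply takes the lock instead of sleeping or
spinning. It assumes gcc or clang (for the noinline/cold attributes and
__builtin_expect) and C11 atomics.

```c
#include <stdatomic.h>

/* Hypothetical toy lock for illustration only, not the kernel's mutex. */
struct toy_mutex {
    atomic_int count;   /* 1 = unlocked, 0 = locked, < 0 = contended */
    int slow_calls;     /* how many times the slow path ran (demo only) */
};

static inline void toy_mutex_init(struct toy_mutex *m)
{
    atomic_init(&m->count, 1);
    m->slow_calls = 0;
}

/*
 * Cold slow path, reached through an ordinary C call. Because it follows
 * the normal calling convention, the caller must assume that every
 * caller-saved register is clobbered by it.
 * Stand-in behaviour: a real slow path would wait for the owner; this toy
 * just records the call and marks the lock as held.
 */
__attribute__((noinline, cold))
static void toy_lock_slowpath(struct toy_mutex *m)
{
    m->slow_calls++;
    atomic_store(&m->count, 0);
}

/* Inline fast path: a 1 -> 0 transition means the lock was uncontended. */
static inline void toy_lock(struct toy_mutex *m)
{
    if (__builtin_expect(atomic_fetch_sub(&m->count, 1) != 1, 0))
        toy_lock_slowpath(m);   /* unlikely, but still a normal C call */
}

static inline void toy_unlock(struct toy_mutex *m)
{
    atomic_store(&m->count, 1);
}

/*
 * A small function that would otherwise be a leaf. 'm', 'shared' and
 * 'delta' are all live across the possible slow-path call, so the compiler
 * typically has to keep them in callee-saved registers (or spill them),
 * and saving those registers costs a stack frame even when the slow path
 * never runs.
 */
int add_locked(struct toy_mutex *m, int *shared, int delta)
{
    int result;

    toy_lock(m);
    result = (*shared += delta);
    toy_unlock(m);
    return result;
}
```

Building this with something like "gcc -O2 -S" and reading the code
generated for add_locked() is one way to look for the extra register
saves on x86-64; the exact output depends on the compiler version and
flags, so treat it as an experiment rather than a guaranteed result.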

So it means that we penalize the fastpath because the slow-path can't
be bothered to do the extra register saving, unless we go to the
lengths we went to for the rwlocks, and build a wrapper in asm to save
the extra registers in the cold path.

Maybe we could introduce some helpers to create these kinds of asm
wrappers to do this? Something that would allow us to say: "this
function only clobbers a minimal set of registers and you can call it
from asm and only mark %rax/rcx/rdx clobbered" and that allows leaf
functions to look like leaf functions for the fastpath?

Hmm? That would make my dislike of uninlining the slow case largely go
away. I still think that back-off tends to be a mistake (and is often
horrible for virtualization etc), but as long as the fastpath stays
close to optimal, I don't care *too* much.

Linus