Re: [PATCH v7 1/4] spinlock: A new lockref structure for locklessupdate of refcount

From: Linus Torvalds
Date: Sun Sep 01 2013 - 20:50:24 EST


On Sun, Sep 1, 2013 at 5:12 PM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> It *is* one of the few locked accesses remaining, and it's clearly
> getting called a lot (three calls per system call: two mntput's - one
> for the root path, one for the result path, and one from path_init ->
> rcu_walk_init), but with up to 8% CPU time for basically that one
> "lock xadd" instruction is damn odd. I can't see how that could happen
> without seriously nasty cacheline bouncing, but I can't see how *that*
> can happen when all the accesses seem to be from the current CPU.

So, I wanted to double-check that "it can only be that expensive if
there's cacheline bouncing" statement. Thinking "maybe it's just
really expensive. Even when running just a single thread".

So I set MAX_THREADS to 1 in my stupid benchmark, just to see what happens..

And almost everything changes as expected: now we don't have any
cacheline bouncing any more, so lockref_put_or_lock() and
lockref_get_or_lock() no longer dominate - instead of being 20%+ each,
they are now just 3%.

What _didn't_ change? Right. lg_local_lock() is still 6.40%. Even when
single-threaded. It's now the #1 function in my profile:

6.40% lg_local_lock
5.42% copy_user_enhanced_fast_string
5.14% sysret_check
4.79% link_path_walk
4.41% 0x00007ff861834ee3
4.33% avc_has_perm_flags
4.19% __lookup_mnt
3.83% lookup_fast

(that "copy_user_enhanced_fast_string" is when we copy the "struct
stat" from kernel space to user space)

The instruction-level profile just looking like

â ffffffff81078e70 <lg_local_lock>:
2.06 â push %rbp
1.06 â mov %rsp,%rbp
0.11 â mov (%rdi),%rdx
2.13 â add %gs:0xcd48,%rdx
0.92 â mov $0x100,%eax
85.87 â lock xadd %ax,(%rdx)
0.04 â movzbl %ah,%ecx
â cmp %al,%cl
3.60 â â je 31
â nop
â28: pause
â movzbl (%rdx),%eax
â cmp %cl,%al
â â jne 28
â31: pop %rbp
4.22 â â retq

so that instruction sequence is just expensive, and it is expensive
without any cacheline bouncing. The expense seems to be 100% simply
due to the fact that it's an atomic serializing instruction, and it
just gets called way too much.

So lockref_[get|put]_or_lock() are each called once per pathname
lookup (because the RCU accesses to the dentries get turned into a
refcount, and then that refcount gets dropped). But lg_local_lock()
gets called twice: once for path_init(), and once for mntput() - I
think I was wrong about mntput getting called twice.

So it doesn't seem to be cacheline bouncing at all. It's just
"serializing instructions are really expensive" together with calling
that function too much. And we've optimized pathname lookup so much
that even a single locked instruction shows up like a sort thumb.

I guess we should be proud.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/