Re: [PATCH tip/core/urgent 3/7] rcu: Streamline code produced by __rcu_read_unlock()

From: Paul E. McKenney
Date: Thu Jul 21 2011 - 01:09:52 EST

On Wed, Jul 20, 2011 at 03:44:55PM -0700, Linus Torvalds wrote:
> On Wed, Jul 20, 2011 at 11:26 AM, Paul E. McKenney
> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> > Given some common flag combinations, particularly -Os, gcc will inline
> > rcu_read_unlock_special() despite its being in an unlikely() clause.
> > Use noinline to prohibit this misoptimization.
>
> Btw, I suspect that we should at least look at what it would mean if
> we make the rcu_read_lock_nesting and the preempt counters both be
> per-cpu variables instead of making them per-thread/process counters.
> Then, when we switch threads, we'd just save/restore them from the
> process register save area.
>
> There's a lot of critical code sequences (spin-lock/unlock, rcu
> read-lock/unlock) that currently fetch the thread/process pointer
> only to then offset it and increment the count. I get the strong
> feeling that code generation could be improved and we could avoid one
> level of indirection by just making it a per-cpu counter.
>
> For example, instead of __rcu_read_lock() looking like this (and being
> an external function, partly because of header file dependencies on
> the data structures involved):
>
>     push   %rbp
>     mov    %rsp,%rbp
>     mov    %gs:0xb580,%rax
>     incl   0x100(%rax)
>     leaveq
>     retq
>
> it should inline to just something like
>
>     incl   %gs:0x100
>
> instead. Same for the preempt counter.
>
> Of course, it would need to involve making sure that we pick a good
> cacheline etc that is already always dirty. But other than that, is
> there any real downside?

We would need a form of per-CPU variable access that generated
efficient code, but that didn't complain about being used when
preemption was enabled. __this_cpu_add_4() might do the trick,
but I haven't dug fully through it yet.
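
For concreteness, here is a minimal userspace sketch of the save/restore
idea from the quoted text: the nesting count lives in a per-CPU slot and
is copied to and from the task at context-switch time. Everything in it
(struct task, struct cpu, the model_*() functions) is hypothetical and
is not a kernel API. It models only the bookkeeping, not the code
generation and not the preemption-checking question raised above.

```c
#include <assert.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct task {
	int saved_rcu_nesting;	/* stands in for a slot in the task's save area */
};

struct cpu {
	int rcu_nesting;	/* the per-CPU counter (a %gs-relative incl on x86) */
	struct task *curr;	/* task currently running on this CPU */
};

/* Read-side entry and exit touch only the per-CPU counter. */
static inline void model_rcu_read_lock(struct cpu *c)
{
	c->rcu_nesting++;
}

static inline void model_rcu_read_unlock(struct cpu *c)
{
	c->rcu_nesting--;
}

/* On a switch, park the outgoing task's count and load the incoming one. */
static void model_context_switch(struct cpu *c, struct task *next)
{
	c->curr->saved_rcu_nesting = c->rcu_nesting;
	c->rcu_nesting = next->saved_rcu_nesting;
	c->curr = next;
}

/* Walk two tasks through nested read-side sections on one CPU. */
static int model_demo(void)
{
	struct task a = { 0 }, b = { 0 };
	struct cpu cpu0 = { .rcu_nesting = 0, .curr = &a };

	model_rcu_read_lock(&cpu0);		/* a: nesting 1 */
	model_context_switch(&cpu0, &b);	/* a preempted mid-section */
	assert(cpu0.rcu_nesting == 0);		/* b starts outside any section */

	model_rcu_read_lock(&cpu0);
	model_rcu_read_lock(&cpu0);		/* b: nesting 2 */
	model_context_switch(&cpu0, &a);
	assert(cpu0.rcu_nesting == 1);		/* a's count came back intact */

	model_rcu_read_unlock(&cpu0);		/* a: nesting 0 */
	model_context_switch(&cpu0, &b);
	return cpu0.rcu_nesting;		/* b's restored count */
}
```

Calling model_demo() walks task a and task b through interleaved
read-side sections and returns b's restored nesting count. The point of
the layout is that the lock/unlock fast path never dereferences the task
pointer; only the context switch does.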

Thanx, Paul
