Re: Re-tune x86 uaccess code for PREEMPT_VOLUNTARY

From: Linus Torvalds
Date: Sat Aug 10 2013 - 14:51:46 EST


On Sat, Aug 10, 2013 at 10:18 AM, H. Peter Anvin <hpa@xxxxxxxxx> wrote:
>
> We could then play a really ugly stunt by marking NEED_RESCHED by adding
> 0x7fffffff to the counter. Then the whole sequence becomes something like:
>
> subl $1,%fs:preempt_count
> jno 1f
> call __naked_preempt_schedule /* Or a trap */

This is indeed one of the few cases where we probably *could* use
trapv or something like that in theory, but those instructions tend to
be slow enough that even if you don't take the trap, you'd be better
off just testing by hand.

However, it's worse than you think. Preempt count is per-thread, not
per-cpu. So to access preempt-count, we currently have to look up
thread_info (which is per-cpu or stack-based).
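
For reference, the C side is roughly this (sketching from memory and
condensing the macros a bit, so the exact helper names may differ, but
it's close enough):

    /* sketch - condensed, but roughly what the generic code does today */
    #define preempt_count()  (current_thread_info()->preempt_count)

    #define preempt_enable() do { \
            sub_preempt_count(1); \
            barrier(); \
            if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \
                    preempt_schedule(); \
    } while (0)

and current_thread_info() is what turns into that "%gs:kernel_stack"
load in the generated code below.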

I'd *like* to make preempt-count be per-cpu, and then copy it at
thread switch time, and it's been discussed. But as things are now,
preemption enable is quite expensive, and looks something like

movq %gs:kernel_stack,%rdx #, pfo_ret__
subl $1, -8124(%rdx) #, ti_22->preempt_count
movq %gs:kernel_stack,%rdx #, pfo_ret__
movq -8136(%rdx), %rdx # MEM[(const long unsigned int *)ti_27 + 16B], D.34545
andl $8, %edx #, D.34545
jne .L139 #,

and that's actually the *good* case (ie not counting any extra costs
of turning leaf functions into non-leaf ones).

That "kernel_stack" thing is actually getting the thread_info pointer,
and it doesn't get cached because gcc thinks the preempt_count value
might alias. Sad, sad, sad. We actually used to do better back when we
played tricks with the stack pointer and used a const inline asm to let
gcc know it could re-use the value.
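
Something like this, just to illustrate (a sketch, the helper name is
made up; having no volatile and no "memory" clobber is what lets gcc
CSE the result):

    static inline struct thread_info *ti_from_stack(void)
    {
            unsigned long sp;

            /* not volatile, no clobbers: gcc is free to reuse the value */
            asm ("mov %%rsp, %0" : "=r" (sp));
            return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
    }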

It would be *lovely* if we
(a) made preempt-count per-cpu and just copied it at thread-switch
(b) made the NEED_RESCHED bit be part of preempt-count (rather than
thread flags) and just made it the high bit

and then maybe we could just do

subl $1, %fs:preempt_count
js .L139

with the actual schedule call being done as an

asm volatile("call user_schedule": : :"memory");

that Andi introduced, which doesn't pollute the register space. Note
that you still want the *test* to be done in C code: together with
"unlikely()" you'd get pretty close to optimal code generation, whereas
hiding the decrement, test and conditional jump inside the asm means
you don't get the proper instruction scheduling and branch following
that gcc does.
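
Put together, the whole thing might look roughly like this. Purely a
sketch of the idea - the variable and helper names are made up, and the
real work is in copying the count (and the bit) at context switch and
in getting gcc to actually emit the single sub+js:

    /* sketch only - illustrative names, not real kernel code */
    DECLARE_PER_CPU(int, __preempt_count);

    #define PREEMPT_NEED_RESCHED    0x80000000      /* the high bit */

    /* the scheduler would OR this in instead of setting TIF_NEED_RESCHED;
       __switch_to() would copy the per-cpu count to/from the task */
    static __always_inline void mark_need_resched(void)
    {
            this_cpu_or(__preempt_count, PREEMPT_NEED_RESCHED);
    }

    static __always_inline void preempt_enable_sketch(void)
    {
            /* goes negative only if the high bit is set; the slow path
               still re-checks whether we can actually reschedule */
            if (unlikely(this_cpu_dec_return(__preempt_count) < 0))
                    asm volatile("call user_schedule" : : : "memory");
    }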

I dunno. It looks like a fair amount of effort. But as things are now,
the code generation difference between PREEMPT_NONE and PREEMPT is
actually fairly noticeable. And PREEMPT_VOLUNTARY - which is supposed
to be almost as cheap as PREEMPT_NONE - has lots of bad cases too, as
Andi noticed.

Linus