Re: [RFT/PATCH v2 2/6] x86-64: Optimize vread_tsc's barriers

From: Linus Torvalds
Date: Thu Apr 07 2011 - 13:21:28 EST

Next message: Dave Hansen: "[PATCH 1/2] rename alloc_pages_exact()"
Previous message: Dave Hansen: "[PATCH 2/2] make new alloc_pages_exact()"
In reply to: Andi Kleen: "Re: [RFT/PATCH v2 2/6] x86-64: Optimize vread_tsc's barriers"
Next in thread: Andi Kleen: "Re: [RFT/PATCH v2 2/6] x86-64: Optimize vread_tsc's barriers"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, Apr 7, 2011 at 9:42 AM, Andi Kleen <andi@xxxxxxxxxxxxxx> wrote:
>
> I'm sure a single barrier would have fixed the testers, as you point out,
> but the goal wasn't to only fix testers.

You missed two points:

- first off, do we have any reason to believe that the rdtsc would
migrate down _anyway_? As AndyL says, both Intel and AMD seem to
document only the "[lm]fence + rdtsc" thing with a single fence
instruction before.

Instruction scheduling isn't some kind of theoretical game. It's a
very practical issue, and CPU schedulers are constrained to do a good
job quickly and _effectively_. In other words, instructions don't just
schedule randomly. In the presence of the barrier, is there any reason
to believe that the rdtsc would really schedule oddly? There is never
any reason to _delay_ an rdtsc (it can take no cache misses or wait on
any other resources), so when it is not able to move up, where would
it move?

IOW, it's all about "practical vs theoretical". Sure, in theory the
rdtsc could move down arbitrarily. In _practice_, the caller is going
to use the result (and if it doesn't, the value doesn't matter), and
thus the CPU will have data dependencies etc that constrain
scheduling. But more practically, there's no reason to delay
scheduling, because an rdtsc isn't going to be waiting for any
interesting resources, and it's also not going to be holding up any
more important resources (iow, sure, you'd like to schedule subsequent
loads early, but those won't be fighting over the same resource with
the rdtsc anyway, so I don't see any reason that would delay the rdtsc
and move it down).

So I suspect that one lfence (before) is basically the _same_ as the
current two lfences (around). Now, I can't guarantee it, but in the
absence of numbers to the contrary, there really isn't much reason to
believe otherwise. Especially considering the Intel/AMD
_documentation_. So we should at least try it, I think.

- the reason "back-to-back" (with the extreme example being in a
tight loop) matters is that if something isn't in a tight loop, any
jitter we see in the time counting wouldn't be visible anyway. One
random timestamp is meaningless on its own. It's only when you have
multiple ones that you can compare them. No?

So _before_ we try some really clever data dependency trick with new
inline asm and magic "double shifts to create a zero" things, I really
would suggest just trying to remove the second lfence entirely and see
how that works. Maybe it doesn't work, but ...
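
A minimal userspace sketch of the experiment suggested above, assuming an
x86-64 build with GCC or Clang. The names tsc_fence_before, tsc_fence_around,
tsc_stats and sample_back_to_back are hypothetical and are not the kernel's
vread_tsc(); the non-x86 branch is only an assumption so that the sketch
still builds elsewhere. It reads the counter back-to-back in a tight loop and
counts how often a later read is smaller than the one before it:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Hypothetical type: any function that returns a timestamp. */
typedef uint64_t (*tsc_reader_t)(void);

#if defined(__x86_64__)
/* One lfence before rdtsc: the documented "[lm]fence + rdtsc" form. */
static uint64_t tsc_fence_before(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("lfence\n\trdtsc"
                         : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* lfence on both sides of rdtsc: the two-barrier arrangement. */
static uint64_t tsc_fence_around(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("lfence\n\trdtsc\n\tlfence"
                         : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
}
#else
/* Assumption: fallback for non-x86 builds, monotonic clock in ns. */
static uint64_t clock_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
static uint64_t tsc_fence_before(void) { return clock_ns(); }
static uint64_t tsc_fence_around(void) { return clock_ns(); }
#endif

struct tsc_stats {
    long backward;      /* reads that were smaller than the previous read */
    uint64_t min_delta; /* smallest non-negative step between two reads */
};

/* Take n timestamps back-to-back and compare each with its predecessor. */
static struct tsc_stats sample_back_to_back(tsc_reader_t read, long n)
{
    struct tsc_stats s = { 0, UINT64_MAX };
    uint64_t prev = read();

    for (long i = 1; i < n; i++) {
        uint64_t now = read();

        if (now < prev)
            s.backward++;
        else if (now - prev < s.min_delta)
            s.min_delta = now - prev;
        prev = now;
    }
    return s;
}
```

Calling sample_back_to_back() once with each reader gives a backward count
and a minimum step per variant. Those are the kind of numbers that could
show whether dropping the second lfence changes anything. A userspace loop
can also see backward steps for unrelated reasons, for example if the thread
migrates between CPUs whose TSCs are not synchronized, so pinning the thread
would likely be needed before reading much into the result.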

Linus
