Re: [RFT/PATCH v2 2/6] x86-64: Optimize vread_tsc's barriers

From: Raghavendra D Prabhu
Date: Thu Apr 07 2011 - 17:43:19 EST


* On Thu, Apr 07, 2011 at 10:20:31AM -0700, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
On Thu, Apr 7, 2011 at 9:42 AM, Andi Kleen <andi@xxxxxxxxxxxxxx> wrote:

I'm sure a single barrier would have fixed the testers, as you point out,
but the goal wasn't to only fix testers.

You missed two points:

- first off, do we have any reason to believe that the rdtsc would
migrate down _anyway_? As AndyL says, both Intel and AMD seem to
document only the "[lm]fence + rdtsc" thing with a single fence
instruction before.

Instruction scheduling isn't some kind of theoretical game. It's a
very practical issue, and CPU schedulers are constrained to do a good
job quickly and _effectively_. In other words, instructions don't just
schedule randomly. In the presence of the barrier, is there any reason
to believe that the rdtsc would really schedule oddly? There is never
any reason to _delay_ an rdtsc (it can take no cache misses or wait on
any other resources), so when it is not able to move up, where would
it move?

IOW, it's all about "practical vs theoretical". Sure, in theory the
rdtsc could move down arbitrarily. In _practice_, the caller is going
to use the result (and if it doesn't, the value doesn't matter), and
thus the CPU will have data dependencies etc that constrain
scheduling. But more practically, there's no reason to delay
scheduling, because an rdtsc isn't going to be waiting for any
interesting resources, and it's also not going to be holding up any
more important resources (iow, sure, you'd like to schedule subsequent
loads early, but those won't be fighting over the same resource with
the rdtsc anyway, so I don't see any reason that would delay the rdtsc
and move it down).

So I suspect that one lfence (before) is basically the _same_ as the
current two lfences (around). Now, I can't guarantee it, but in the
absence of numbers to the contrary, there really isn't much reason to
believe otherwise. Especially considering the Intel/AMD
_documentation_. So we should at least try it, I think.

If only one lfence or serializing instruction is to be used, can't we
just use the RDTSCP instruction (X86_FEATURE_RDTSCP, available only on
i7 and later Intel CPUs and on AMD), which provides the TSC as well as
an upward serializing guarantee? I see that instruction being used
elsewhere in the kernel to obtain the current cpu/node (vgetcpu); is it
not possible to use it in this case as well?
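
For illustration, the two options could look roughly like this in
userspace C. This is only a sketch, not the kernel's vread_tsc: it
assumes GCC or Clang with <x86intrin.h> and <cpuid.h>, the helper names
(tsc_lfence_rdtsc, cpu_has_rdtscp, tsc_rdtscp) are made up here, and the
non-x86-64 branch is just a coarse stand-in clock so the file still
builds elsewhere.

```c
#include <stdint.h>

#if defined(__x86_64__)
#include <cpuid.h>
#include <x86intrin.h>

/* Option 1: a single fence before the read, so RDTSC cannot run early. */
static inline uint64_t tsc_lfence_rdtsc(void)
{
    _mm_lfence();
    return __rdtsc();
}

/* CPUID leaf 0x80000001, EDX bit 27 reports RDTSCP support. */
static inline int cpu_has_rdtscp(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((edx >> 27) & 1u);
}

/*
 * Option 2: RDTSCP waits for earlier instructions to execute and also
 * returns IA32_TSC_AUX (the value vgetcpu reads) through *aux.
 * Only call this when cpu_has_rdtscp() returns 1.
 */
static inline uint64_t tsc_rdtscp(unsigned int *aux)
{
    return __rdtscp(aux);
}

#else /* Assumption: stand-in for other architectures, not a TSC. */
#include <time.h>

static inline uint64_t tsc_lfence_rdtsc(void)
{
    struct timespec ts;

    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

static inline int cpu_has_rdtscp(void)
{
    return 0;
}

static inline uint64_t tsc_rdtscp(unsigned int *aux)
{
    *aux = 0;
    return tsc_lfence_rdtsc();
}
#endif
```

Note that RDTSCP, like the single-fence variant, does not stop later
instructions from starting before the read completes, so it would stand
in for the first barrier only, not for a second one.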

- the reason "back-to-back" (with the extreme example being in a
tight loop) matters is that if something isn't in a tight loop, any
jitter we see in the time counting wouldn't be visible anyway. One
random timestamp is meaningless on its own. It's only when you have
multiple ones that you can compare them. No?

I was looking at this document -
http://download.intel.com/embedded/software/IA/324264.pdf (How to
Benchmark Code Execution Times on Intel IA-32 and IA-64) - where they
try to precisely benchmark code execution times, and later switch to
using RDTSCP twice to obtain both upward and downward guarantees from
the barrier. Now, depending on context (loop or not), will a second
serializing instruction be needed, or can that too be avoided?
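
One way to get numbers for the back-to-back case is a small userspace
loop like the one below. It is a rough sketch under my own assumptions:
read_ts, struct tsc_loop_stats and tsc_back_to_back are hypothetical
names, the read uses a single lfence before rdtsc via <x86intrin.h>
(GCC or Clang), and the non-x86-64 branch is only a stand-in clock so
the file builds elsewhere. It says nothing about what the vsyscall code
itself does.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <x86intrin.h>

/* Single barrier before the read; no barrier after it. */
static inline uint64_t read_ts(void)
{
    _mm_lfence();
    return __rdtsc();
}
#else /* Assumption: stand-in clock for other architectures. */
#include <time.h>

static inline uint64_t read_ts(void)
{
    struct timespec ts;

    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

struct tsc_loop_stats {
    size_t samples;      /* number of consecutive pairs compared */
    size_t backwards;    /* pairs where the later read was smaller */
    uint64_t min_delta;  /* smallest forward step; UINT64_MAX if none */
};

/* Take n back-to-back reads and compare each one with its predecessor. */
static struct tsc_loop_stats tsc_back_to_back(size_t n)
{
    struct tsc_loop_stats s = { 0, 0, UINT64_MAX };
    uint64_t prev = read_ts();

    for (size_t i = 0; i < n; i++) {
        uint64_t now = read_ts();

        if (now < prev)
            s.backwards++;
        else if (now - prev < s.min_delta)
            s.min_delta = now - prev;
        prev = now;
        s.samples++;
    }
    return s;
}
```

A nonzero backwards count with the thread pinned to one CPU would be the
kind of number that argues for keeping the second barrier; a count of
zero over many runs would not prove it unnecessary, only fail to show
that it is needed.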


So _before_ we try some really clever data dependency trick with new
inline asm and magic "double shifts to create a zero" things, I really
would suggest just trying to remove the second lfence entirely and see
how that works. Maybe it doesn't work, but ...

Linus
