Re: [RFC patch 15/15] LTTng timestamp x86

From: Mathieu Desnoyers
Date: Sat Oct 18 2008 - 13:01:50 EST


* Luck, Tony (tony.luck@xxxxxxxxx) wrote:
> Complexity of dealing with all the random issues that have
> plagued TSC in different cpus over the years definitely
> seems to be a problem.
>

Yes :(

> I have one more idea on how we might be able to use
> TSC locally and still have confidence that we can
> merge local cpu buffers into a consistent stream.
>
> What if we read the HPET occasionally (once per
> second?) and add a record to our per-cpu buffer
> with the value of the HPET. That would give us
> a periodic cross-check of each cpu's TSC against
> real time so that a "smart" post-processor can
> sanity check the log entries at regular intervals.
>

Hrm, that would make the timestamps much more sensitive to tracing
hiccups:

- If interrupts are disabled for a long time on the system (e.g. a
  kernel bug, or early at boot), we cannot assume those HPET events
  will be logged at the expected interval.
- If we hit a buffer-full condition (the buffers are too small to
  handle the load, so events get dropped), we do not merely lose
  events: since we depend on those HPET events for a consistent
  time-base, the whole trace time-base must be considered
  untrustworthy.
- We would also have to read the HPET at each subbuffer boundary (at
  each page in Steven's implementation). This is required so we can
  make sense of the time-base of the buffers when we only gather the
  last subbuffers written, the previous ones having been overwritten
  in flight-recorder mode. However, with a relatively large load and
  small subbuffers (e.g. 4kB), we would have to read the HPET 2048
  times per second per cpu. On a 512-node machine, that becomes a
  problem (see the quick arithmetic right after this list, and my
  analysis of poor HPET scalability below).
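To put numbers on that last point (rough arithmetic; the ~8 MB/s per
cpu of trace data is my assumption, simply backed out of the 2048
figure above):

  8 MB/s per cpu / 4 kB per subbuffer  =  2048 subbuffer switches/s per cpu
  2048 HPET reads/s per cpu * 512 cpus ~= 1 million reads/s, all hitting a
                                          single, globally shared HPET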

> It doesn't deal with the truly insane TSC behaviours
> (like stopping completely in certain C states, or
> varying frequency) ... but it would at least be able
> to reliably detect these forms of insanity.
>

I also like the one done by AMD, where the cycle counter goes backward
on a single CPU. :) Hrm, I thought the behaviours you call truly insane
are exactly the ones we are trying to deal with?

And what do we say when we detect this? "Sorry, please upgrade your
hardware to get a reliable trace"? ;)

> We need periodic entries added to the buffer anyway
> to make sure we can detect rollover since we don't
> want to waste space in log records with a full-width
> TSC value.
>

Nope, this is not required. I removed the heartbeat event from LTTng two
weeks ago and instead implemented detection of the delta from the last
timestamp written into the trace: if the new timestamp is too far from
the previous one, we write the full 64-bit TSC in an extended event
header. Therefore, we have no dependency on interrupt latency to get a
sane time-base.
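For illustration, the mechanism boils down to something like this (a
minimal sketch; the names, header layouts and the 27-bit compact field
width are my assumptions, not the actual LTTng code):

#include <linux/types.h>

/*
 * Width of the compact timestamp field carried in a normal event
 * header (assumed value for this sketch; the real width is a trace
 * format detail).
 */
#define COMPACT_TSC_BITS	27
#define COMPACT_TSC_MASK	((1ULL << COMPACT_TSC_BITS) - 1)

struct compact_event_header {
	u32 tsc_lsb;		/* low-order TSC bits only */
	/* event id, payload size, ... */
};

struct extended_event_header {
	u32 tsc_lsb;		/* low-order TSC bits, as usual */
	u64 tsc_full;		/* full 64-bit TSC, written on large deltas */
	/* event id, payload size, ... */
};

/*
 * If more cycles elapsed since the previous event written in this
 * per-cpu buffer than the compact field can represent, the low-order
 * bits alone are ambiguous (they may have wrapped), so the full
 * 64-bit value must be recorded.  No periodic heartbeat event is
 * needed to keep the time-base decodable.
 */
static inline bool needs_extended_header(u64 last_tsc, u64 now)
{
	return (now - last_tsc) > COMPACT_TSC_MASK;
}

The last_tsc here is just the timestamp of the previous event written
in the same per-cpu buffer, so the check costs a subtraction and a
compare on the fast path.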


> -Tony
>

Here are some numbers showing the scalability of synchronized TSC vs.
cache-line-bouncing cmpxchg vs. HPET read under tracing load. I use
LTTng to take a trace into circular per-cpu memory buffers only, while
tbench is running, and I look at the resulting tbench throughput. This
kind of load generates a lot of tracing data, especially because tbench
does many small reads and writes, which produce a lot of system call
events. Side-note: LTTng is currently fully dynamic and parses the
format string like printk, and this accounts for a large part of the
performance degradation. LTTng does support overriding such a probe
with a "specialized" probe that knows exactly which types to record; I
just have not created any yet. So let's focus on timestamping:


model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
stepping : 6
cpu MHz : 2000.073

tbench, x86_64 dual quad-core, 2.0GHz, 16GB ram              Speed (Slowdown)

(8 cores up)
No tracing : 1910.50 MB/sec
Flight recorder tracing (per-cpu memory buffers)
synchronized TSC, get_cycles with cpuid : 940.20 MB/sec (50%)
unsync TSC, get_cycles + cmpxchg : 716.96 MB/sec (62%)
unsync TSC, HPET read : 586.53 MB/sec (69%)

(2 cores up)
No tracing : 488.15 MB/sec
Flight recorder tracing (per-cpu memory buffers)
synchronized TSC, get_cycles with cpuid : 241.34 MB/sec (50%)
unsync TSC, get_cycles + cmpxchg : 202.30 MB/sec (58%)
unsync TSC, HPET read : 187.04 MB/sec (61%)

(1 core up)
No tracing : 270.67 MB/sec
Flight recorder tracing (per-cpu memory buffers)
synchronized TSC, get_cycles with cpuid : 126.82 MB/sec (53.1%)
unsync TSC, get_cycles + cmpxchg : 124.54 MB/sec (53.9%)
unsync TSC, HPET read : 98.75 MB/sec (63.5%)

So, the conclusion this brings about the scalability of those time
sources for tracing is:
- the local TSC read scales very well as the number of CPUs increases
  (constant 50% overhead);
- comparing the added overhead of both get_cycles+cmpxchg and HPET
  against the local synchronized TSC:

cores   get_cycles+cmpxchg    HPET
  1            0.8%            10%
  2            8%              11%
  8           12%              19%

So, is it just me, or does HPET scale even more poorly than a
cache-line-bouncing cmpxchg? I find that a bit surprising.
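
For reference, the get_cycles+cmpxchg scheme measured above boils down
to something like this (a sketch with assumed names; the header
locations are arch-specific and this is not the exact LTTng trace-clock
code):

#include <linux/types.h>
#include <linux/timex.h>	/* get_cycles() */
#include <asm/cmpxchg.h>	/* cmpxchg64() */

/*
 * Single shared word holding the last timestamp returned on any cpu.
 * Every reader updates it, which is where the cache-line bouncing
 * measured above comes from.
 */
static u64 last_timestamp;

static u64 trace_clock_unsync_tsc(void)
{
	u64 last, now;

	do {
		last = last_timestamp;
		now = get_cycles();
		/*
		 * If this cpu's TSC is behind what another cpu already
		 * returned, nudge the result forward so the trace clock
		 * never goes backward across cpus.
		 */
		if ((s64)(now - last) <= 0)
			now = last + 1;
		/* Publish; retry if another cpu raced us and won. */
	} while (cmpxchg64(&last_timestamp, last, now) != last);

	return now;
}

The HPET variant replaces that whole loop with a single read of the
global memory-mapped counter: no cmpxchg, but every cpu is serialized
on one slow MMIO register, and the numbers above suggest that ends up
costing even more.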

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68