> Let's say a trace entry occupies 40 bytes and a TLB miss costs 200
> cycles on average. So we have 100 entries per page costing 200 cycles;
> amortized, each entry costs 2 cycles.

You seem to underestimate the frequency at which trace events can be
generated. E.g., by the time you run the scheduler once (which we can
consider a very hot kernel path), some tracing modes will generate
thousands of events, which will touch a very significant number of TLB
entries.

A quick test (shown below) gives the cost of a TLB miss on the Intel
Xeon E5404.

Number of cycles added over test baseline:

  tlb and cache hit:          12.42
  tlb hit, l2 hit, l1 miss:   17.88
  tlb hit, l2+l1 miss:        32.34
  tlb and cache miss:        449.58

So it's closer to 500 cycles per TLB miss.
Also, your analysis does not seem to correctly represent the reality of
the TLB thrashing cost. On a workload constantly walking over a large
number of random pages (e.g. a large hash table), eating just a few more
TLB entries will increase the number of misses over the entire workload.
So it's not so much the misses we see at the tracing site that are the
problem, but rather the extra misses taken by the application because of
the added pressure on the TLB. Even a few more TLB entries taken by the
tracer will likely hurt these workloads.
> There's an additional cost caused by the need to re-fill the TLB later,
> but you incur that anyway if the scheduler caused a context switch.

The performance hit is not taken if the scheduler schedules another
thread with the same mapping, only when it schedules a different
process.
> Of course, my assumptions may be completely off (likely larger entries
> but smaller miss costs).

Depending on the tracer design, the average event size can range from 12
bytes (LTTng is very aggressive in event size compaction) to about 40
bytes (perf); so on this point you are mostly right. However, as
explained above, the TLB miss cost is higher than you expected.
> Has a vmalloc based implementation been tested? It seems so much easier
> than the other alternatives.

I tested it in the past, and must admit that I changed from a
vmalloc-based implementation to a page-based one using software
cross-page write primitives, based on feedback from Steven and Ingo.
Diminishing TLB thrashing seemed like a good approach, and using vmalloc
on 32-bit machines is a pain, because users have to tweak the vmalloc
region size at boot. So, all in all, I moved to a vmalloc-less
implementation without much more thought.
If you feel we should test the performance of both approaches, we could
do it in the generic ring buffer library (it supports both types of
allocation backends). However, we'd have to find the right kind of
TLB-thrashing real-world workload to get meaningful results. This might
be the hardest part.