Here are some thoughts on how 'perf record' tracing performance could be
further improved:
1)

Using non-temporal stores (MOVNTQ, or MOVNTDQ with SSE2) to copy the
ring-buffer into the file buffer would keep the copy from polluting the
CPU cache - cache pollution is the largest 'collateral damage' copying
does.
glibc does not appear to expose non-temporal instructions so it's going to
be architecture dependent - but we could build the __copy_user_nocache()
function from the kernel proper (or copy it - we could even simplify it,
knowing that only large and page-aligned buffers are going to be copied
with it).
See how tools/perf/bench/mem-mem* does that to be able to measure the
kernel's memcpy() and memset() function performance.
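
To make the idea concrete, here is a minimal user-space sketch of such a
copy using the SSE2 intrinsics instead of the kernel's assembly routine.
copy_nocache() is a made-up name, not an existing perf or kernel function,
and the sketch assumes x86 with SSE2, 16-byte aligned buffers and a length
that is a multiple of 64 bytes (page-aligned ring-buffer chunks would
satisfy both):

```c
#include <assert.h>
#include <emmintrin.h>	/* SSE2: _mm_load_si128(), _mm_stream_si128() */
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of perf or the kernel): copy 'len' bytes
 * from 'src' to 'dst' with non-temporal stores, so the destination lines
 * are not pulled into the CPU cache.
 *
 * Assumptions: x86 with SSE2, both pointers 16-byte aligned, 'len' a
 * multiple of 64.
 */
static void copy_nocache(void *dst, const void *src, size_t len)
{
	__m128i *d = (__m128i *)dst;
	const __m128i *s = (const __m128i *)src;
	size_t i, n = len / sizeof(__m128i);

	assert(((uintptr_t)dst % 16) == 0 && ((uintptr_t)src % 16) == 0);
	assert(len % 64 == 0);

	/* One 64-byte cache line per iteration: four loads, four NT stores. */
	for (i = 0; i < n; i += 4) {
		__m128i a = _mm_load_si128(s + i);
		__m128i b = _mm_load_si128(s + i + 1);
		__m128i c = _mm_load_si128(s + i + 2);
		__m128i e = _mm_load_si128(s + i + 3);

		_mm_stream_si128(d + i, a);
		_mm_stream_si128(d + i + 1, b);
		_mm_stream_si128(d + i + 2, c);
		_mm_stream_si128(d + i + 3, e);
	}

	/* Non-temporal stores are weakly ordered: fence before publishing. */
	_mm_sfence();
}
```

The loads still go through the cache, so only the destination side avoids
pollution; whether that is a net win for 'perf record' would have to be
measured.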
2)

Another method would be to avoid the copies altogether via the splice
system-call - see:

	git grep splice kernel/trace/

To make splice low-overhead we'd have to introduce a mode that does not
mmap the data part of the perf ring-buffer but instead splices the data
straight from the perf fd into a temporary pipe, and from the pipe into
the target file (or socket).
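
The fd -> pipe -> fd pattern described above could look roughly like the
sketch below. splice_via_pipe() is a hypothetical helper name, and 'in_fd'
is only a stand-in for any fd that supports splice reads today - the perf
fd would need the new mode mentioned above before it could be used here:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for splice() */
#endif
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: move up to 'len' bytes from 'in_fd' to 'out_fd'
 * through a temporary pipe, without copying the data through user-space.
 * Returns the number of bytes moved, or -1 on error.
 */
static ssize_t splice_via_pipe(int in_fd, int out_fd, size_t len)
{
	ssize_t total = 0;
	int p[2];

	if (pipe(p) < 0)
		return -1;

	while (len > 0) {
		/* Fill the pipe from the source fd (at most one pipe-full). */
		ssize_t n = splice(in_fd, NULL, p[1], NULL, len,
				   SPLICE_F_MOVE | SPLICE_F_MORE);
		if (n < 0) {
			total = -1;
			break;
		}
		if (n == 0)	/* end of input */
			break;
		len -= (size_t)n;

		/* Drain everything we just queued into the target fd. */
		while (n > 0) {
			ssize_t m = splice(p[0], NULL, out_fd, NULL, (size_t)n,
					   SPLICE_F_MOVE | SPLICE_F_MORE);
			if (m <= 0) {
				total = -1;
				goto out;
			}
			n -= m;
			total += m;
		}
	}
out:
	close(p[0]);
	close(p[1]);
	return total;
}
```

This costs two system calls per pipe-full of data, so a real implementation
would probably want a larger pipe (F_SETPIPE_SZ) and would have to be
measured against the plain copy.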
OTOH non-temporal stores are incredibly simple and memory bandwidth is
plentiful on modern systems, so I'd certainly try that route first.