Re: [PATCH V3 0/1] kvm: Output TSC offset

From: Gleb Natapov
Date: Sun Jun 23 2013 - 03:59:10 EST


On Wed, Jun 12, 2013 at 04:43:41PM +0900, Yoshihiro YUNOMAE wrote:
> Hi All,
>
Applied, thanks.

> I'd like to propose a patch which adds a tracepoint at write_tsc_offset for
> tracing guests TSC offset changes. It is required for sorting the trace data of
> a guest and the host in chronological order.
>
> In a virtualization environment, it is difficult to analyze performance
> problems, such as a delay of I/O request on a guest. This is because multiple
> guests operate on the host. One of approaches for solving such kind of problems
> is to sort trace data of guests and the host in chronological order.
>
> Raw TSC can be chosen as a timestamp of ftrace. I think TSC is useful for
> merging trace data in chronological order by two reasons. One of the reasons is
> that guests can directly read raw TSC from the CPU using rdtsc operation. This
> means that raw TSC value is not software clock like sched_clock, so we don't
> need to consider about how the timestamp is calculated. The other is that TSC
> of recent x86 CPUs is constantly incremented. This means that we don't need to
> worry about pace of the timestamp. Therefore, choosing TSC as a timestamp for
> tracing is reasonable to integrate trace data of guests and a host.
>
> Here, we need to consider about just one matter for using TSC on guests. TSC
> value on a guest is always the host TSC plus the guest's "TSC offset". In other
> words, to merge trace data using TSC as timestamp in chronological order, we
> need to consider TSC offset of the guest.
>
> However, only the host kernel can read the TSC offset from VMCS and TSC offset
> is not output in anywhere now. In other words, tools in userland cannot get
> the TSC offset value, so we cannot merge trace data of guest and the host in
> chronological order. Therefore, I think the TSC offset should be exported for
> userland tools.
>
> In this patch, TSC offset is exported by the tracepoint kvm_write_tsc_offset
> on the host. TSC offset events will be very rare event because guests may
> seldom execute write_tsc. So if we enable this event for a normal buffer of
> ftrace, the events will be overwritten by other events. For a recent linux
> kernel, a multiple buffer function is available. So, I recommend to enable
> this event for a sub buffer of ftrace.
>
> <Example>
> We assume that wakeup-latency for a command is big on a guest. Normally
> we will use ftrace's wakeup-latency tracer or event tracer on the guest, but we
> may not be able to solve this problem. This is because guests often exit to
> the host for several reasons. In the next, we will use TSC as ftrace's timestamp
> and record the trace data on the guest and the host. Then, we get following
> data:
>
> /* guest data */
> comm-3826 [000] d...49836825726903: sched_wakeup: [detail]
> comm-3826 [000] d...49836832225344: sched_switch: [detail]
> /* host data */
> qemu-kvm-2687 [003] d...50550079203669: kvm_exit: [detail]
> qemu-kvm-2687 [003] d...50550079206816: kvm_entry: [detail]
> qemu-kvm-2687 [003] d...50550079240656: kvm_exit: [detail]
> qemu-kvm-2687 [003] d...50550079243467: kvm_entry: [detail]
> qemu-kvm-2687 [003] d...50550079256103: kvm_exit: [detail]
> qemu-kvm-2687 [003] d...50550079268391: kvm_entry: [detail]
> qemu-kvm-2687 [003] d...50550079280829: kvm_exit: [detail]
> qemu-kvm-2687 [003] d...50550079286028: kvm_entry: [detail]
>
> Since TSC offset is not considered, these data cannot be merged. If this trace
> data is shown like as follows, we will be able to understand the reason:
>
> qemu-kvm-2687 [003] d...50550079203669: kvm_exit: [detail]
> qemu-kvm-2687 [003] d...50550079206816: kvm_entry: [detail]
> comm-3826 [000] d.h.49836825726903: sched_wakeup: [detail] <=
> qemu-kvm-2687 [003] d...50550079240656: kvm_exit: [detail]
> qemu-kvm-2687 [003] d...50550079243467: kvm_entry: [detail]
> qemu-kvm-2687 [003] d...50550079256103: kvm_exit: [detail]
> qemu-kvm-2687 [003] d...50550079268391: kvm_entry: [detail]
> comm-3826 [000] d...49836832225344: sched_switch: [detail] <=
> qemu-kvm-2687 [003] d...50550079280829: kvm_exit: [detail]
> qemu-kvm-2687 [003] d...50550079286028: kvm_entry: [detail]
>
> In this case, we can understand wakeup-latency was big due to exit to host
> twice. Getting this data sorted in chronological order is our goal.
>
> To merge the data like previous pattern, we apply this patch. Then, we can
> get TSC offset of the guest as follows:
>
> # cat /sys/kernel/debug/tracing/instances/tsc_offset/trace
> .. d...4300151845072: kvm_write_tsc_offset: vcpu=0 prev=0 next=18446739773557710924
> ^ ^^^^^^^^^^^^^^^^^^^^
>
> We use this TSC offset value to a merge script and obtain the following data:
>
> $ ./trace-merge.pl -g guest_data -h host_data -t 18446739773557710924
>
> h qemu-kvm-2687 [003] d...50550079203669: kvm_exit: [detail]
> h qemu-kvm-2687 [003] d...50550079206816: kvm_entry: [detail]
> g comm-3826 [000] d.h.50550079226331: sched_wakeup: [detail] <=
> h qemu-kvm-2687 [003] d...50550079240656: kvm_exit: [detail]
> h qemu-kvm-2687 [003] d...50550079243467: kvm_entry: [detail]
> h qemu-kvm-2687 [003] d...50550079256103: kvm_exit: [detail]
> h qemu-kvm-2687 [003] d...50550079268391: kvm_entry: [detail]
> g comm-3826 [000] d...50550079279266: sched_switch: [detail] <=
> h qemu-kvm-2687 [003] d...50550079280829: kvm_exit: [detail]
> h qemu-kvm-2687 [003] d...50550079286028: kvm_entry: [detail]
> |
> \----guest/host
>
> In this summary, I suggest the patch which TSC offset for each guest can be
> output on the host.
>
> As one example, I will send a merge tool. This tool assumes the guest doesn't
> execute write_tsc.
>
> Changes in V2:
> - Use tracepoint for outputting TSC offset value instead of printk() [1/1]
>
> Changes in V3:
> - Add tracepoint in svm.c [1/1]
> - Record vcpu_id [1/1]
> - Change "previous=" to "prev=" in output result [1/1]
>
> Thanks!
>
> ---
>
> Yoshihiro YUNOMAE (1):
> kvm: Add a tracepoint write_tsc_offset
>
>
> arch/x86/kvm/svm.c | 10 +++++++++-
> arch/x86/kvm/trace.h | 21 +++++++++++++++++++++
> arch/x86/kvm/vmx.c | 7 ++++++-
> arch/x86/kvm/x86.c | 1 +
> 4 files changed, 37 insertions(+), 2 deletions(-)
>
> --
> Yoshihiro YUNOMAE
> Software Platform Research Dept. Linux Technology Center
> Hitachi, Ltd., Yokohama Research Laboratory
> E-mail: yoshihiro.yunomae.ez@xxxxxxxxxxx

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/