Re: [PATCH 1/2] perf/x86/intel/pt: Fail event scheduling on conflict with VMX

From: Alexander Shishkin
Date: Wed Feb 15 2017 - 03:11:51 EST


Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:

> On Tue, Feb 14, 2017 at 07:38:07PM +0100, Peter Zijlstra wrote:
>
>> Right, so I question the whole 'lets not schedule PT when VMX' premise,
>> it leads to inconsistencies all over. How about we treat it like
>> ->add() succeeded and VMX simply results in no output.
>>
>> Esp. when you then emit 'fake' data into/from a vmlaunch/vmresume
>> instruction.
>
> That is, what about something like the below? (completely untested for
> obvious raisins).
>
> That should schedule PT like normal, and where VMXON will auto-clear
> TraceEn for us, we make VMXOFF set it again.

This makes sense, but this will only make a difference if we're tracing
(the kernel side of) the process that did VMXON in the first place and
we want to see what happens immediately after VMXOFF.

>
> ---
> arch/x86/events/intel/pt.c | 38 ++++++++++++++++++++++++++------------
> arch/x86/events/intel/pt.h | 1 +
> 2 files changed, 27 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
> index 5900471ee508..a42fa1bef761 100644
> --- a/arch/x86/events/intel/pt.c
> +++ b/arch/x86/events/intel/pt.c
> @@ -411,6 +411,7 @@ static u64 pt_config_filters(struct perf_event *event)
>
> static void pt_config(struct perf_event *event)
> {
> + struct pt *pt = this_cpu_ptr(&pt_ctx);
> u64 reg;
>
> if (!event->hw.itrace_started) {
> @@ -429,11 +430,14 @@ static void pt_config(struct perf_event *event)
> reg |= (event->attr.config & PT_CONFIG_MASK);
>
> event->hw.config = reg;
> - wrmsrl(MSR_IA32_RTIT_CTL, reg);
> +
> + if (!pt->vmx_on)
> + wrmsrl(MSR_IA32_RTIT_CTL, reg);
> }
>
> static void pt_config_stop(struct perf_event *event)
> {
> + struct pt *pt = this_cpu_ptr(&pt_ctx);
> u64 ctl = READ_ONCE(event->hw.config);
>
> /* may be already stopped by a PMI */
> @@ -441,7 +445,9 @@ static void pt_config_stop(struct perf_event *event)
> return;
>
> ctl &= ~RTIT_CTL_TRACEEN;
> - wrmsrl(MSR_IA32_RTIT_CTL, ctl);
> +
> + if (!pt->vmx_on)
> + wrmsrl(MSR_IA32_RTIT_CTL, ctl);
>
> WRITE_ONCE(event->hw.config, ctl);
>
> @@ -1174,10 +1180,12 @@ void intel_pt_interrupt(void)
> /*
> * If VMX is on and PT does not support it, don't touch anything.
> */
> - if (READ_ONCE(pt->vmx_on))
> + if (READ_ONCE(pt->vmx_on)) {
> + WRITE_ONCE(pt->vmx_pmi_pending, 1);
> return;
> + }

This is even simpler: we actually need to carry out the first part of
the interrupt function anyway, which deals with updating buffer pointers
etc, thus "handling" the PMI, but we don't restart the event, which will
be then done by the intel_pt_handle_vmx(0), so we don't need the
pending_pmi thingy.

Now the fake data is worrying me much more. Consider this: we start an
event while ->vmx_on==1, which means that before we write a fake VMCS
packet, we need to write a whole bunch of other packets to establish the
context with synchronization point and kitchen sink (PSB..PSBEND), then
fake a trace start TIP.PGE, then VMCS, then TIP.PGD. Then, we should
remember that we did this once to not do it again on every sched-in.

Another corner case is when there's not enough room in the buffer and we
need to postpone the fake VMCS until there is room again. Let me see if
there's more.

Regards,
--
Alex