Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic

From: Liang, Kan
Date: Tue Jan 24 2023 - 17:08:11 EST




On 2023-01-24 3:33 p.m., John Stultz wrote:
> On Tue, Jan 24, 2023 at 12:13 PM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>> On 2023-01-24 1:43 p.m., John Stultz wrote:
>>> On Tue, Jan 24, 2023 at 7:09 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>>>> On 2023-01-24 2:01 a.m., John Stultz wrote:
>>>>> On Mon, Jan 23, 2023 at 10:27 AM <kan.liang@xxxxxxxxxxxxxxx> wrote:
>>>>>> + /*
>>>>>> + * Check whether the given timestamp is on the current
>>>>>> + * timekeeping interval.
>>>>>> + */
>>>>>> + now = tk_clock_read(tkr);
>>>>>> + interval_start = tkr->cycle_last;
>>>>>> + if (!cycle_between(interval_start, cycles, now))
>>>>>> + return -EOPNOTSUPP;
>>>>> So. I've not fully thought this out, but it seems like it would be
>>>>> quite likely that you'd run into the case where the cycle_last value
>>>>> is updated and your earlier TSC timestamp isn't valid for the current
>>>>> interval. The get_device_system_crosststamp() logic has a big chunk of
>>>>> complex code to try to handle this case by interpolating the cycle
>>>>> value back in time. How well does just failing in this case work out?
>>>>>
>>>> For the case, perf fallback to the time captured in the NMI handler, via
>>>> ktime_get_mono_fast_ns().
>>> This feels like *very* subtle behavior. Maybe I'm misunderstanding,
>>> but the goal seems to be to have more accurate timestamps on the hw
>>> events, and using the captured tsc timestamp avoids the measuring
>>> latency reading the time again. But if every timekeeping update
>>> interval (~tick) you transparently get a delayed value due to the
>>> fallback, it makes it hard to understand which timestamps are better
>>> or worse. The latency between two reads may be real or it may be just
>>> bad luck. This doesn't intuitively seem like a great benefit over more
>>> consistent latency of just using the ktime_get_mono_fast()
>>> timestamping.
>> Your understand is correct. We want a more accurate timestamp for the
>> analysis work.
>>
>> As my understanding, the timekeeping update should not be very often. If
> "Often" depends on your your timescale.
>
>> I read the code correctly, it should happen only when adjusting NTP or
>> suspending/resuming. If so, I think the drawback should not impact the
>> normal analysis work. I will call out the drwabacks in the comments
>> where the function is used.
> So the adjustments are done at tick time depending on the current NTP
> "error" (basically what the kernel tracks as the delta from its sense
> of what NTP has told us).
>
> Not just at the time when ntp makes an adjustment.
>
> So the window for it to happen is every timekeeping update (which is ~HZ).

You mean the tkf->base is updated in each tick?
If so, that should be a problem.

Does the NTP "error" consistently happen on all the platforms? Or just
on some platforms which we can check and limit the usage?

There are two configurations for PEBS, single PEBS, and large PEBS.
For the single PEBS mode, HW triggers a NMI for each record. The TSC is
the time which the record is generated, and we convert it in the same
NMI. I don't think the NTP "error" can impact it.

But for the large PEBS mode, HW triggers a NMI only when a fixed number
of records are generated. It's very likely that the base has been
updated between the records. That could bring troubles, since we can
only fall back to the NMI handler time of the last record. I'm afraid we
have to disable the large PEBS mode.

Stephane,

What do you think?
Is it still useful for you if we only support the single PEBS?


Thanks,
Kan