Re: [PATCH 1/3] timekeeping: NMI safe converter from a given time to monotonic

From: John Stultz
Date: Tue Jan 24 2023 - 15:33:59 EST


On Tue, Jan 24, 2023 at 12:13 PM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
> On 2023-01-24 1:43 p.m., John Stultz wrote:
> > On Tue, Jan 24, 2023 at 7:09 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
> >> On 2023-01-24 2:01 a.m., John Stultz wrote:
> >>> On Mon, Jan 23, 2023 at 10:27 AM <kan.liang@xxxxxxxxxxxxxxx> wrote:
> >>>> + /*
> >>>> + * Check whether the given timestamp is on the current
> >>>> + * timekeeping interval.
> >>>> + */
> >>>> + now = tk_clock_read(tkr);
> >>>> + interval_start = tkr->cycle_last;
> >>>> + if (!cycle_between(interval_start, cycles, now))
> >>>> + return -EOPNOTSUPP;
> >>>
> >>> So. I've not fully thought this out, but it seems like it would be
> >>> quite likely that you'd run into the case where the cycle_last value
> >>> is updated and your earlier TSC timestamp isn't valid for the current
> >>> interval. The get_device_system_crosststamp() logic has a big chunk of
> >>> complex code to try to handle this case by interpolating the cycle
> >>> value back in time. How well does just failing in this case work out?
> >>>
> >>
> >> For the case, perf fallback to the time captured in the NMI handler, via
> >> ktime_get_mono_fast_ns().
> >
> > This feels like *very* subtle behavior. Maybe I'm misunderstanding,
> > but the goal seems to be to have more accurate timestamps on the hw
> > events, and using the captured tsc timestamp avoids the measuring
> > latency reading the time again. But if every timekeeping update
> > interval (~tick) you transparently get a delayed value due to the
> > fallback, it makes it hard to understand which timestamps are better
> > or worse. The latency between two reads may be real or it may be just
> > bad luck. This doesn't intuitively seem like a great benefit over more
> > consistent latency of just using the ktime_get_mono_fast()
> > timestamping.
>
> Your understand is correct. We want a more accurate timestamp for the
> analysis work.
>
> As my understanding, the timekeeping update should not be very often. If

"Often" depends on your your timescale.

> I read the code correctly, it should happen only when adjusting NTP or
> suspending/resuming. If so, I think the drawback should not impact the
> normal analysis work. I will call out the drwabacks in the comments
> where the function is used.

So the adjustments are done at tick time depending on the current NTP
"error" (basically what the kernel tracks as the delta from its sense
of what NTP has told us).

Not just at the time when ntp makes an adjustment.

So the window for it to happen is every timekeeping update (which is ~HZ).


> >> The TSC in PEBS is captured by HW when the sample was generated. There
> >> should be a small delta compared with the time captured in the NMI
> >> handler. But I think the delta should be acceptable as a backup solution
> >> for the most analysis cases. Also, I don't think the case (the
> >> cycle_last value is updated during the monitoring) should occur very
> >> often either. So I drop the history support to simplify the function.
> >
> > So the reads and this function are *always* used in NMI context? Has
> > this been stressed with things like SMIs to see how it does if
> > interrupted in those cases?
>
> Yes, it's *always* and only used in NMI context.

Thanks, that is helpful to clarify.

> > My worry is that (as I bored everyone earlier), the
> > ktime_get_*_fast_ns() interfaces already have some sharp edges and
> > need a fair amount of thought as to when they should be used. This is
> > sort of compounding that adding an interface that has further special
> > cases where it can fail, making it difficult to fully understand and
> > easier to accidentally misuse.
> >
> > My other concern is that interfaces always get stretched and used
> > beyond anything they were initially planned for (see the
> > ktime_get_*fast_ns() interfaces here as an example! :), and in this
> > case the logic seems to have lots of implicit dependencies on the
> > facts of your specific use case, so it seems a bit fragile should
> > folks on other architectures with other constraints try to use it.
> >
> > So I just want to push a bit to think how you might be able to
> > extend/generalize the existing get_system_crosststamp for your
> > purposes, or alternatively find a way to simplify the logic's behavior
> > so its less tied to specific constraints ("this works most of the time
> > from NMI, but otherwise no promises"). Or at least some better
> > documentation around the code, its uses and its constraints? ( "NMI
> > safe" is not the same as "Only safe to use from NMI" :)
>
> Since our usage is fixed (only in NMI), I prefer the latter. I think
> extending/generalizing the existing function only makes the function
> extremely complex and low efficient. The new function should have the
> same constraints as the existing ktime_get_mono_fast_ns(). Since perf
> can live with the ktime_get_mono_fast_ns(), there should be no problem
> with the new function for the constraints. I will add more comments to
> clarify the usage and constraints to avoid the abuse of the new function.

I agree the existing function is complex, so adding more complexity
isn't ideal, but potentially breaking it up or reworking it might be
better. Having two similar but different implementations is also a
complexity. So I just want to make sure this is well considered. But
clearer documentation as a first step will help.

Thanks
-john