Re: [RFC PATCH V2 2/9] perf: Extend ABI to support post-processing monotonic raw conversion

From: Liang, Kan
Date: Tue Feb 14 2023 - 12:01:48 EST




On 2023-02-14 9:51 a.m., Liang, Kan wrote:
>
>
> On 2023-02-13 5:22 p.m., John Stultz wrote:
>> On Mon, Feb 13, 2023 at 1:40 PM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>>> On 2023-02-13 2:37 p.m., John Stultz wrote:
>>>> On Mon, Feb 13, 2023 at 11:08 AM <kan.liang@xxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> From: Kan Liang <kan.liang@xxxxxxxxxxxxxxx>
>>>>>
>>>>> The monotonic raw clock is not affected by NTP/PTP correction. The
>>>>> calculation of the monotonic raw clock can be done in the
>>>>> post-processing, which can reduce the kernel overhead.
>>>>>
>>>>> Add hw_time in the struct perf_event_attr to tell the kernel to dump the
>>>>> raw HW time to user space. The perf tool will calculate the HW time
>>>>> in post-processing.
>>>>> Currently, only the monotonic raw conversion is supported.
>>>>> Only dump the raw HW time with PERF_RECORD_SAMPLE, because the accurate
>>>>> HW time can only be provided in a sample by HW. For other types of
>>>>> records, the user-requested clock is returned as usual. Nothing
>>>>> is changed.
>>>>>
>>>>> Add perf_event_mmap_page::cap_user_time_mono_raw ABI to dump the
>>>>> conversion information. The cap_user_time_mono_raw also indicates
>>>>> whether the monotonic raw conversion information is available.
>>>>> If yes, the clock monotonic raw can be calculated as
>>>>> mono_raw = base + ((cyc - last) * mult + nsec) >> shift
>>>>
>>>> Again, I appreciate you reworking and resending this series out, I
>>>> know it took some effort.
>>>>
>>>> But oof, I'd really like to make sure we're not exporting timekeeping
>>>> internals to userland.
>>>>
>>>> I think Thomas' suggestion of doing the timestamp conversion in
>>>> post-processing was more about interpolating collected system times
>>>> with the counter (tsc) values captured.
>>>>
>>>
>>> Thomas, could you please clarify your suggestion regarding "the relevant
>>> conversion information" provided by the kernel?
>>> https://lore.kernel.org/lkml/87ilgsgl5f.ffs@tglx/
>>>
>>> Is it only the interpolation information or the entire conversion
>>> information (Mult, shift etc.)?
>>>
>>> If it's only the interpolation information, user space will lack the
>>> information needed to handle all the cases. If I understand John's comments
>>> correctly, it could also introduce some interpolation error, which can only
>>> be addressed by the mult/shift conversion.
>>
>
>
> Thanks for the details John.
>
>> "Only" is maybe too strong a word. I think having the driver use
>> kernel timekeeping accessors to CLOCK_MONOTONIC_RAW time with
>> counter values will minimize the error.
>>
>
> The key motivation for using the TSC in the PEBS record is to get an
> accurate timestamp for each record. We definitely want the conversion to
> have minimal error.
>
>
>> But again, it's not yet established that any interpolation error using
>> existing interfaces is great enough to be problematic here.
>>
>> The interpolation is pretty easy to do:
>>
>> do {
>>         start = readtsc();
>>         clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
>>         end = readtsc();
>>         delta = end - start;
>> } while (delta > THRESHOLD); // make sure the reads were not preempted
>> mid = start + (delta + 1) / 2; // round-closest
>>
>
> How should the THRESHOLD be chosen? It seems the THRESHOLD value also
> impacts the accuracy.
>
>
>> and be able to get you a fairly close matching of TSC to
>> CLOCK_MONOTONIC_RAW value.
>>
>> Once you have that mapping you can take a few samples and establish
>> the linear function.
>>
>> But that will have some error, so quantifying that error helps
>> establish why being able to get an atomic mapping of TSC ->
>> CLOCK_MONOTONIC_RAW would help.
>>
>> So I really don't think we need to expose the kernel internal values
>> to userland, but I'm willing to guess the atomic mapping (which the
>> driver will have access to, not userland) may be helpful for the fine
>> granularity you want in the trace.
>>
>
> If I understand correctly, the idea is to let the user space tool run
> the above interpolation algorithm several times to 'guess' the atomic
> mapping, and then use that mapping to convert the TSC from the PEBS
> record. Is my understanding correct?
>
> If so, to be honest, I doubt we can get the accuracy we want.
>

I implemented a simple test to evaluate the error.

I collected the TSC -> CLOCK_MONOTONIC_RAW mapping using the above algorithm
at the start and end of the perf command.

         MONO_RAW          TSC
start    89553516545645    223619715214239
end      89562251233830    223641517000376

Here is what I get via mult/shift conversion from this patch.

         MONO_RAW          TSC
PEBS     89555942691466    223625770878571
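
For reference, the mult/shift formula quoted from the commit message can be
sketched roughly as below. This is only an illustration, not the code from
the patch: the struct and function names are made up, the grouping is
assumed to be base + (((cyc - last) * mult + nsec) >> shift), and nsec is
assumed to be kept in shifted (ns << shift) units as in the kernel
timekeeping code. The 128-bit intermediate relies on a GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical container for the conversion parameters. The field names
 * mirror the formula in the commit message, not the real ABI layout.
 */
struct mono_raw_conv {
	uint64_t base;   /* CLOCK_MONOTONIC_RAW base, in ns */
	uint64_t last;   /* counter value at the last timekeeping update */
	uint64_t nsec;   /* remainder, assumed to be in (ns << shift) units */
	uint32_t mult;   /* cycles -> shifted ns multiplier */
	uint32_t shift;  /* right shift applied after the multiply */
};

/* mono_raw = base + (((cyc - last) * mult + nsec) >> shift) */
static inline uint64_t cyc_to_mono_raw(const struct mono_raw_conv *c,
				       uint64_t cyc)
{
	assert(c->shift < 64);

	/* 128-bit intermediate so the multiply cannot overflow 64 bits. */
	unsigned __int128 scaled =
		(unsigned __int128)(cyc - c->last) * c->mult + c->nsec;

	return c->base + (uint64_t)(scaled >> c->shift);
}
```

With made-up parameters base = 1000, last = 100, nsec = 0, mult = 512 and
shift = 10 (0.5 ns per cycle), a counter value of 300 maps to 1100 ns.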

Then I used the start and end samples to build a linear function and
'guess' the MONO_RAW of the PEBS record from its TSC. I get
89555942692721, which differs from the mult/shift result by 1255 ns.
I tried several different PEBS records. The error is ~1000 ns.
I think that is an observable error.
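
The linear 'guess' above can be reproduced with a small sketch like the
following. It is a minimal illustration under my own assumptions, not the
actual test program: the struct and function names are hypothetical, the
line is drawn through exactly two samples, and the division truncates.
The product overflows 64 bits for the numbers above, so it uses
unsigned __int128 (a GCC/Clang extension).

```c
#include <assert.h>
#include <stdint.h>

/* One (TSC, CLOCK_MONOTONIC_RAW) pair. The name is illustrative only. */
struct clock_sample {
	uint64_t tsc;
	uint64_t mono_raw_ns;
};

/* Map tsc onto the straight line through samples a and b (a before b). */
static inline uint64_t interp_mono_raw(const struct clock_sample *a,
				       const struct clock_sample *b,
				       uint64_t tsc)
{
	assert(b->tsc > a->tsc && tsc >= a->tsc);

	/* (tsc - a.tsc) * (ns delta) does not fit in 64 bits here. */
	unsigned __int128 num = (unsigned __int128)(tsc - a->tsc) *
				(b->mono_raw_ns - a->mono_raw_ns);

	return a->mono_raw_ns + (uint64_t)(num / (b->tsc - a->tsc));
}
```

Feeding in the start and end samples from the table and the PEBS TSC
223625770878571 gives 89555942692721, i.e. 1255 ns above the mult/shift
value 89555942691466.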

Thanks,
Kan