Re: [RFC PATCH V2 2/9] perf: Extend ABI to support post-processing monotonic raw conversion

From: Liang, Kan
Date: Tue Mar 21 2023 - 11:27:05 EST


Hi John,

On 2023-03-18 2:02 a.m., John Stultz wrote:
> On Mon, Mar 13, 2023 at 2:19 PM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>>
>>
>>
>> On 2023-03-11 12:55 a.m., John Stultz wrote:
>>> On Thu, Mar 9, 2023 at 8:56 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>>>> On 2023-03-08 8:17 p.m., John Stultz wrote:
>>>>> So I spent a little bit of time today adding some trace_printks to the
>>>>> timekeeping code so I could record the actual TSC and timestamps being
>>>>> calculated from CLOCK_MONOTONIC_RAW.
>>>>>
>>>>> I did catch one error in the test code, which unfortunately I'm to blame for:
>>>>> mid = start + (delta +(delta/2))/2; //round-closest
>>>>>
>>>>> That should be
>>>>> mid = start + (delta +(2/2))/2 //round-closest
>>>>> or more simply
>>>>> mid = start + (delta +1)/2; //round-closest
>>>>>
>>>>> Generalized rounding should be: (value + (DIV/2))/DIV), but I'm
>>>>> guessing with two as the divisor, my brain mixed it up and typed
>>>>> "delta". My apologies!
>>>>>
>>>>> With that fix, I'm seeing closer to ~500ns of error in the
>>>>> interpolation, just using the userland sampling. Now, I've also
>>>>> disabled vsyscalls for this (otherwise I wouldn't be able to
>>>>> trace_printk), so the error likely would be higher than with
>>>>> vsyscalls.
>>>>>
>>>>> Now, part of the error is that:
>>>>> start= rdtsc();
>>>>> clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
>>>>> end = rdtsc();
>>>>>
>>>>> Ends up looking like
>>>>> start= rdtsc();
>>>>> clock_gettime() {
>>>>> now = rdtsc();
>>>>> delta = now - last;
>>>>> ns = (delta * mult) >> shift
>>>>> [~midpoint~]
>>>>> ts->nsec = base_ns + ns;
>>>>> ts->sec = base_sec;
>>>>> normalize_ts(ts)
>>>>> }
>>>>> end = rdtsc();
>>>>>
>>>>> And so by taking the mid-point we're always a little skewed from where
>>>>> the tsc was actually read. Looking at the data for my case the tsc
>>>>> read seems to be ~12% in, so you could instead try:
>>>>>
>>>>> delta = end - start;
>>>>> p12 = start + ((delta * 12) + (100/2))/100;
>>>>>
>>>>> With that adjustment, I'm seeing error around ~40ns.
>>>>>
>>>>> Mind giving that a try?
>>>>
>>>> I tried both the new mid and p12. The error becomes even larger.
>>>>
>>>> With new mid (start + (delta +1)/2), the error is now ~3800ns
>>>> With p12 adjustment, the error is ~6700ns.
>>>>
>>>>
>>>> Here is how I run the test.
>>>> $./time
>>>> $perf record -e cycles:upp --clockid monotonic_raw $some_workaround
>>>> $./time
>>>>
>>>> Here are some raw data.
>>>>
>>>> For the first ./time,
>>>> start: 961886196018
>>>> end: 961886215603
>>>> MONO_RAW: 341485848531
>>>>
>>>> For the second ./time,
>>>> start: 986870117783
>>>> end: 986870136152
>>>> MONO_RAW: 351495432044
>>>>
>>>> Here is the time generated from one PEBS record.
>>>> TSC: 968210217271
>>>> PEBS_MONO_RAW (calculated via kernel conversion information): 344019503072
>>>>
>>>> Using new mid (start + (delta +1)/2), the guessed PEBS_MONO_RAW is
>>>> 344019506897. The error is 3825ns.
>>>> Using p12 adjustment, the guessed PEBS_MONO_RAW is 344019509831.
>>>> The error is 6759ns
>>>
>>> Huh. I dunno. That seems wild that the error increased.
>>>
>>> Just in case something is going astray with the PEBS_MONO_RAW logic,
>>> can you apply the hack patch I was using to display the MONOTONIC_RAW
>>> values the kernel calculates?
>>> https://github.com/johnstultz-work/linux-dev/commit/8d7896b078965b059ea5e8cc21841580557f6df6
>>>
>>> It uses trace_printk, so you'll have to cat /sys/kernel/tracing/trace
>>> to get the output.
>>>
>>
>>
>> $ ./time_3
>> start: 7358368893806 end: 7358368902944 delta: 9138
>> MONO_RAW: 2899739790738
>> MID: 7358368898375
>> P12: 7358368894903
>> $ sudo cat /sys/kernel/tracing/trace | grep time_3
>> time_3-1443 [002] ..... 2899.858936: ktime_get_raw_ts64:
>> JDB: timekeeping_get_delta cycle_now: 7358368897679
>> time_3-1443 [002] ..... 2899.858937: ktime_get_raw_ts64:
>> JDB: ktime_get_raw_ts64: 2899739790738
>>
>> The error between MID and cycle_now is -696ns
>> The error between P12 and cycle_now is 2776ns
>
> Hey Kan,
> So I'm terribly sorry, I'm a bit underwater right now and haven't
> had time to look deeper at this. The MID case you have above looks
> closer to what I was seeing but I can't explain why the 12% case is
> worse.
>
> Since I feel it's not really fair to object to your patch but not have
> the time to work through an alternative with you, I'm going to
> withdraw my objection (though others may persist!).
> I'd still really prefer if we avoided exposing internal timekeeping
> state directly to userland, and it would be good to see some further
> exploration in other directions, but there is the existing perf mmap
> precedence (even if I dislike it). Sorry I can't be of more help to
> find a better approach here. :(
>

Thank you all the same. I think we learnt that there should be more work
for the pure user space solution. It is not a solution for the monotonic
raw conversion for now.

I have no idea how to do the post-processing conversion without the
internal conversion information.
So, for now, there seems only two candidate solutions.
- Pure kernel solution (Similar to V1).
- Expose the internal conversion information to the user space and does
post-processing conversion. (V2)

I will ping Thomas in the other thread and see if he has any suggestions.

Thanks,
Kan