Re: [tip:sched/core] sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()

From: Xunlei Pang
Date: Tue Jul 24 2018 - 09:30:33 EST


On 7/23/18 5:21 PM, Peter Zijlstra wrote:
> On Tue, Jul 17, 2018 at 12:08:36PM +0800, Xunlei Pang wrote:
>> The trace data corresponds to the last sample period:
>> trace entry 1:
>> cat-20755 [022] d... 1370.106496: cputime_adjust: task
>> tick-based utime 362560000000 stime 2551000000, scheduler rtime 333060702626
>> cat-20755 [022] d... 1370.106497: cputime_adjust: result:
>> old utime 330729718142 stime 2306983867, new utime 330733635372 stime
>> 2327067254
>>
>> trace entry 2:
>> cat-20773 [005] d... 1371.109825: cputime_adjust: task
>> tick-based utime 362567000000 stime 3547000000, scheduler rtime 334063718912
>> cat-20773 [005] d... 1371.109826: cputime_adjust: result:
>> old utime 330733635372 stime 2327067254, new utime 330827229702 stime
>> 3236489210
>>
>> 1) expected behaviour
>> Let's compare the last two trace entries (all the data below is in ns):
>> task tick-based utime: 362560000000->362567000000 increased 7000000
>> task tick-based stime: 2551000000 ->3547000000 increased 996000000
>> scheduler rtime: 333060702626->334063718912 increased 1003016286
>>
>> The application was actually running at almost 100% sys at that
>> moment; we can double-check with the increase in the task's tick-based
>> utime and stime:
>> 996000000/(7000000+996000000) > 99% sys
>>
>> 2) the current cputime_adjust() inaccurate result
>> But with the current cputime_adjust(), we get the following adjusted
>> utime and stime increases in this sample period:
>> adjusted utime: 330733635372->330827229702 increased 93594330
>> adjusted stime: 2327067254 ->3236489210 increased 909421956
>>
>> so 909421956/(93594330+909421956) = 91% sys, as the shell script
>> above shows.
>>
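For anyone who wants to reproduce the two percentages quickly, here is a
trivial userspace check of the deltas quoted in 1) and 2); nothing below
is from the patch itself:

#include <stdio.h>

int main(void)
{
	/* tick-based deltas from 1), in ns */
	double tick_ut = 7000000.0, tick_st = 996000000.0;
	/* adjusted deltas from 2), in ns */
	double adj_ut = 93594330.0, adj_st = 909421956.0;

	/* ~99.3%: what the workload really did */
	printf("tick-based sys%%: %.1f\n",
	       100.0 * tick_st / (tick_ut + tick_st));
	/* ~90.7%: what the current cputime_adjust() reports */
	printf("adjusted sys%%:   %.1f\n",
	       100.0 * adj_st / (adj_ut + adj_st));
	return 0;
}
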
>> 3) root cause
>> The root cause of the issue is that the current cputime_adjust()
>> always passes the whole accumulated times to scale_stime() to split
>> the whole utime and stime. This patch instead passes scale_stime()
>> only the deltas accrued within the user's sample period, as computed
>> in 1), and accumulates the results onto the previously saved adjusted
>> utime and stime, thereby guaranteeing accurate usr and sys increases
>> within the user's sample period.
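
To make the idea in 3) concrete, here is a rough userspace sketch (not
the patch itself; struct prev_sample and cputime_adjust_delta() are made
up for illustration, and the overflow care the real scale_stime() takes
is replaced by 128-bit arithmetic):

#include <stdint.h>

struct prev_sample {
	uint64_t tick_utime, tick_stime, rtime;	/* raw inputs last time */
	uint64_t adj_utime, adj_stime;		/* accumulated results */
};

static void cputime_adjust_delta(struct prev_sample *p, uint64_t tick_utime,
				 uint64_t tick_stime, uint64_t rtime)
{
	uint64_t d_ut = tick_utime - p->tick_utime;
	uint64_t d_st = tick_stime - p->tick_stime;
	uint64_t d_rt = rtime - p->rtime;
	uint64_t d_adj_st = 0;

	/* split only this period's rtime delta in the tick-delta ratio */
	if (d_ut + d_st)
		d_adj_st = (uint64_t)((unsigned __int128)d_rt * d_st /
				      (d_ut + d_st));

	p->adj_utime += d_rt - d_adj_st;
	p->adj_stime += d_adj_st;

	p->tick_utime = tick_utime;
	p->tick_stime = tick_stime;
	p->rtime = rtime;
}

With the trace numbers above, d_rt = 1003016286 and the tick-delta
ratio is 996000000/1003000000, so almost the whole rtime delta is
accounted as stime, matching the ~99% sys the workload really ran.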
>
> But why is this a problem?
>
> Since it's sample based, there's really not much you can guarantee.
> What if your test program were to run in userspace for 50% of the time
> but is so constructed to always be in kernel space when the tick
> happens?
>
> Then you would 'expect' it to be 50% user and 50% sys, but you're also
> not getting that.
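
Indeed, such a program is not hard to construct. A purely illustrative
userspace sketch (assuming HZ=100, i.e. a 10ms tick, and that the tick
fires near CLOCK_MONOTONIC 10ms boundaries, which is not guaranteed):

#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <fcntl.h>

#define TICK_NS 10000000ULL	/* assumed 10ms tick (HZ=100) */

static uint64_t now_ns(void)
{
	struct timespec ts;

	/* vDSO call, so the spin loops below stay in userspace */
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
	char buf[1 << 16];
	int fd = open("/dev/zero", O_RDONLY);

	for (;;) {
		uint64_t next = (now_ns() / TICK_NS + 1) * TICK_NS;

		/* most of the period: pure user time */
		while (now_ns() < next - TICK_NS / 10)
			;
		/* be in the kernel when the tick fires */
		while (now_ns() < next + TICK_NS / 10)
			read(fd, buf, sizeof(buf));
	}
}

Every tick then samples sys even though the task spends most of its
time in userspace.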
>
> This stuff cannot be perfect, and the current code provides 'sensible'
> numbers over the long run for most programs. Why muck with it?
>

Basically I am OK with the current implementation, except for one
scenario we have hit: when the kernel goes wrong for some reason and
suddenly runs 100% sys for several seconds (even triggering the
softlockup detector), the statistics monitor does not reflect that
fact, which confuses people. One example: with our per-cgroup top we
once saw "20% usr, 80% sys" displayed while the kernel was actually in
some busy loop (100% sys) at that moment, and in such a case the
tick-based samples are of course all sys.