utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps)

From: Michal Hocko
Date: Tue Aug 23 2016 - 10:33:39 EST


On Tue 23-08-16 10:26:03, Michal Hocko wrote:
> On Mon 22-08-16 19:47:09, Michal Hocko wrote:
> > On Mon 22-08-16 19:29:36, Michal Hocko wrote:
> > > On Mon 22-08-16 18:45:54, Michal Hocko wrote:
> > > [...]
> > > > I have no idea why those numbers are so different on my laptop
> > > > yet. It surely looks suspicious. I will try to debug this further
> > > > tomorrow.
> > >
> > > Hmm, so I've tried to use my version of awk on other machine and vice
> > > versa and it didn't make any difference. So this is independent on the
> > > awk version it seems. So I've tried to strace /usr/bin/time and
> > > wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 0}, ru_stime={0, 688438}, ...}) = 9128
> > >
> > > so the kernel indeed reports 0 user time for some reason. Note I
> > > was testing with 4.7 and right now with 4.8.0-rc3 kernel (no local
> > > modifications). The other machine which reports non-0 utime is 3.12
> > > SLES kernel. Maybe I am hitting some accounting bug. At first I was
> > > suspecting CONFIG_NO_HZ_FULL because that is the main difference between
> > > my and the other machine but then I've noticed that the tests I was
> > > doing in kvm have this disabled too.. so it must be something else.
> >
> > 4.5 reports non-0 while 4.6 zero utime. NO_HZ configuration is the same
> > in both kernels.
>
> and one more thing. It is not like utime accounting would be completely
> broken and always report 0. Other commands report non-0 values even on
> 4.6 kernels. I will try to bisect this down later today.

OK, so it seems I found it. I was quite lucky because account_user_time
is not all that popular function and there were basically no changes
besides Riks ff9a9b4c4334 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN
to jiffy granularity") and that seems to cause the regression. Reverting
the commit on top of the current mmotm seems to fix the issue for me.

And just to give Rik more context. While debugging overhead of the
/proc/<pid>/smaps I am getting a misleading output from /usr/bin/time -v
(source for ./max_mmap is [1])

root@test1:~# uname -r
4.5.0-rc6-bisect1-00025-gff9a9b4c4334
root@test1:~# ./max_map
pid:2990 maps:65515
root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/2990/smaps
rss:263368 pss:262203
Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/2990/smaps"
User time (seconds): 0.00
System time (seconds): 0.45
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.46
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1796
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 83
Voluntary context switches: 6
Involuntary context switches: 6
Swaps: 0
File system inputs: 248
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

See the User time being 0 (as you can see above in the quoted text it
is not a rounding error in userspace or something similar because wait4
really returns 0). Now with the revert
root@test1:~# uname -r
4.5.0-rc6-revert-00026-g7fc86f968bf5
root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3015/smaps
rss:263316 pss:262199
Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/3015/smaps"
User time (seconds): 0.18
System time (seconds): 0.29
Percent of CPU this job got: 97%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.50
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1760
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 79
Voluntary context switches: 5
Involuntary context switches: 7
Swaps: 0
File system inputs: 248
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

So it looks like the whole user time is accounted as the system time.
My config is attached and yes I do have CONFIG_VIRT_CPU_ACCOUNTING_GEN
enabled. Could you have a look please?

[1] http://lkml.kernel.org/r/20160817082200.GA10547@xxxxxxxxxxxxxx
--
Michal Hocko
SUSE Labs

Attachment: .config.gz
Description: application/gzip