Re: utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps)

From: Rik van Riel
Date: Tue Aug 23 2016 - 17:46:28 EST


On Tue, 2016-08-23 at 16:33 +0200, Michal Hocko wrote:
> On Tue 23-08-16 10:26:03, Michal Hocko wrote:
> > On Mon 22-08-16 19:47:09, Michal Hocko wrote:
> > > On Mon 22-08-16 19:29:36, Michal Hocko wrote:
> > > > On Mon 22-08-16 18:45:54, Michal Hocko wrote:
> > > > [...]
> > > > > I have no idea why those numbers are so different on my
> > > > > laptop
> > > > > yet. It surely looks suspicious. I will try to debug this
> > > > > further
> > > > > tomorrow.
> > > >
> > > > Hmm, so I've tried to use my version of awk on other machine
> > > > and vice
> > > > versa and it didn't make any difference. So this is independent
> > > > on the
> > > > awk version it seems. So I've tried to strace /usr/bin/time and
> > > > wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0,
> > > > {ru_utime={0, 0}, ru_stime={0, 688438}, ...}) = 9128
> > > >
> > > > so the kernel indeed reports 0 user time for some reason. Note
> > > > I
> > > > was testing with 4.7 and right now with 4.8.0-rc3 kernel (no
> > > > local
> > > > modifications). The other machine which reports non-0 utime is
> > > > 3.12
> > > > SLES kernel. Maybe I am hitting some accounting bug. At first I
> > > > was
> > > > suspecting CONFIG_NO_HZ_FULL because that is the main
> > > > difference between
> > > > my and the other machine but then I've noticed that the tests I
> > > > was
> > > > doing in kvm have this disabled too.. so it must be something
> > > > else.
> > >
> > > 4.5 reports non-0 while 4.6 zero utime. NO_HZ configuration is
> > > the same
> > > in both kernels.
> >
> > and one more thing. It is not like utime accounting would be
> > completely
> > broken and always report 0. Other commands report non-0 values even
> > on
> > 4.6 kernels. I will try to bisect this down later today.
>
> OK, so it seems I found it. I was quite lucky because
> account_user_time
> is not all that popular function and there were basically no changes
> besides Riks ff9a9b4c4334 ("sched, time: Switch
> VIRT_CPU_ACCOUNTING_GEN
> to jiffy granularity") and that seems to cause the regression.
> Reverting
> the commit on top of the current mmotm seems to fix the issue for me.
>
> And just to give Rik more context. While debugging overhead of the
> /proc/<pid>/smaps I am getting a misleading output from /usr/bin/time
> -v
> (source for ./max_mmap is [1])
>
> root@test1:~# uname -r
> 4.5.0-rc6-bisect1-00025-gff9a9b4c4334
> root@test1:~# ./max_mapÂ
> pid:2990 maps:65515
> root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> END {printf "rss:%d pss:%d\n", rss, pss}' /proc/2990/smaps
> rss:263368 pss:262203
> ÂÂÂÂÂÂÂÂCommand being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> {printf "rss:%d pss:%d\n", rss, pss} /proc/2990/smaps"
> ÂÂÂÂÂÂÂÂUser time (seconds): 0.00
> ÂÂÂÂÂÂÂÂSystem time (seconds): 0.45
> ÂÂÂÂÂÂÂÂPercent of CPU this job got: 98%
>

> root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3015/smaps
> rss:263316 pss:262199
> ÂÂÂÂÂÂÂÂCommand being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> {printf "rss:%d pss:%d\n", rss, pss} /proc/3015/smaps"
> ÂÂÂÂÂÂÂÂUser time (seconds): 0.18
> ÂÂÂÂÂÂÂÂSystem time (seconds): 0.29
> ÂÂÂÂÂÂÂÂPercent of CPU this job got: 97%

The patch in question makes user and system
time accounting essentially tick-based. If
jiffies changes while the task is in user
mode, time gets accounted as user time, if
jiffies changes while the task is in system
mode, time gets accounted as system time.

If you get "unlucky", with a job like the
above, it is possible all time gets accounted
to system time.

This would be true both with the system running
with a periodic timer tick (before and after my
patch is applied), and in nohz_idle mode (after
my patch).

However, it does seem quite unlikely that you
get zero user time, since you have 125 timer
ticks in half a second. Furthermore, you do not
even have NO_HZ_FULL enabled...

Does the workload consistently get zero user
time?

If so, we need to dig further to see under
what precise circumstances that happens.

On my laptop, with kernelÂ4.6.3-300.fc24.x86_64
I get this:

$ /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf
"rss:%d pss:%d\n", rss, pss}' /proc/19825/smaps
rss:263368 pss:262145
Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
{printf "rss:%d pss:%d\n", rss, pss} /proc/19825/smaps"
User time (seconds): 0.64
System time (seconds): 0.19
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.83

The main difference between your and my
NO_HZ config seems to be that NO_HZ_FULL
is set here. However, it is not enabled
at run time, so both of our systems
should only really get NO_HZ_IDLE
effectively.

Running tasks should get sampled with the
regular timer tick, while they are running.

In other words, vtime accounting should be
disabled in both of our tests, for everything
except the idle task.

Do I need to do anything special to reproduce
your bug, besides running the max mmap program
and the awk script?