Re: [BUG nohz]: wrong user and system time accounting

From: Luiz Capitulino
Date: Fri Mar 31 2017 - 23:16:25 EST


On Sat, 1 Apr 2017 01:24:54 +0200
Frederic Weisbecker <fweisbec@xxxxxxxxx> wrote:

> On Fri, Mar 31, 2017 at 04:09:10PM -0400, Luiz Capitulino wrote:
> > On Thu, 30 Mar 2017 17:25:46 -0400
> > Luiz Capitulino <lcapitulino@xxxxxxxxxx> wrote:
> >
> > > On Thu, 30 Mar 2017 16:18:17 +0200
> > > Frederic Weisbecker <fweisbec@xxxxxxxxx> wrote:
> > >
> > > > On Thu, Mar 30, 2017 at 09:59:54PM +0800, Wanpeng Li wrote:
> > > > > 2017-03-30 21:38 GMT+08:00 Frederic Weisbecker <fweisbec@xxxxxxxxx>:
> > > > > > If it works, we may want to take that solution, likely less performance sensitive
> > > > > > than using sched_clock(). In fact sched_clock() is fast, especially as we require it to
> > > > > > be stable for nohz_full, but using it involves costly conversion back and forth to jiffies.
> > > > >
> > > > > So both Rik and you agree with the skew tick solution, I will try it
> > > > > > tomorrow. Btw, should we add the random offset only to the CPUs in
> > > > > > nohz_full mode, or to all CPUs like the code above?
> > > >
> > > > Let's just keep it to all CPUs for simplicity.
> > > > Also please add a comment that explains why we need that skew_tick on nohz_full.
> > >
> > > I've tried all the test-cases we discussed in this thread with skew_tick=1
> > > and it worked as expected in bare-metal and KVM guests.
> > >
> > > However, I found a test-case that works in bare-metal but shows problems
> > > in KVM guests. It could be something that's KVM specific, or it could be
> > > something that's harder to reproduce in bare-metal.
> >
> > After discussing some findings on this issue with Rik, I realized that
> > we don't add the skew when restarting the tick in tick_nohz_restart().
> > Adding the offset there seems to solve this problem.
>
> Are you sure? tick_nohz_restart() doesn't seem to override the initial skew. It
> always forwards the expiration time on top of the last tick.

OK, I'll double check. Without my change the bug triggers almost
instantly with the described reproducer. With my change it didn't
trigger for several minutes (but the change does look wrong now that
I look at it again).
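
For reference, a minimal userspace sketch of the point in the quoted reply:
if the restart path only pushes the expiration forward from the last tick
in whole tick periods, the per-CPU skew set up initially survives the
restart. This is an illustration under stated assumptions, not the kernel
code: forward_expiry_ns() and tick_phase_ns() are hypothetical helpers, and
the arithmetic only mimics what I understand hrtimer_forward() to do (add
whole intervals until the expiry lies after "now").

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: move an expiry time forward by whole periods
 * until it lies strictly after now_ns. If the timer has not expired
 * yet, it is left untouched. Only multiples of period_ns are added,
 * so the offset of the expiry within a period never changes.
 */
static uint64_t forward_expiry_ns(uint64_t expires_ns, uint64_t now_ns,
                                  uint64_t period_ns)
{
    uint64_t missed;

    assert(period_ns > 0);

    if (now_ns < expires_ns)
        return expires_ns;

    /* Periods fully elapsed since the old expiry, plus the next one. */
    missed = (now_ns - expires_ns) / period_ns + 1;
    return expires_ns + missed * period_ns;
}

/* Hypothetical helper: position of an expiry inside its tick period. */
static uint64_t tick_phase_ns(uint64_t expires_ns, uint64_t period_ns)
{
    assert(period_ns > 0);
    return expires_ns % period_ns;
}
```

With a 1 ms period and a last tick that carried a 125 us skew,
forward_expiry_ns() returns an expiry whose tick_phase_ns() is still
125 us, no matter how long the tick was stopped. Under that assumption,
adding the offset again on restart would shift the tick rather than
restore a lost skew.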