Re: [PATCH 00/36] cputime: Convert core use of cputime_t to nsecs

From: Frederic Weisbecker
Date: Tue Nov 22 2016 - 08:46:21 EST


On Mon, Nov 21, 2016 at 11:17:28AM +0100, Martin Schwidefsky wrote:
> On Mon, 21 Nov 2016 07:59:56 +0100
> Martin Schwidefsky <schwidefsky@xxxxxxxxxx> wrote:
>
> > On Fri, 18 Nov 2016 15:47:02 +0100
> > Frederic Weisbecker <fweisbec@xxxxxxxxx> wrote:
> >
> > > On Fri, Nov 18, 2016 at 01:08:46PM +0100, Martin Schwidefsky wrote:
> > > > On Thu, 17 Nov 2016 19:08:07 +0100
> > > > Frederic Weisbecker <fweisbec@xxxxxxxxx> wrote:
> > > >
> > > > Now it has been proposed to implement lazy accounting to accumulate deltas
> > > > and do the expensive conversions only infrequently. This is pretty straight-
> > > > forward for account_user_time but to do this for the account_system_time
> > > > function is more complicated. The function has to differentiate between
> > > > guest/hardirq/softirq and pure system time. We would need to keep sums for
> > > > each bucket and provide a separate function to add to each bucket. Like
> > > > account_guest_time(), account_hardirq_time(), account_softirq_time() and
> > > > account_system_time(). Then it is up to the arch code to sort out the details
> > > > and call the accounting code once per jiffy for each of the buckets.
> > >
> > > That wouldn't be too hard really. The s390 code in vtime.c already does that.
> >
> > Yes, I agree that the accumulating change would not be too hard. Can I make the
> > request that we try to get that done first before doing the cleanup ?
>
> Played with the idea a bit, here is a prototype patch to do the delay system time
> accounting for s390. It applies against the latest s390 features tree which you'll
> find here
>
> git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git features
>
> The details probably needs some more work but it works.
>
> --
> From 1b5ef9ddf899da81a48de826f783b15e6fc45d25 Mon Sep 17 00:00:00 2001
> From: Martin Schwidefsky <schwidefsky@xxxxxxxxxx>
> Date: Mon, 21 Nov 2016 10:44:10 +0100
> Subject: [PATCH] s390/cputime: delayed accounting of system time
>
> The account_system_time() function is called with a cputime that
> occurred while running in the kernel. The function detects which
> context the CPU is currently running in and accounts the time to
> the correct bucket. This forces the arch code to account the
> cputime for hardirq and softirq immediately.
>
> Make account_guest_time non-static and add account_sys_time,
> account_hardirq_time and account_softirq_time. With these functions
> the arch code can delay the accounting for system time. For s390
> the accounting is done once per timer tick and for each task switch.
>
> Signed-off-by: Martin Schwidefsky <schwidefsky@xxxxxxxxxx>

Thanks a lot for taking care of that! I'll give a try to do the same
on powerpc.

A few comments below:

> ---
> arch/s390/include/asm/lowcore.h | 65 ++++++++++++-----------
> arch/s390/include/asm/processor.h | 3 ++
> arch/s390/kernel/vtime.c | 106 ++++++++++++++++++++++----------------
> include/linux/kernel_stat.h | 13 +++--
> kernel/sched/cputime.c | 22 +++++++-
> 5 files changed, 129 insertions(+), 80 deletions(-)
>
> diff --git a/arch/s390/include/asm/lowcore.h b/arch/s390/include/asm/lowcore.h
> index 62a5cf1..8a5b082 100644
> --- a/arch/s390/include/asm/lowcore.h
> +++ b/arch/s390/include/asm/lowcore.h
[...]
> @@ -110,34 +119,48 @@ static int do_account_vtime(struct task_struct *tsk, int hardirq_offset)
> #endif
> : "=m" (S390_lowcore.last_update_timer),
> "=m" (S390_lowcore.last_update_clock));
> - S390_lowcore.system_timer += timer - S390_lowcore.last_update_timer;
> - S390_lowcore.steal_timer += S390_lowcore.last_update_clock - clock;
> + clock = S390_lowcore.last_update_clock - clock;
> + timer -= S390_lowcore.last_update_timer;
> +
> + if ((tsk->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0))
> + S390_lowcore.guest_timer += timer;
> + else if (hardirq_count() - hardirq_offset)
> + S390_lowcore.hardirq_timer += timer;
> + else if (in_serving_softirq())
> + S390_lowcore.softirq_timer += timer;
> + else
> + S390_lowcore.system_timer += timer;

I initially thought that some code could be shared for that whole accumulation. Now I
don't know if it would be a good idea. An example would be to deal with the contexts above
in order to store the accumulation to the appropriate place.

>
> /* Update MT utilization calculation */
> if (smp_cpu_mtid &&
> time_after64(jiffies_64, this_cpu_read(mt_scaling_jiffies)))
> update_mt_scaling();
>
> + /* Calculate cputime delta */
> user = S390_lowcore.user_timer - tsk->thread.user_timer;
> - S390_lowcore.steal_timer -= user;
> tsk->thread.user_timer = S390_lowcore.user_timer;
> -
> + guest = S390_lowcore.guest_timer - tsk->thread.guest_timer;
> + tsk->thread.guest_timer = S390_lowcore.guest_timer;
> system = S390_lowcore.system_timer - tsk->thread.system_timer;
> - S390_lowcore.steal_timer -= system;
> tsk->thread.system_timer = S390_lowcore.system_timer;
> -
> - user_scaled = user;
> - system_scaled = system;
> - /* Do MT utilization scaling */
> - if (smp_cpu_mtid) {
> - u64 mult = __this_cpu_read(mt_scaling_mult);
> - u64 div = __this_cpu_read(mt_scaling_div);
> -
> - user_scaled = (user_scaled * mult) / div;
> - system_scaled = (system_scaled * mult) / div;
> - }
> - account_user_time(tsk, user, user_scaled);
> - account_system_time(tsk, hardirq_offset, system, system_scaled);
> + hardirq = S390_lowcore.hardirq_timer - tsk->thread.hardirq_timer;
> + tsk->thread.hardirq_timer = S390_lowcore.hardirq_timer;
> + softirq = S390_lowcore.softirq_timer - tsk->thread.softirq_timer;
> + tsk->thread.softirq_timer = S390_lowcore.softirq_timer;
> + S390_lowcore.steal_timer +=
> + clock - user - guest - system - hardirq - softirq;
> +
> + /* Push account value */
> + if (user)
> + account_user_time(tsk, user, scale_vtime(user));
> + if (guest)
> + account_guest_time(tsk, guest, scale_vtime(guest));
> + if (system)
> + account_sys_time(tsk, system, scale_vtime(system));
> + if (hardirq)
> + account_hardirq_time(tsk, hardirq, scale_vtime(hardirq));
> + if (softirq)
> + account_softirq_time(tsk, softirq, scale_vtime(softirq));

And doing that would be another part of the shared code.

>
> steal = S390_lowcore.steal_timer;
> if ((s64) steal > 0) {
> @@ -145,16 +168,22 @@ static int do_account_vtime(struct task_struct *tsk, int hardirq_offset)
> account_steal_time(steal);
> }
>
> - return virt_timer_forward(user + system);
> + return virt_timer_forward(user + guest + system + hardirq + softirq);
> }
>
> void vtime_task_switch(struct task_struct *prev)
> {
> do_account_vtime(prev, 0);
> prev->thread.user_timer = S390_lowcore.user_timer;
> + prev->thread.guest_timer = S390_lowcore.guest_timer;
> prev->thread.system_timer = S390_lowcore.system_timer;
> + prev->thread.hardirq_timer = S390_lowcore.hardirq_timer;
> + prev->thread.softirq_timer = S390_lowcore.softirq_timer;
> S390_lowcore.user_timer = current->thread.user_timer;
> + S390_lowcore.guest_timer = current->thread.guest_timer;
> S390_lowcore.system_timer = current->thread.system_timer;
> + S390_lowcore.hardirq_timer = current->thread.hardirq_timer;
> + S390_lowcore.softirq_timer = current->thread.softirq_timer;
> }

Ditto.

>
> /*
> @@ -174,31 +203,22 @@ void vtime_account_user(struct task_struct *tsk)
> */
> void vtime_account_irq_enter(struct task_struct *tsk)
> {
> - u64 timer, system, system_scaled;
> + u64 timer;
>
> timer = S390_lowcore.last_update_timer;
> S390_lowcore.last_update_timer = get_vtimer();
> - S390_lowcore.system_timer += timer - S390_lowcore.last_update_timer;
> -
> - /* Update MT utilization calculation */
> - if (smp_cpu_mtid &&
> - time_after64(jiffies_64, this_cpu_read(mt_scaling_jiffies)))
> - update_mt_scaling();
> -
> - system = S390_lowcore.system_timer - tsk->thread.system_timer;
> - S390_lowcore.steal_timer -= system;
> - tsk->thread.system_timer = S390_lowcore.system_timer;
> - system_scaled = system;
> - /* Do MT utilization scaling */
> - if (smp_cpu_mtid) {
> - u64 mult = __this_cpu_read(mt_scaling_mult);
> - u64 div = __this_cpu_read(mt_scaling_div);
> -
> - system_scaled = (system_scaled * mult) / div;
> - }
> - account_system_time(tsk, 0, system, system_scaled);
> -
> - virt_timer_forward(system);
> + timer -= S390_lowcore.last_update_timer;
> +
> + if ((tsk->flags & PF_VCPU) && (irq_count() == 0))
> + S390_lowcore.guest_timer += timer;
> + else if (hardirq_count())
> + S390_lowcore.hardirq_timer += timer;
> + else if (in_serving_softirq())
> + S390_lowcore.softirq_timer += timer;
> + else
> + S390_lowcore.system_timer += timer;

And Ditto.

We could put together the accumulation in a common struct in s390_lowcore,
and its mirror in thread struct then have helpers take care of the contexts.

How does that sound to you, would it help or hurt?

Thanks.