Re: [PATCH 1/1] sched: Make schedstats a runtime tunable that is disabled by default v4

From: Ingo Molnar
Date: Wed Feb 03 2016 - 07:49:36 EST



* Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:

> On Wed, Feb 03, 2016 at 12:28:49PM +0100, Ingo Molnar wrote:
> >
> > * Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> > > Changelog since v3
> > > o Force enable stats during profiling and latencytop
> > >
> > > Changelog since v2
> > > o Print stats that are not related to schedstat
> > > o Reintroduce a static inline for update_stats_dequeue
> > >
> > > Changelog since v1
> > > o Introduce schedstat_enabled and address Ingo's feedback
> > > o More schedstat-only paths eliminated, particularly ttwu_stat
> > >
> > > schedstats is very useful during debugging and performance tuning but it
> > > incurs overhead. As such, even though it can be disabled at build time,
> > > it is often enabled as the information is useful. This patch adds a
> > > kernel command-line and sysctl tunable to enable or disable schedstats on
> > > demand. It is disabled by default as someone who knows they need it can
> > > also learn to enable it when necessary.
> > >
> > > The benefits are workload-dependent but when it gets down to it, the
> > > difference will be whether cache misses are incurred updating the shared
> > > stats or not. [...]
> >
> > Hm, which shared stats are those?
>
> Extremely poor phrasing on my part. The stats share cache lines, and the impact
> partly depends on whether unrelated stats sit on the same cache line as the
> fields being updated.

Yes, but the question is, are there true cross-CPU cache-misses? I.e. are there
any 'global' (or per node) counters that we keep touching and which keep
generating cache-misses?

> > I think we should really fix those as well: those shared stats should be
> > percpu collected as well, with no extra cache misses in any scheduler fast
> > path.
>
> I looked into that, but converting those stats to per-cpu counters would incur
> sizable memory overhead. There are a *lot* of them, and the basic structure of
> the generic percpu_counter is:
>
> struct percpu_counter {
>         raw_spinlock_t lock;
>         s64 count;
> #ifdef CONFIG_HOTPLUG_CPU
>         struct list_head list;  /* All percpu_counters are on a list */
> #endif
>         s32 __percpu *counters;
> };

We don't have to reuse struct percpu_counter.
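
For example (a sketch only, with made-up names), each statistic could be a
plain u64 in a structure that is already per-CPU, with no lock, no aggregate
s64 and no hotplug list:

#include <linux/percpu.h>
#include <linux/types.h>

/* Sketch, not the actual patch: 8 bytes per statistic per CPU, updated
 * with no locking at all. */
struct example_schedstats {
        u64 ttwu_count;
        u64 ttwu_local;
        u64 sched_count;
};

static DEFINE_PER_CPU(struct example_schedstats, example_schedstats);

static inline void example_inc_ttwu(void)
{
        this_cpu_inc(example_schedstats.ttwu_count);
}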

> That's not taking into account the associated runtime overhead, such as
> synchronising them.

Why do we have to synchronize them in the kernel? User-space can recover them on a
per-CPU basis and add them up if it wishes to. We can update the schedstat utility
to handle the more spread-out fields as well.

Thanks,

Ingo