Re: [RFC PATCH v1 0/3] Saving power by cpu evacuation using sched_mc=n
From: Vaidyanathan Srinivasan
Date: Mon Apr 27 2009 - 02:39:38 EST
* Ingo Molnar <mingo@xxxxxxx> [2009-04-27 07:53:47]:
>
> * Vaidyanathan Srinivasan <svaidy@xxxxxxxxxxxxxxxxxx> wrote:
>
> > > > --------------------------------------------------------
> > > > sched_mc No Cores Performance AvgPower
> > > > used Records/sec (Watts)
> > > > --------------------------------------------------------
> > > > 0 8 1.00x 1.00y
> > > > 1 8 1.02x 1.01y
> > > > 2 8 0.83x 1.01y
> > > > 3 7 0.86x 0.97y
> > > > 4 6 0.76x 0.92y
> > > > 5 4 0.72x 0.82y
> > > > --------------------------------------------------------
> > >
> > > Looks like we want the kernel default to be sched_mc=1 ?
> >
> > Hi Ingo,
> >
> > Yes, sched_mc=1 wins for a simple cpu-bound workload like this.
> > But the challenge is that the best setting depends on the workload
> > and the system configuration. This leads me to think that the
> > default setting should be left to the distros, where we can
> > factor in various parameters and choose the right default from
> > user space.
> >
> >
> > > Regarding the values for 2...5 - is the AvgPower column time
> > > normalized or workload normalized?
> >
> > The AvgPower is time normalised, just the power value divided by
> > the baseline at sched_mc=0.
> >
> > > If it's time normalized then it appears there's no power win
> > > here at all: we'd be better off by throttling the workload
> > > directly (by injecting sleeps or something like that), right?
> >
> > Yes, there is no power win when comparing against peak benchmark
> > throughput in this case. However, more complex workload setups may
> > not show similar characteristics, because they do not depend only
> > on CPU bandwidth for their peak performance.
> >
> > * Reduction in cpu bandwidth may not directly translate to performance
> > reduction on complex workloads
> > * Even if there is degradation, the system may still meet the design
> > objectives. 20-30% increase in response time over a 1 second
> > nominal value may be acceptable in most cases
>
> But ... we could probably get a _better_ (near linear) slowdown by
> injecting wait cycles into the workload.
We have an advantage when complete cpu packages are left unused, as
opposed to just injecting idle time into all cores.
> I.e. we should only touch balancing if there's a _genuine_ power
> saving: i.e. less power is used for the same throughput.
The load balancer knows the cpu package topology, and hence in essence
knows the most power-efficient combinations of cores to use. If we
have to schedule on 4 cores in an 8-core system, the load balancer can
pick the right combination.
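For illustration, the same package topology is visible from user space
through the standard sysfs topology files (a minimal sketch; the 8-cpu
loop bound is just an assumption matching the example system):

/* Print which physical package each cpu belongs to, reading the
 * standard sysfs cpu topology interface. */
#include <stdio.h>

int main(void)
{
	char path[128];
	int cpu, pkg;
	FILE *f;

	for (cpu = 0; cpu < 8; cpu++) {
		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/topology/"
			 "physical_package_id", cpu);
		f = fopen(path, "r");
		if (!f)
			break;	/* cpu not present */
		if (fscanf(f, "%d", &pkg) == 1)
			printf("cpu%d -> package %d\n", cpu, pkg);
		fclose(f);
	}
	return 0;
}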
> The numbers in the table show a plain slowdown: doing fewer
> transactions means less power used. But that is trivial to achieve
> for a CPU-bound workload: throttle the workload. I.e. inject less
> work, save power.
Agreed, this example does not show the best use case for this feature.
However, we can easily verify experimentally that targeted evacuation
of cores provides better performance-per-watt than plain throttling to
reduce utilisation.
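For reference, performance-per-watt in the table above is just the
normalized throughput divided by the normalized power; a trivial
user-space calculation (values copied from the table):

#include <stdio.h>

int main(void)
{
	/* sched_mc level, normalized perf (x), normalized power (y) */
	static const struct { int mc; double perf, power; } t[] = {
		{ 0, 1.00, 1.00 }, { 1, 1.02, 1.01 }, { 2, 0.83, 1.01 },
		{ 3, 0.86, 0.97 }, { 4, 0.76, 0.92 }, { 5, 0.72, 0.82 },
	};
	unsigned int i;

	for (i = 0; i < sizeof(t) / sizeof(t[0]); i++)
		printf("sched_mc=%d  perf/watt = %.2f\n",
		       t[i].mc, t[i].perf / t[i].power);
	return 0;
}

Every evacuation level lands below 1.00 for this CPU-bound run, which
is exactly your point; the claim is that workloads not limited by CPU
bandwidth shift these ratios.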
> And if we want to throttle 'transparently', from the kernel, we
> should do it not via an artificial open-ended scale of
> sched_mc=2,3,4,5... - we should do it via a _percentage_ value.
Yes, we want to throttle transparently from the kernel, at core-level
granularity.
Having a percentage value that takes discrete steps based on the
number of cores in the system is a good idea. I will switch the
parameter to a percentage in the next iteration.
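One possible mapping from such a percentage cap to a core count,
rounding down so the cap is honoured and always keeping at least one
core (a hypothetical helper, not from the posted patches):

#include <stdio.h>

/* Map a utilisation cap in percent to the number of cores kept in
 * use; steps are discrete since only whole cores can be evacuated. */
static int cores_to_use(int online_cores, int pct_cap)
{
	int n = online_cores * pct_cap / 100;

	return n < 1 ? 1 : n;
}

int main(void)
{
	int pct;

	for (pct = 100; pct >= 10; pct -= 10)
		printf("%3d%% cap -> %d of 8 cores\n",
		       pct, cores_to_use(8, pct));
	return 0;
}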
> I.e. a system setting that says "at most utilize the system 80% of
> its peak capacity". That can be implemented by the kernel injecting
> small delays or by intentionally not scheduling on certain CPUs (but
> not delaying tasks - forcing them to other cpus in essence).
Advances in hardware power management, like very low power deep sleep
states and additional package-level power savings when all cores in a
package are idle, change the above assumption.
Uniformly adding delays on all CPUs provides far less power savings
than leaving one core or one complete package unused. Evacuating a
core or package essentially shuts it off, whereas injected delays
produce only very short bursts of idle time.
If we can accumulate all such idle time on a single core, with little
effect on fairness, we get better power savings for the same amount of
idle time or utilisation.
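As a back-of-the-envelope illustration (all wattages below are made-up
assumptions; only the shape of the numbers matters): take 8 cores in
dual-core packages, a busy core at 20 W, a shallowly idle core at 8 W,
and a fully idle package at 2 W per core, all at 75% utilisation:

#include <stdio.h>

#define CORES		8
#define BUSY_W		20.0	/* core fully busy */
#define SHALLOW_W	8.0	/* core in a shallow C-state */
#define PKG_OFF_W	2.0	/* per-core share, package in deep sleep */

int main(void)
{
	double util = 0.75;	/* 6 cores' worth of work */

	/* idle spread uniformly: every core 75% busy, 25% shallow idle */
	double uniform = CORES * (util * BUSY_W +
				  (1.0 - util) * SHALLOW_W);

	/* one dual-core package evacuated: 6 cores fully busy, 2 cores
	 * in package-level deep sleep, same total work done */
	double evac = 6 * BUSY_W + 2 * PKG_OFF_W;

	printf("uniform idle: %.0f W, package evacuated: %.0f W\n",
	       uniform, evac);
	return 0;
}

With these assumed numbers the evacuated configuration does the same
work at 124 W against 136 W, and the gap widens as package-level sleep
states get deeper.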
Agreed that this is coarse granularity compared to injecting delays,
but it will become practical as core density increases in enterprise
processor designs.
--Vaidy