Re: [PATCH]cpuset: add new API to change cpuset top group's cpus

From: Vaidyanathan Srinivasan
Date: Wed May 20 2009 - 13:21:55 EST


* Len Brown <lenb@xxxxxxxxxx> [2009-05-19 15:01:46]:

> > ... the point is, we
> > don't need a new interface to force a cpu idle. Hotplug does that.
> >
> > Furthermore, we should not want anything outside of that: either the cpu
> > is there and available for work, or it's not -- halfway measures don't make
> > sense.
> >
> > Furthermore, we already have power aware scheduling which tries to
> > aggregate idle time on cpu/core/packages so as to maximize the idle time
> > power savings. Use it there.
>
> Some context...
>
> In the past, server room power and thermal issues were handled
> either by spending too much money to provision power and
> thermals for theoretical worst case, or by abruptly shutting off
> servers when hard limits were reached.
>
> Going forward, platforms are getting smarter, measuring how
> much power is drawn from the power supply, measuring the room
> thermals etc. so that real dollars can be saved by deploying
> systems that exceed the theoretical worst case if the power
> and thermal limits are enforced.
>
> So if a server approaches its budget, the platform
> will notify the OS to limit its P-states, and limit its T-states
> in order to draw less power.
>
> If that is not sufficient, the platform will ask us to take
> processors off-line. These are not processors that are otherwise idle
> -- those are already saving as much power as they can --
> these are processors that are fully utilized.
>
> So power-aware scheduling is moot here, this isn't the
> partially idle case, this is the fully utilized case.

Hi Len,

Over and above power-aware scheduling, we have been exploring the
possibility of forcefully idling cpus for power savings. This is mostly
useful in the thermal case that you have mentioned, and also to provide
fine-grained power vs performance trade-offs. Creating idle time and
consolidating it efficiently in order to evacuate cores and packages
provides a framework to exploit C-states, apart from the P-states
and T-states that you have mentioned above. Adding C-state control
to save power and heat may let the system execute more
instructions under a given power/thermal constraint.

Reference: http://lkml.org/lkml/2009/5/13/173

> If power draw continues to be too high, the platform
> will simply ask us to take more processors off line.
>
> If this dance doesn't reduce power below that required,
> the platform will be shut off.
>
> So it is sufficient to simply not schedule cpu burners
> on the 'idled' processor. Interrupts should generally
> not matter -- and if they do, we'll end up simply idling
> an additional processor.

The requirements and use cases are clear.
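
For illustration only, the escalation Len describes can be sketched as a
small state machine. This is a hypothetical model, not code from the patch
or from any platform firmware: the enum, the function name next_step() and
the idea of a single "over budget" flag are all assumptions made for the
sketch.

```c
#include <stdio.h>

/* Hypothetical escalation steps, in the order described in the thread. */
enum throttle_step {
    STEP_NONE,      /* within budget, nothing requested */
    STEP_LIMIT_PT,  /* platform asks the OS to limit P-states and T-states */
    STEP_IDLE_CPU,  /* platform asks the OS to idle one (more) processor */
    STEP_SHUTDOWN   /* power is still too high: platform shuts off */
};

/*
 * Pick the next step. 'over_budget' is nonzero while power draw exceeds
 * the limit; 'idlable_cpus' is how many processors could still be idled.
 * While the system is within budget, the current step is simply held.
 */
enum throttle_step next_step(enum throttle_step cur, int over_budget,
                             int idlable_cpus)
{
    if (!over_budget)
        return cur;

    switch (cur) {
    case STEP_NONE:
        return STEP_LIMIT_PT;
    case STEP_LIMIT_PT:
    case STEP_IDLE_CPU:
        /* Keep idling processors one at a time until none are left. */
        return idlable_cpus > 0 ? STEP_IDLE_CPU : STEP_SHUTDOWN;
    case STEP_SHUTDOWN:
    default:
        return STEP_SHUTDOWN;
    }
}
```

Each return of STEP_IDLE_CPU stands for one more "please idle a processor"
request, which matches the point that a busy interrupt load on an idled
processor just leads to one more processor being idled.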

> > > > Besides, a hot-removed cpu will sit in a dead halt loop, which isn't power
> > > > efficient. Making a hot-removed cpu enter a deep C-state has been on the wish
> > > > list for a long time, but is still not available. acpi_processor_idle is a
> > > > module, and the cpuidle governor potentially can't handle an offline cpu.
> > >
> > > Then fix that hot-unplug idle loop. I agree that the hlt thing is silly,
> > > and I've no idea why it's still there; it seems like a much better candidate
> > > for your efforts than this.
>
> CONFIG_HOTPLUG_CPU has been problematic in the past.
> It does more than what we need here, so we thought
> a lighter-weight and lower-latency method that simply
> didn't schedule to the idled cpu would suffice.
>
> Personally, I don't think that CONFIG_HOTPLUG_CPU should exist,
> taking processors on and off-line should be part of CONFIG_SMP.
>
> A while back when I selected CONFIG_HOTPLUG_CPU from ACPI && SMP,
> there was a torrent of outrage that it infringed on users' rights
> to save that additional 18KB of memory that CONFIG_HOTPLUG_CPU
> includes that SMP does not...
>
> We are fixing the hot-unplug idle loop, but there
> turn out to be some issues with it related to idle
> processors with interrupts disabled that don't actually
> get down into the deep C-states we request :-(

Fixing the hot-unplug idle loop will help us use the cpu-hotplug
infrastructure for many other purposes, such as power/thermal
management. Do you think there could be some workaround/solution for
this in the short term?
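
For reference, this is roughly how userspace drives the existing
cpu-hotplug path today. It is a minimal sketch, assuming a kernel built
with CONFIG_HOTPLUG_CPU and the usual sysfs "online" control file; the
helper names cpu_online_path() and set_cpu_online() are made up for this
example, and the boot cpu may not expose the file at all.

```c
#include <stdio.h>

/*
 * Build the sysfs path of the "online" control file for a cpu.
 * Returns 0 on success, -1 if cpu is negative or buf is too small.
 */
int cpu_online_path(int cpu, char *buf, size_t len)
{
    int n;

    if (cpu < 0)
        return -1;
    n = snprintf(buf, len, "/sys/devices/system/cpu/cpu%d/online", cpu);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write "1" (online) or "0" (offline) to the control file.
 * Needs root; returns 0 on success, -1 on any failure.
 */
int set_cpu_online(int cpu, int online)
{
    char path[64];
    FILE *f;
    int ok;

    if (cpu_online_path(cpu, path, sizeof(path)) != 0)
        return -1;
    f = fopen(path, "w");
    if (!f)
        return -1;
    ok = fputs(online ? "1" : "0", f) >= 0;
    if (fclose(f) != 0)  /* the kernel reports errors when the write is flushed */
        ok = 0;
    return ok ? 0 : -1;
}
```

Calling set_cpu_online(3, 0) as root would ask the kernel to take cpu3
through the full hot-unplug path, which is exactly the heavier-weight
operation (and the hlt idle loop) being discussed above.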

> So this is why you see a patch for a "halfway measure",
> it does what is necessary, and does nothing more.

Peter had detailed comments on this aspect.

--Vaidy
