Re: [RFC PATCH V1 1/2] cpuidle: Data structure changes for globalcpuidle device

From: Vaidyanathan Srinivasan
Date: Fri Mar 25 2011 - 13:49:09 EST


* Len Brown <lenb@xxxxxxxxxx> [2011-03-25 04:12:03]:

> I agree it is silly to allocate a cpuidle_device
> for every cpu in the system as we do today.
>
> Yes, splitting the counters out of cpuidle_device
> is a necessary part of fixing that.
>
> However, cpuidle_device.cpuidle_state[] is currently not per-driver,
> it is per-cpu, and it is writable.
>
> In particular, the cpuidle_device->prepare() mechanism
> causes updates to the cpuidle_state[].flags,
> setting and clearing CPUIDLE_FLAG_IGNORE to
> tell the governor not to chose a state
> on a per-cpu basis at run-time.
>
> I don't like that mechanism.
> I'd like to see it replaced, and when replaced,
> cpuidle_state[] can be per system-wide driver.

Thanks for the detailed review. I agree that we should rework
handling of the cpuidle_state[].flags. However, is the prepare()
mechanism used at all? Can we remove the option completely?

> I think the real problem that prepare() was trying to solve
> is that the driver today does not have the ability to over-rule
> the choice made by the governor. The driver may discover
> in the course of trying to satisfy the request of the governor
> that it needs to demote to a shallower state; or it may
> do its best to satisfy the governor's request, and the hardware
> may demote its request to a shallower state.
>
> Unfortunately, when this happens, the driver dutifully
> returns the time spent in the state to cpuidle_idle_call(),
> who then updates the wrong last_residency, time, and usage counters.

I did not get this scenario. Are you saying

target_state->enter(dev, target_state) can enter a different state
than the one suggested by target_state?

I understand the hardware demotion part, but can we really detect the
target 'demoted' state in that case? I guess not.

> Sure is ironic for the driver to allocate the data structures and
> then hand the timer to the uppper layer, just to have the upper layer
> update the wrong data structures...
>
> Surely the driver enter routine should update the counters
> that the driver was obligated to allocate, and it should return
> the state actually entered (for tracing), rather than the time spent
> there.

Can we do something like this:

last_state = target_state->enter(dev, target_state)

dev->last_state and dev->last_residency are updated inside
target_state->enter()

The returned last_state is just for tracing, actual data is already
updated in the cpuidle_dev structure and used for sysfs display.

> The generic cpuidle code should simply handle where the counters live
> in the sysfs namespace, not updating the counters.
> This needs to be addressed before cpuidle_device.cpuidle_state[]
> can be made one/system.

Agreed.

Thanks again for the recommendations.

--Vaidy

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/