Re: PM runtime_error handling missing in many drivers?

From: Rafael J. Wysocki
Date: Fri Jul 08 2022 - 16:10:50 EST


On 7/8/2022 1:03 PM, Vincent Whitchurch wrote:
On Tue, Jun 21, 2022 at 11:38:33AM +0200, Oliver Neukum wrote:
On 20.06.22 16:42, Vincent Whitchurch wrote:
[110778.050000][ T27] rpm_resume: 0-0009 flags-4 cnt-1 dep-0 auto-1 p-0 irq-0 child-0
[110778.050000][ T27] rpm_return_int: rpm_resume+0x24d/0x11d0:0-0009 ret=-22

The following patch fixes the issue on vcnl4000, but is this the right
fix? And, unless I'm missing something, there are dozens of drivers
with the same problem.
Yes. The point of pm_runtime_resume_and_get() is to remove the need
for handling errors when the resume fails. So I fail to see why a
permanent record of a failure makes sense for this API.
I don't understand it either.

diff --git a/drivers/iio/light/vcnl4000.c b/drivers/iio/light/vcnl4000.c
index e02e92bc2928..082b8969fe2f 100644
--- a/drivers/iio/light/vcnl4000.c
+++ b/drivers/iio/light/vcnl4000.c
@@ -414,6 +414,8 @@ static int vcnl4000_set_pm_runtime_state(struct vcnl4000_data *data, bool on)
 	if (on) {
 		ret = pm_runtime_resume_and_get(dev);
+		if (ret)
+			pm_runtime_set_suspended(dev);
 	} else {
 		pm_runtime_mark_last_busy(dev);
 		ret = pm_runtime_put_autosuspend(dev);
If you need to add this to every driver, you can just as well add it to
pm_runtime_resume_and_get() to avoid the duplication.
Yes, the documentation says that the error should be cleared, but it's
unclear why the driver is expected to do it. From the documentation it
looks like the driver is supposed to choose between pm_runtime_set_active()
and pm_runtime_set_suspended() to clear the error, but how/why is this
choice supposed to be made in the driver when the driver doesn't know
more than the framework about the status of the device?

Perhaps Rafael can shed some light on this.

The driver always knows more than the framework about the device's actual state. The framework only knows that something failed, but it doesn't know what failed or in what way.
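
To make that concrete, here is a rough userspace sketch (not kernel code) of a driver using its own knowledge of the hardware to pick the status that clears the error. Everything named toy_* is hypothetical, and the hw_powered field stands in for whatever device-specific probe a real driver would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; none of these names exist in the kernel. */
enum toy_rpm_status { TOY_RPM_ACTIVE, TOY_RPM_SUSPENDED };

struct toy_device {
	bool hw_powered;            /* real hardware state, visible only to the driver */
	enum toy_rpm_status status; /* what the framework believes */
	int runtime_error;          /* sticky error left by a failed callback */
};

/*
 * Driver-side recovery: look at the hardware, then tell the framework
 * which state the device is really in. This mirrors the choice between
 * pm_runtime_set_active() and pm_runtime_set_suspended().
 */
static void toy_clear_error(struct toy_device *d)
{
	assert(d != NULL);
	d->status = d->hw_powered ? TOY_RPM_ACTIVE : TOY_RPM_SUSPENDED;
	d->runtime_error = 0;
}
```

The framework side of this model has no way to fill in hw_powered on its own, which is the reason the choice is left to the driver.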


But I am afraid we need to ask a deeper question. Is there a point
in recording failures to resume? The error code is reported back.
If a driver wishes to act upon it, it can. The core really only
uses the result to block new PM operations.
But nobody requests a resume unless it is necessary. Thus I fail
to see the point of checking this flag in resume as opposed to
suspend. If we fail, we fail, why not retry? It seems to me that the
record should be used only during runtime suspend.
I guess this is also a question for Rafael.

Even if the error recording is removed from runtime_resume and only done
on suspend failures, all these drivers still have the problem of not
clearing the error, since the next resume will fail if that is not done.
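
For illustration, the sticky behaviour can be modelled in a few lines of userspace C. This is only a sketch of what is described above, not the actual PM core: the toy_* names and the hw_fail_once knob are made up, and the only details taken from the thread are that a recorded error makes the next resume return -EINVAL (the ret=-22 in the trace) and that pm_runtime_set_suspended() clears it.

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-in for the per-device PM state; not the kernel structure. */
struct toy_dev {
	int runtime_error; /* sticky error recorded by a failed callback */
	int usage_count;
	int hw_fail_once;  /* hypothetical knob: make the next resume callback fail */
};

/* Models resume_and_get: a recorded error blocks every later resume. */
static int toy_resume_and_get(struct toy_dev *d)
{
	int ret = 0;

	d->usage_count++;
	if (d->runtime_error) {
		ret = -EINVAL;           /* earlier failure was never cleared */
	} else if (d->hw_fail_once) {
		d->hw_fail_once = 0;
		d->runtime_error = -EIO; /* the failure is recorded */
		ret = -EIO;
	}
	if (ret) {
		assert(d->usage_count > 0);
		d->usage_count--;        /* the count is dropped again on failure */
	}
	return ret;
}

/* Models the error-clearing side effect of pm_runtime_set_suspended(). */
static void toy_set_suspended(struct toy_dev *d)
{
	d->runtime_error = 0;
}
```

In this model a single transient failure makes every later toy_resume_and_get() return -EINVAL until toy_set_suspended() is called, which is the situation the vcnl4000 patch works around.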

The idea was that drivers would clear these errors.


And as an immediate band-aid, some errors like ENOMEM should
never be recorded.