Re: [PATCHv2] firmware: Correct handling of fw_state_wait_timeout() return value

From: Luis R. Rodriguez
Date: Tue Jan 17 2017 - 12:30:50 EST


On Tue, Jan 17, 2017 at 08:30:37AM -0800, Jakub Kicinski wrote:
> On Tue, Jan 17, 2017 at 8:21 AM, Luis R. Rodriguez <mcgrof@xxxxxxxxxx> wrote:
> >>>
> >>> retval = fw_state_wait_timeout(&buf->fw_st, timeout);
> >>> - if (retval < 0) {
> >>> + if (retval == -ETIMEDOUT || retval == -ERESTARTSYS) {
> >>> mutex_lock(&fw_lock);
> >>> fw_load_abort(fw_priv);
> >>> mutex_unlock(&fw_lock);
> >>
> >> This is a bit messy, two other similar issues were reported before
> >> and upon review I suggested Patrick Bruenn's fix with a better commit
> >> log seems best fit. Patrick sent a patch Jan 4, 2017 but never followed up
> >> despite my feedback on a small change on the commit log message [0]. Can you
> >> try that and if that fixes it can you adjust the commit log accordingly? Please
> >> note the preferred solution would be:
> >>
> >> diff --git a/drivers/base/firmware_class.c b/drivers/base/firmware_class.c
> >> index b9ac348e8d33..c530f8b4af01 100644
> >> --- a/drivers/base/firmware_class.c
> >> +++ b/drivers/base/firmware_class.c
> >> @@ -542,6 +542,8 @@ static struct firmware_priv *to_firmware_priv(struct device *dev)
> >>
> >> static void __fw_load_abort(struct firmware_buf *buf)
> >> {
> >> + if (!buf)
> >> + return;
>
> Allow me to try to persuade you one last time :) My patch makes the
> code more logical and easier to follow. The code says:
> in case no wake up happened - finish the wait (otherwise the waking
> thread finishes it).

Your patch is still wrong, as Patrick great commit log notes a null defer
can also happen on a race with a case of -1 being sent and a -ENOENT error,
so we'd have to adjust for when __fw_state_wait_common() returns also
-ENOENT.

> Adding a NULL-check would just paper over the
> issue and can cause trouble down the line.

We typically bail on errors and use similar code to bail out, and we
typically do these things. Here its no different. The *real* issue
is the fact that we have a waiting timeout which can fail race against
a user imposed error out on the sysfs interface. There is one catch:

We already lock with the big fw_lock and use this to be able to check
for the status of the fw, so once aborted we technically should not have
to abort again. A proper way to address then this would have been to check
for the status of the fw prior to aborting again given we also lock on the
big fw_lock. A problem with this though is the status is part of the buf
which is set to NULL after we are done aborting.

> If fw_state_wait_timeout()
> returned because someone woke it up - there is no reason to abort the
> wait. The wait is already finished. The buggy commit mixed up return
> codes from fw_state_wait_timeout() - mixed "nobody woke us up" with
> "we couldn't find the FW", that's why we need to check for specific
> error codes.

Nope, sorry I still believe the check on buf is needed. The code not not
ideal, but I welcome further changes to help shape it up, such as the
changes Daniel has been helping with.

Patrick, if you can document what I mentioned about what I said about the catch
in your commit log it could help further reviewers *why* we do it this way.

Luis