Re: [PATCH 1/2] firmware, fix request_firmware_nowait() freeze with no uevent

From: Prarit Bhargava
Date: Mon Oct 21 2013 - 18:24:54 EST




On 10/21/2013 08:24 AM, Ming Lei wrote:
> On Mon, Oct 21, 2013 at 5:35 AM, Prarit Bhargava <prarit@xxxxxxxxxx> wrote:
>> If request_firmware_nowait() is called with uevent == NULL, the firmware
>> completion is never marked complete resulting in a hang in the process.
>>
>> If uevent is undefined, that means we're not waiting on anything and the
>> process should just clean up and complete. While we're at it, add a
>> debug dev_dbg() to indicate that the FW has not been found.
>>
>> Signed-off-by: Prarit Bhargava <prarit@xxxxxxxxxx>
>> Cc: x86@xxxxxxxxxx
>> Cc: herrmann.der.user@xxxxxxxxxxxxxx
>> Cc: ming.lei@xxxxxxxxxxxxx
>> Cc: tigran@xxxxxxxxxxxxxxxxxxxx
>> ---
>> drivers/base/firmware_class.c | 6 +++++-
>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/base/firmware_class.c b/drivers/base/firmware_class.c
>> index 10a4467..95778dc 100644
>> --- a/drivers/base/firmware_class.c
>> +++ b/drivers/base/firmware_class.c
>> @@ -335,7 +335,8 @@ static bool fw_get_filesystem_firmware(struct device *device,
>>  		set_bit(FW_STATUS_DONE, &buf->status);
>>  		complete_all(&buf->completion);
>>  		mutex_unlock(&fw_lock);
>> -	}
>> +	} else
>> +		dev_dbg(device, "firmware: %s not found\n", buf->fw_id);
>>
>>  	return success;
>>  }
>> @@ -886,6 +887,9 @@ static int _request_firmware_load(struct firmware_priv *fw_priv, bool uevent,
>>  			schedule_delayed_work(&fw_priv->timeout_work, timeout);
>>
>>  		kobject_uevent(&fw_priv->dev.kobj, KOBJ_ADD);
>> +	} else {
>> +		/* if there is no uevent then just cleanup */
>> +		schedule_delayed_work(&fw_priv->timeout_work, 0);
>>  	}
>
> This may not be a good idea and might break current NOHOTPLUG
> users,

Ming,

The code is broken for all callers of request_firmware_nowait() with NOHOTPLUG
and CONFIG_FW_LOADER_USER_HELPER=y. AFAICT with the two existing cases of this
usage in the kernel, both are broken and both are attempting to do the same
thing that I'm doing in the x86 microcode ATM.

This is the situation as I understand it; please correct me if I'm wrong about
the execution path. If I call request_firmware_nowait() with NOHOTPLUG, I am
essentially saying that there is no uevent associated with this firmware load;
that is, uevent = 0. request_firmware_work_func() is called as a scheduled
task, which results in a call to _request_firmware(). _request_firmware() first
calls _request_firmware_prepare(), which eventually results in a call to
__allocate_fw_buf(), which does an init_completion(&buf->completion).

Returning back up the stack to _request_firmware(), we eventually call
fw_get_filesystem_firmware(). _If the firmware does not exist_, success is false
and the if (success) block is not executed; it is important to note that
complete_all(&buf->completion) is _not_ called. fw_get_filesystem_firmware()
returns false, so fw_load_from_user_helper() is called from
_request_firmware().

fw_load_from_user_helper() eventually calls _request_firmware_load(), and this
is where we get into a problem. _request_firmware_load() does all the file
creation, etc., and then hits this chunk of code:

	if (uevent) {
		dev_set_uevent_suppress(f_dev, false);
		dev_dbg(f_dev, "firmware: requesting %s\n", buf->fw_id);
		if (timeout != MAX_SCHEDULE_TIMEOUT)
			schedule_delayed_work(&fw_priv->timeout_work, timeout);

		kobject_uevent(&fw_priv->dev.kobj, KOBJ_ADD);
	}

	wait_for_completion(&buf->completion);

As I previously said, we've been called with NOHOTPLUG, i.e., uevent = 0. That
means we skip down to the wait_for_completion(&buf->completion) ... and we wait
... forever.
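
To make the path above easier to follow, here is a minimal user space sketch
that traces who can end the wait. It is only a model under stated assumptions:
trace_request(), struct request_model and enum load_result are hypothetical
names, not kernel APIs, and the model assumes CONFIG_FW_LOADER_USER_HELPER=y so
that the fallback path is taken at all.

```c
#include <stdbool.h>

/* Hypothetical model of the load path; none of these names are kernel APIs. */
enum load_result {
	LOAD_DIRECT,     /* found on the filesystem, complete_all() was called */
	LOAD_VIA_SYSFS,  /* user space supplied the image through sysfs */
	LOAD_TIMED_OUT,  /* timeout_work fired and ended the wait */
	LOAD_HANGS       /* nothing ever signals buf->completion */
};

struct request_model {
	bool found_on_fs;      /* direct filesystem lookup succeeds */
	bool uevent;           /* false models the NOHOTPLUG case */
	bool userspace_writes; /* some program feeds the sysfs interface */
	bool has_timeout;      /* timeout != MAX_SCHEDULE_TIMEOUT */
};

enum load_result trace_request(const struct request_model *r)
{
	bool timeout_scheduled = false;

	/* The if (success) block is the only direct-load completion. */
	if (r->found_on_fs)
		return LOAD_DIRECT;

	/* Fallback path: the timeout is armed only inside if (uevent). */
	if (r->uevent && r->has_timeout)
		timeout_scheduled = true;

	/* wait_for_completion(): who is left to end the wait? */
	if (r->userspace_writes)
		return LOAD_VIA_SYSFS;
	if (timeout_scheduled)
		return LOAD_TIMED_OUT;
	return LOAD_HANGS;
}
```

With found_on_fs, uevent and userspace_writes all false, trace_request()
returns LOAD_HANGS no matter what has_timeout is, because the timeout is never
armed without a uevent.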

I can reproduce this by using a Dell PE 1850 & the dell_rbu module by doing the
following:

insmod dell_rbu.ko
echo init > /sys/devices/platform/dell_rbu/image_type
lsmod | grep dell_rbu

(after an hour)

[root@dell-pe1850-04 dell_rbu]# lsmod | grep dell_rbu
dell_rbu 14315 1
[root@dell-pe1850-04 dell_rbu]#

^^^ that use count is left because the thread is waiting while holding a module
ref count. For kicks I put a printk in the dell_rbu code (or instrumented the
_request_firmware() code) and did a reboot. Since the completions are finished
on system shutdown, I see the code continue to execute at the end of boot.
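
If the check needs to be scripted, the use count can be pulled out of an lsmod
line with a small helper. This is a sketch only: module_use_count() is a
hypothetical name, and it assumes the line has the three whitespace-separated
fields seen in the output above (name, size, use count).

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the use count from a line such as "dell_rbu 14315 1",
 * or -1 if the line does not parse or names a different module.
 */
int module_use_count(const char *line, const char *module)
{
	char name[64];
	unsigned long size;
	int used;

	/* Expect: <name> <size> <use count> */
	if (sscanf(line, "%63s %lu %d", name, &size, &used) != 3)
		return -1;
	if (strcmp(name, module) != 0)
		return -1;
	return used;
}
```

A use count that stays above zero long after the echo is consistent with the
stuck thread still holding its reference.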

> and how can you make sure the user space application can
> complete the request during the timeout time?

I see that your question really comes down to "are there additional
synchronizations needed in the two drivers that already call the code this way?"
I realize that the answer to that is yes and I'll fix those up in a v2. It
should be trivial to make those changes AFAICT. I've introduced some additional
synchronization via a completion in the x86 microcode and will likely have to do
something similar in the other drivers ... although it may be easier to just
have the firmware code do all the synchronization. I'll look into it.
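
The kind of driver-side synchronization I mean can be sketched in user space as
follows. This is a single-threaded model, not the actual v2 code: every name
here (request_nowait_model(), run_pending_work(), struct driver_state, and so
on) is hypothetical, the bool load_done stands in for a struct completion owned
by the driver, and a NULL firmware pointer models "firmware not found".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are kernel APIs. */
struct fw_model { const char *data; };

struct driver_state {
	bool load_done;            /* models a driver-owned completion */
	const struct fw_model *fw; /* NULL when no firmware was found */
};

typedef void (*fw_callback)(const struct fw_model *fw, void *context);

struct pending_work {
	fw_callback cb;
	const struct fw_model *fw;
	void *context;
	bool queued;
};

/* Models the asynchronous request: queue the work and return at once. */
void request_nowait_model(struct pending_work *w, const struct fw_model *fw,
			  fw_callback cb, void *context)
{
	w->cb = cb;
	w->fw = fw;
	w->context = context;
	w->queued = true;
}

/* Models the scheduled task running later and invoking the callback. */
void run_pending_work(struct pending_work *w)
{
	if (!w->queued)
		return;
	w->queued = false;
	w->cb(w->fw, w->context);
}

/* Driver callback: record the result, then signal the completion. */
void driver_fw_callback(const struct fw_model *fw, void *context)
{
	struct driver_state *st = context;

	st->fw = fw;
	st->load_done = true; /* a real driver would call complete() here */
}

/* The driver must not tear down while the callback may still run. */
bool driver_may_unload(const struct driver_state *st)
{
	return st->load_done;
}
```

The point of the model is the ordering: driver_may_unload() stays false until
the callback has run, even when the callback is handed a NULL firmware.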

Hope this explains things a bit better,

P.

>
> Thanks,
> --
> Ming Lei