Re: [PATCH 1/3] firmware: Avoid superfluous usermodehelper lock

From: Takashi Iwai
Date: Fri May 10 2013 - 05:32:28 EST


At Fri, 10 May 2013 09:25:51 +0800,
Ming Lei wrote:
>
> On Fri, May 10, 2013 at 1:04 AM, Takashi Iwai <tiwai@xxxxxxx> wrote:
> > At Thu, 9 May 2013 16:43:28 +0800,
> > Ming Lei wrote:
> >>
> >> On Thu, May 9, 2013 at 3:31 PM, Takashi Iwai <tiwai@xxxxxxx> wrote:
> >> > At Thu, 9 May 2013 09:25:35 +0800,
> >> > Ming Lei wrote:
> >> >>
> >> >> On Thu, May 9, 2013 at 1:51 AM, Takashi Iwai <tiwai@xxxxxxx> wrote:
> >> >> >> In other words, the first patch is no essential part of the fix.
> >> >> >> I can revisit the second patch without this one and resend if
> >> >> >> preferred.
> >> >> >
> >> >> > FWIW, below is the revised patch.
> >> >> > It's alone without the patch 1 in the previous series.
> >> >>
> >> >> The root cause is that your user space loader doesn't follow the
> >> >> current firmware loader interface.
> >> >
> >> > There is not necessarily a user space "loader". It's declared as
> >> > non-hotplug, thus it can be a manual operation by human.
> >> >
> >> >> IMO, the patch is unnecessary since we already have the timeout
> >> >> abort(just need one patch to enable it for nowait api)
> >> >
> >> > Well, you cannot know any sane value for such human's operation.
> >> > If it's a system response, then we can assume something. But the
> >> > invocation of request_firmware_nowait() with non-hotplug means that
> >> > you can never know the actual use case, thus you cannot know any sane
> >>
> >> I think the use case should be driver specific, and the loading is triggered
> >> from user space in dell_rbu(write sysfs file to trigger BIOS update), so the
> >> user has been ready for loading the image.
> >
> > Yes, it's ready, but it doesn't guarantee that it's done in which time
> > limit. That's the uncertain point in such an interface.
> >
> > If it were hotplug via udev, we can assume some sane time limit.
> > But in a scenario like the above, with the manual intervention, we
> > can't know what is the sane value in a single manner.
> >
> >> For another usage(lattice-ecp3-config.c), it is merged recently and very
> >> specific(maybe only for personal use), and can be easily to change to
> >> trigger loading from user space like dell.
> >>
> >> I think both the two usages choose FW_ACTION_NOHOTPLUG
> >> because they have out of tree firmware images. So looks enabling
> >> timeout won't be a big deal for them.
> >
> > Yeah, but it's a guess.
> >
> > So, in these use cases, the practical impact would be small, I agree.
> > Changing this (adding a timeout unconditionally) eases the shutdown
> > stall. But I still feel this is no essential fix, and in general,
> > changing the user-space behavior has to be done really carefully.
>
> If there are lots of drivers which are using FW_ACTION_NOHOTPLUG,
> I agree with you. But the fact is that only two, and both uses firmware
> images which aren't shipped by distribution, so it isn't a big deal, is it?

Yeah, this particular case could be eased by the timeout, but that's
no real "fix". The blocking behavior simply makes no sense there.

> >> > timeout, too.
> >> >
> >> > And secondly, I don't think it's good to rely on the timeout. Why
> >> > does the system have to wait for minute for shutdown? The system is
> >>
> >> It won't if the user space follows the rules.
> >
> > Well... what rule? The kernel shutdown must be blocked when user
> > space doesn't write 0 or -1 to /sys/class/firmware/*/loading?
>
> Yes.
>
> > If you meant it, I would say that the rule is wrong. There is no big
>
> Why? It has been here for ten years.

No. This is not old behavior (in the sense of enterprise products :)
It was introduced in the 3.3 or 3.4 kernel, and this bug is actually
hitting our customer now. That's why I had to work on this nasty thing...
Before 3.3/3.4, there was no blocking behavior at shutdown.

And I bet that this problem was simply overlooked at that time. The fix
at that time was basically for the race at suspend/resume.

> > reason to block the *kernel* shutdown by such an action.
>
> The delay just shows that the user doesn't follow the rule.

Again, the rule is just wrong.
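
For reference, the user-space side of the "rule" under discussion can be
sketched roughly as below. This is only an illustration, not code from
any patch in this series: the function name load_firmware() and its
arguments are made up, and it assumes the usual "loading" and "data"
attributes under /sys/class/firmware/*/.

```python
import os


def load_firmware(fw_dir, image_path=None):
    """Drive the sysfs firmware handshake in fw_dir (hypothetical helper).

    fw_dir is a directory such as /sys/class/firmware/<device>.
    With image_path=None the pending request is aborted instead.
    """
    loading = os.path.join(fw_dir, "loading")
    data = os.path.join(fw_dir, "data")

    if image_path is None:
        # Writing -1 tells the kernel to give up on this request.
        with open(loading, "w") as f:
            f.write("-1")
        return False

    # Writing 1 marks the start of the transfer.
    with open(loading, "w") as f:
        f.write("1")

    # Copy the firmware image into the data attribute.
    with open(image_path, "rb") as src, open(data, "wb") as dst:
        dst.write(src.read())

    # Writing 0 marks the transfer as complete.
    with open(loading, "w") as f:
        f.write("0")
    return True
```

If nobody ever writes 0 or -1, the request simply stays pending, which
is exactly the situation that stalls the shutdown.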

> > We're handling the moment where the system should be really shut down,
> > already after all user-space things are synced and killed. For
> > example, would we delay the shutdown until all opened files are
> > closed even at this point? The kernel doesn't do so.
>
> shutdown isn't unconditional, for example, page cache must be flushed
> to disk.

Your example doesn't really fit. "Can user-space block/delay the
shutdown at the point where kernel_restart() or kernel_halt() is
called?" That's the question. The page cache flush before shutdown is
a pure kernel action, and there is no user-space involvement that could
block/delay it.

> >> > in shutdown, and it's triggered by user. It's more natural to abort
> >> > the pending f/w loading because you don't want to handle it any
> >> > longer after the system shuts down.
> >>
> >> There is still risk to force killing the loader before shutdown or
> >> suspend. Maybe some devices depend its firmware in shutdown
> >> or suspend callback to configure power setting.
> >
> > The firmware loading is never guaranteed to succeed. If the abort of
>
> Yes, but it is guaranteed that it will be ended in some time, and your
> problem is nothing to do with success, just about timing.

There is no need to wait at that point, and this is a clear
regression compared with the old kernels.

> > f/w loading at shutdown would cause any problem, it means that the
> > driver is fundamentally buggy, and we must fix it inevitably anyway.
> >
> > Of course, the suspend is a bit different issue. Maybe better to
> > retry the load after resume instead of forcibly aborting.
> > My point is that, if such critical kernel behavior like suspend or
> > shutdown rely on a timeout of user actions, it's badly designed.
>
> Anyway, if you want to force killing loader, please only kill these
> FW_ACTION_NOHOTPLUG before suspend and reboot, and do
> not touch FW_ACTION_HOTPLUG. Is it OK for you?

Note that my patch handles only the shutdown case. Let's not mix up
suspend and shutdown behavior for now.

I see no reason why we need to wait at shutdown even for
FW_ACTION_HOTPLUG. At that point, there should no longer be any
user-space action. It means that the driver can't get any more
data to finish the f/w loading from that point on. Thus the only
possible consequence is the timeout, which is equivalent to an
immediate abort of the operation.
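
To illustrate the equivalence, here is a toy user-space model. It is
not kernel code; the class PendingRequest and both helper functions are
hypothetical names invented for this sketch. Once nobody can deliver
data any more, waiting for the timeout and aborting right away end in
the same state, and only the time spent differs.

```python
import threading


class PendingRequest:
    """Toy model of a firmware request waiting for user-space data."""

    def __init__(self):
        self._finished = threading.Event()
        self.data = None

    def provide(self, data):
        # User space delivers the image (only possible while it is alive).
        self.data = data
        self._finished.set()

    def abort(self):
        # Finish the request without any data.
        self._finished.set()

    def wait(self, timeout=None):
        # Return the image, or None if nothing arrived before we stopped waiting.
        self._finished.wait(timeout)
        return self.data


def shutdown_with_timeout(req, timeout):
    # Current behavior in this model: block until the timeout expires.
    return req.wait(timeout)


def shutdown_with_abort(req):
    # Proposed behavior in this model: abort first, so wait() returns at once.
    req.abort()
    return req.wait()
```

With no user space left to call provide(), both functions return None;
the first one just takes "timeout" seconds longer to get there.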

As mentioned earlier, the suspend behavior may be different. There we
want to retry the f/w load. Ideally, the f/w loader should abort and
automatically retry after resume. In this case, too, there is no big
reason to distinguish FW_ACTION_* types. Even in the udev case, the
action can easily be retried.

Or, a cleaner solution would be to get rid of FW_ACTION_NOHOTPLUG
completely. As Kay mentioned, this was a big mistake from the very
beginning. Fortunately, there are few users, so if we are allowed
to change the interface, it'd be relatively easy to drop this mode.
And the remaining cases are usually covered by the direct f/w loading.


thanks,

Takashi