Re: [PATCH 2/8] introduce the device async action mechanism

From: Rafael J. Wysocki
Date: Mon Jul 20 2009 - 12:24:00 EST


On Friday 17 July 2009, Zhang Rui wrote:
> On Fri, 2009-07-17 at 09:11 +0800, Rafael J. Wysocki wrote:
> > On Thursday 16 July 2009, Zhang Rui wrote:
> > > On Wed, 2009-07-15 at 21:00 +0800, Arjan van de Ven wrote:
> > > > On Wed, 15 Jul 2009 15:38:36 +0800
> > > > Zhang Rui <rui.zhang@xxxxxxxxx> wrote:
> > > >
> > > > > Introduce the device async action mechanism.
> > > > >
> > > > > In order to speed up Linux suspend/resume/shutdown process,
> > > > > we introduce the device async action mechanism that allow devices
> > > > > to suspend/resume/shutdown asynchronously.
> > > > >
> > > > > The basic idea is that,
> > > > > if the suspend/resume/shutdown process of a device set,
> > > > > including a root device and its child devices, are independent of
> > > > > other devices, we create an async domain for this device set,
> > > > > and make them suspend/resume/shutdown asynchronously.
> > > >
> > > > Hi,
> > > >
> > > > I have some concerns about having an async domain per device(group)
> > > > rather than having one async domain for all of this,
> > >
> > > ïwe create an async domain ONLY if we are sure that the device group is
> > > independent with the other devices.
> > >
> > > and IMO, using multiple async domains brings real device async actions.
> > > For example, in S3 resume case, there are two device groups:
> > > device group1: device1, device2, device3
> > > device group2: device4, device5, device6
> > >
> > > If they share the global domain, we may get:ï
> > > device group1: device1(cookie 1), device2(cookie 4), device3(cookie 5)
> > > device group2: device4(cookie 2), device5(cookie 3), device6(cookie 6)
> > >
> > > this is not real asynchronous resume because
> > > device3 needs to call async_synchronize_cookie(5) before resume itself.
> > > which means that device4 and device5 must be resumed before device3.
> > >
> > > But if multiple async domain is used, we get:
> > > ïdevice group1: device1(cookie 1), device2(cookie 2), device3(cookie 3)
> > > device group2: device4(cookie 1), device5(cookie 2), device6(cookie 3)
> > >
> > > device group1 and group2 can be resumed asynchronously.
> > >
> > >
> > > Another example, in my previous test,
> > > 1. sata controller. takes 0.4s to resume.
> > > 2. usb, including uchi and ehci controller takes 1.4s to resume
> > > 3. ACPI battery takes 0.3s to resume.
> > > 3. all the other devices take 0.2s to resume.
> > >
> > > sata, usb and ACPI battery are independent device groups.
> > > If we use multiple async domains, we can reduce the total device resume
> > > time from 2.3s to a little more than 1.4s because there are a lot of
> > > sleep in usb resume process.
> > > But if we share the global async domain, the total resume time can only
> > > be reduced to about 2.1s because sata, usb and ACPI battery are actually
> > > resumed synchronously.
> >
> > Well, first, I'm not really sure that using the async _boot_ infrastructure for
> > that is a good choice.
>
> IMO, kernel/async.c provides good interfaces that can be used not only
> in boot phrase.
> it's generic enough for suspend/resume.

Well, if that were not your opinion, you certainly wouldn't post patches using
it. :-)

> > During suspend-resume we know dependencies between
> > devices beforehand, at least in theory, so we can use them.
> >
> that's why I use multiple async domains. :)
> One domain for a device group.

And how exactly the device groups are defined?

> > In particular, we have to make sure that parent devices will not be suspended
> > until all of their children have been suspended and children devices will not
> > be resumed before the parents.
>
> that's not enough.
> ïFor examples,
> ACPI battery and EC are independent devices, but EC must be resumed
> before battery because battery driver may access some EC address space
> during resume time.
> Of course the problem I describe above doesn't exist because ACPI
> battery driver doesn't have .resume method right now.
> But this is the case that works well in the current code but can not be
> converted to async device resume.
>
> > The current code handles this quite
> > efficiently, so I wonder what you're going to do not to break it.
> >
> sorry I don't quite understand.
>
> > Second, you seem to think that it only makes sense to execute ->suspend()
> > and ->resume() asynchronously (or in parallel), while for example in the case
> > of PCI ->suspend_noirq() and ->resume_noirq() also contain code that
> > potentially can take quite some time to execute.
> >
> IMO, this patch set just provides a framework that can be used for
> suspend/resume/shutdown optimization, and it doesn't solve all the
> problem at one time.
>
> IMO, the next step we should do is:
> 1. analyze the suspend/resume/shutdown process and find out the hotspot,
> i.e. which drivers suspend/resume slowly
> 2. If it's software problem that we can fix in the driver, fix it.
> like commit 24920c8a6358bf5532f1336b990b1c0fe2b599ee
> 3. If the hardware is slow, try to do it asynchronously.
> like I did in patch 8/8.
>
> This framework makes it quite easy to register an async device group.
>
> And even the suspend_noirq and resume_noirq can be covered easily with
> this framework.
> For example,
> 1. introduce two new device async actions.
> ïDEV_ASYNC_SUSPEND_NOIRQ/DEV_ASYNC_RESUME_NOIRQ
> just like what I did in patch 4/8, 5/8 and 6/8.
> 2. find out which device is slow in ->suspend_noirq and ->resume_noirq
> 3. see if we can find an async device group that including this device.
> 4. if yes, register this new async device group,
> with the type DEV_ASYNC_SUSPEND_NOIRQ/DEV_ASYNC_RESUME_NOIRQ
>
> > Finally, I don't really understand what the code in the $subject patch is
> > supposed to do. In particular, what's the purpose of dev_action()?
> > It only seems to check if func is not NULL right now. Also, you define
> > DEV_ASYNC_ACTIONS_ALL as 0, so the condition
> > if (!(DEV_ASYNC_ACTIONS_ALL & type)) in dev_async_register() is always
> > satisfied.
>
> please refer to patch 4/8 and 5/8 and 6/8
>
> Patch 2/8 is just a framework. No device async action is support yet.

OK

So please remove the redundant return statements from there and please don't
use (void *) for passing function pointers.

> And in Patch 4, 5, 6, we introduced three different types of device
> async actions, DEV_ASYNC_SUSPEND, DEV_ASYNC_RESUME and
> DEV_ASYNC_SHUTDOWN.
> I tried to split a big patch into several small patches.
> And it suggests how to adding a new device async type, like
> DEV_ASYNC_SUSPEND_NOIRQ, DEV_ASYNC_RESUME_NO_IRQ, etc. :)
>
> > Can we please discuss this thoroughly before any new patches are sent?
> >
> do I still need to resend the patch?

Yes, please.

> If yes, I'll resend patch 2/8, 3/8, 4/8, 5/8, 6/8 as a new big one. :)

Yes, please, I think that would help to understand the design.

Best,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/