Re: [GIT PULL] PM updates for 2.6.33

From: Linus Torvalds
Date: Mon Dec 07 2009 - 00:58:07 EST




On Mon, 7 Dec 2009, Zhang Rui wrote:
>
> Hi, Linus,
> can you please look at this patch set and see if the idea is right?
> http://marc.info/?l=linux-kernel&m=124840449826386&w=2
> http://marc.info/?l=linux-acpi&m=124840456826456&w=2
> http://marc.info/?l=linux-acpi&m=124840456926459&w=2
> http://marc.info/?l=linux-acpi&m=124840457026468&w=2
> http://marc.info/?l=linux-acpi&m=124840457126471&w=2

So I'm not entirely sure about that patch-set, but the thing I like about
it is how drivers really sign up to it one by one, rather than having all
PCI devices automatically signed up for async behavior.

That said, the thing I don't like about it is much the same thing I don't
necessarily like about the series in Rafael's tree either: it looks
rather over-designed, with a whole infrastructure for async device logic
(your patch in http://marc.info/?l=linux-acpi&m=124840456926459&w=2). How
would you explain that whole async_dev_register() logic in simple terms to
somebody else?

(I think yours is simpler than the one in the PM tree, but I dunno. I've
not really compared the two.)

So let me explain my dislike by trying to outline some conceptually simple
thing that doesn't have any call-backs, doesn't have any "classes",
doesn't require registration etc. It just allows drivers at any level to
decide to do some things (not necessarily everything) asynchronously.

Here's the outline:

- first off: drivers that don't know that they nest clearly don't do
anything asynchronous. No "PCI devices can be done in parallel" crap,
because they really can't - not in the general case. So just forget
about that kind of logic entirely: it's just wrong.

- the 'suspend' thing is a depth-first tree walk. As we suspend a node,
we first suspend the child nodes, and then we suspend the node itself.
Everybody agrees about that, right?

- Trivial "async rule": the tree is walked synchronously, but as we walk
it, any point in the tree may decide to do some or all of its suspend
asynchronously. For example, when we hit a disk node, the disk driver
may just decide that (a) it knows that the disk is an independent thing
and (b) it's hierarchical wrt its parent, so (c) it can do the disk
suspend asynchronously.

- To protect against a parent node being suspended before any async child
work has completed, the child suspend - before it kicks off the actual
async work - just needs to take a read-lock on the parent (read-lock,
because you may have multiple children sharing a parent, and they don't
lock each other out). Then the only thing the asynchronous code needs
to do is to release the read lock when it is done.

- Now, the rule just becomes that the parent has to take a write lock on
itself when it suspends itself. That will automatically block until
all children are done.

Doesn't the above sound _simple_? Doesn't that sound like it should just
obviously do the right thing? It sounds like something you can explain as
a locking rule without having any complex semantic registration or
anything at all.
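
To make the locking rule concrete, here is a minimal C sketch of just that
part (illustration only, not actual kernel code): the per-device
rw_semaphore and the helper names are things I'm making up here, but the
rule is exactly the one above - children read-lock the parent around their
async work, and the parent write-locks itself when it suspends.

	#include <linux/rwsem.h>

	/* Illustration only: assume every node embeds an rw_semaphore. */
	struct dev_node {
		struct dev_node *parent;
		struct rw_semaphore lock;
	};

	/*
	 * A child that wants to suspend asynchronously takes the parent's
	 * lock for reading before it kicks off the actual async work...
	 */
	static void child_kick_async_suspend(struct dev_node *child)
	{
		down_read(&child->parent->lock);
		/* ...schedule the (possibly slow) suspend work here... */
	}

	/* ...and the async work drops it once the child is fully suspended. */
	static void child_async_suspend_done(struct dev_node *child)
	{
		up_read(&child->parent->lock);
	}

	/*
	 * The parent suspends itself under its own write lock, which cannot
	 * be taken until every child has dropped its read lock - so all the
	 * async child work has finished by the time the parent suspends.
	 */
	static void suspend_parent(struct dev_node *node)
	{
		down_write(&node->lock);
		/* suspend the node itself here */
		up_write(&node->lock);
	}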

Now, the problem remains that when you walk the device tree starting off
all these potentially asynchronous events, you don't want to do that
serialization part (the "parent suspend") as you walk the tree - because
then you would only ever do one single level asynchronously. Which is why
I suggested splitting the suspend into a "pre-suspend" phase (and a
"post-resume" one). Because then the tree walk goes from

# single depth-first thing
suspend(root)
{
	for_each_child(root) {
		// This may take the parent lock for
		// reading if it does something async
		suspend(child);
	}

	// This serializes with any async children
	write_lock(root->lock);
	suspend_one_node(root);
	write_unlock(root->lock);
}

to

# Phase one: walk the tree synchronously, starting any
# async work on the leaves
suspend_prepare(root)
{
	for_each_child(root) {
		// This may take the parent lock for
		// reading if it does something async
		suspend_prepare(child);
	}
	suspend_prepare_one_node(root);
}

# Phase two: walk the tree synchronously, waiting for
# and finishing the suspend
suspend(root)
{
	for_each_child(root) {
		suspend(child);
	}
	// This serializes with any async children started in phase 1
	write_lock(root->lock);
	suspend_one_node(root);
	write_unlock(root->lock);
}

and I really think this should work.

The advantage: untouched drivers don't change ANY SEMANTICS AT ALL. If
they don't have a 'suspend_prepare()' function, then they still see that
exact same sequence of 'suspend()' calls. In fact, even if they have
children that _do_ have drivers that have that async phase, they'll never
know, because that simple write-semaphore trivially guarantees that
whether there was async work or not, it will be completed by the time we
call 'suspend()'.
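
Spelled out as a sketch (the optional 'suspend_prepare' driver method is
hypothetical here), the core's phase-one helper would simply skip drivers
that never added the new callback, so they get nothing but their usual
'suspend()' call in phase two:

	/* Sketch: 'suspend_prepare' is a hypothetical, optional method. */
	static void suspend_prepare_one_node(struct device *dev)
	{
		if (dev->driver && dev->driver->suspend_prepare)
			dev->driver->suspend_prepare(dev, PMSG_SUSPEND);
		/* No suspend_prepare? Nothing happens until the normal
		 * suspend() call in phase two - same semantics as today. */
	}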

And drivers that want to do things asynchronously don't need to register
or worry: all they do is literally

- move their 'suspend()' function to 'suspend_prepare()' instead

- add a

	down_read(dev->parent->lock);
	async_run(mysuspend, dev);

to the point that they want to be asynchronous (which may be _all_ of
it or just some slow part). The 'mysuspend' part would be the async
part.

- add a

	up_read(dev->parent->lock);

to the end of their asynchronous 'mysuspend()' function, so that when
the child has finished suspending, the parent down_write() will finally
succeed.

Doesn't that all sound pretty simple? And it has a very clear architecture
that is pretty easy to explain to anybody who knows about traversing trees
depth-first.
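
For what it's worth, a single converted driver could look roughly like the
sketch below. The rw_semaphore field on the parent device and the
'suspend_prepare' hook are the hypothetical pieces from above; the existing
async_schedule() helper from <linux/async.h> stands in for the 'async_run()'
I hand-waved about; the mydisk_*() internals are just placeholders.

	#include <linux/async.h>
	#include <linux/device.h>
	#include <linux/rwsem.h>

	/* Placeholder driver internals. */
	static void mydisk_flush_cache(struct device *dev) { /* ... */ }
	static void mydisk_spin_down(struct device *dev) { /* ... */ }

	/* The async part: the actual (slow) suspend work. */
	static void mydisk_suspend_work(void *data, async_cookie_t cookie)
	{
		struct device *dev = data;

		mydisk_flush_cache(dev);	/* slow, but independent of siblings */
		mydisk_spin_down(dev);

		/* Done: now the parent's down_write() can finally succeed. */
		up_read(&dev->parent->lock);
	}

	/* Phase one: what used to be the driver's suspend() just kicks this off. */
	static int mydisk_suspend_prepare(struct device *dev, pm_message_t state)
	{
		down_read(&dev->parent->lock);
		async_schedule(mydisk_suspend_work, dev);
		return 0;
	}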

No complex concepts. No change to existing tested drivers. No callbacks,
no flags, no nothing. And a pretty simple way for a driver to decide: I'll
do my suspends asynchronously (without parent drivers really ever even
having to know about it).

I dunno. Maybe I'm overlooking something, but the above is much closer to
what I think would be worth doing.

Linus