Re: [PATCH] - race-free suspend. Was: Re: [linux-pm] [PATCH 0/8] Suspend block api (version 8)

From: Neil Brown
Date: Wed Jun 02 2010 - 19:22:01 EST

Next message: Sven Eckelmann: "Re: [B.A.T.M.A.N.] [PATCH] drivers/staging/batman-adv: Convert MAC_FMT to %pM"
Previous message: Dave Chinner: "Re: Wrong DIF guard tag on ext2 write"
In reply to: Florian Mickler: "Re: [PATCH] - race-free suspend. Was: Re: [linux-pm] [PATCH 0/8] Suspend block api (version 8)"
Next in thread: Dmitry Torokhov: "Re: [PATCH] - race-free suspend. Was: Re: [linux-pm] [PATCH 0/8]Suspend block api (version 8)"

On Wed, 2 Jun 2010 21:05:21 +0200
Florian Mickler <florian@xxxxxxxxxxx> wrote:

> Could someone perhaps make a recap on what are the problems with the
> API? I have no clear eye (experience?) for that (or so it seems).

Good interface design is an acquired taste. And it isn't always easy to
explain satisfactorily. But let me try to explain what I see.

A key aspect of a good interface is unity, and sometimes uniformity.
For example, the file descriptor is a key element to the unity of the
Unix (and hence Posix and Linux) interface. "everything is a file" and
even when it isn't, everything is accessed via a file descriptor.
This is one of the reasons that signals cause so many problems when
programming in Unix - they aren't files, don't have file descriptors and
don't look like them at all. That is why signalfd was created, to try to tie
signals back in to the 'file descriptor' model.
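
A minimal sketch of what "a signal seen through a file descriptor" looks
like. It does not call signalfd itself: it uses Python's
signal.set_wakeup_fd() on a pipe, a related user-space technique, only
because that is available in the standard library. The function name
wait_for_signal_via_fd is made up for illustration.

```python
import os
import select
import signal


def wait_for_signal_via_fd(signum=signal.SIGUSR1, timeout=2.0):
    """Send `signum` to ourselves and observe it through a file descriptor.

    Illustrative only: this uses signal.set_wakeup_fd() on a pipe, not the
    signalfd() system call. It must be called from the main thread.
    """
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)  # set_wakeup_fd needs a non-blocking fd
    # A Python-level handler must be installed, otherwise the default
    # action (terminate, for SIGUSR1) would apply.
    old_handler = signal.signal(signum, lambda signo, frame: None)
    old_fd = signal.set_wakeup_fd(write_fd)
    try:
        os.kill(os.getpid(), signum)
        # The signal now looks like any other readable fd: select() on it.
        readable, _, _ = select.select([read_fd], [], [], timeout)
        if not readable:
            return None
        return os.read(read_fd, 1)[0]  # one byte per signal: its number
    finally:
        signal.set_wakeup_fd(old_fd)
        if old_handler is None:
            old_handler = signal.SIG_DFL
        signal.signal(signum, old_handler)
        os.close(read_fd)
        os.close(write_fd)
```

Once the signal arrives as a readable descriptor, it can sit in the same
select/poll loop as sockets and pipes, which is the unity being argued for.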

So unity is important. Adding new concepts is best done as an extension of
an existing concept. That means that all the infrastructure, not only code
and design but also developer understanding, can be leveraged to help get the
new concept *right* first time. It also means that using the new concept is
easier to learn.

So the problem with the wake-locks / suspend-blockers (and I've actually come
to like the first name much more) is that they introduce a new concept without
properly leveraging existing concepts.

The new concept is opportunistic suspend, though maybe a better name would be
automatic suspend - not sure.

There appear to be two ways you can get opportunistic suspend leveraging
already-existing concepts.

One is to leverage the current "power/state = mem" architecture and just let
userspace choose the opportune moment. The user-space daemon that chooses
this moment would need full information about states of various things to do
this, but sysfs is good at providing full information about what is in the
kernel, and there are dozens of ways for user-space processes to communicate
their state to each other. So this is all doable today without introducing
new design concepts.
Except there is a race between suspending and new events, so we just need to
fix the race. Hence my patch.
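
The shape of that race, and one general way of closing it, can be sketched
with a toy model. This is not the patch referred to above: the class
ToyKernel, its event counter, and try_suspend are hypothetical names that
only illustrate the check-then-act problem.

```python
class ToyKernel:
    """Hypothetical stand-in for the kernel side; not a real interface."""

    def __init__(self):
        self.wakeup_events = 0   # incremented for every wakeup event
        self.suspended = False

    def report_event(self):
        self.wakeup_events += 1
        self.suspended = False   # an event always wakes the system

    def suspend_racy(self):
        # Like blindly writing "mem": no way to notice a late event.
        self.suspended = True
        return True

    def try_suspend(self, events_seen):
        # Race-free variant: refuse if anything happened since the caller
        # sampled the counter, so the caller can re-evaluate.
        if events_seen != self.wakeup_events:
            return False
        self.suspended = True
        return True


def daemon_decides(kernel, late_event, race_free):
    """One pass of a user-space suspend daemon; True if the system slept."""
    seen = kernel.wakeup_events          # 1. sample state, decide to suspend
    if late_event:
        kernel.report_event()            # 2. an event arrives in the window
    if race_free:
        return kernel.try_suspend(seen)  # 3a. the stale decision is rejected
    return kernel.suspend_racy()         # 3b. the event is slept through
```

In this model the decision stays in user-space; the kernel only has to make
the final check and the suspend itself atomic with respect to new events.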

The other is to leverage the more general power management infrastructure.
We can already detect when the processor won't be needed for a while, and put
it into a low-power state. We can already put devices to sleep when they
aren't being used. We can just generalise this so that we can detect when
all devices are either unused, or capable of supporting an S3 transition, and
detect when the next timer interrupt is far enough in the future that S3
latency won't be a problem - set the rtc alarm to match the next timer and go
to S3. All completely transparent. (I admit I'm not entirely sure what the
qos that is being discussed brings us, but I assume it is better quality
rather than correctness).
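
The condition described here can be written down as a small sketch. The
Device fields, the function name, and the rule that "far enough" means
"longer than the S3 latency" are assumptions made for illustration, not an
existing kernel interface.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    in_use: bool
    supports_s3: bool  # can this device go through an S3 transition while in use?


def s3_alarm_delay(devices, seconds_to_next_timer, s3_latency_seconds):
    """Return the RTC alarm delay to program, or None if S3 should be skipped."""
    for dev in devices:
        # Every device must be either unused or able to support S3.
        if dev.in_use and not dev.supports_s3:
            return None
    # The next timer must be far enough away that S3 latency is not a problem.
    if seconds_to_next_timer <= s3_latency_seconds:
        return None
    # Set the alarm to match the next timer, then go to S3.
    return seconds_to_next_timer
```

For example, with an idle display and an in-use but S3-capable network
device, a timer 60 seconds away and a 2 second latency would give an alarm
delay of 60.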

So there are at least two different ways that opportunistic suspend could be
integrated into existing infrastructure with virtually no change of interface
and no new concepts - just refining or extending existing concepts.

Yet the approach used and preferred by Android is to create something
substantially new. Yes, it does use the existing suspend infrastructure, but
in a very new and different way. Suspend is now initiated by the kernel, but
in a completely different way to the ways that the kernel already initiates
power saving. So we have two infrastructures for essentially one task.
Looked at the other way, it moves the initiation of suspend from user-space
into the kernel, and then allows user-space to tell the kernel not to suspend.
That to me is very ugly.
In general, the kernel should provide information to user-space, and provide
services to user-space, and user-space should use that information to decide
what services to request. This is the essence of the "policy lives in
user-space" maxim.
The Android code has user-space giving information to the kernel, so the
kernel can make a policy decision. This approach is generally less flexible
and is best avoided.

Just as a bit of background, let's think about some of the areas in the
kernel where the kernel does make policy decisions based on user-space input.
- the scheduler - based on 'nice' setting it decides who should run when
- the VM - based on read-ahead settings, madvise/fadvise, recent-use
heuristics, it tries to decide what to keep in memory and what to swap out.
I think those are the main ones. There are other smaller fish like the
choice of IO scheduler and various ways to tune network connections.
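
As a small illustration of user-space feeding hints into those two
subsystems, here is a sketch using calls from Python's os module. The
function name is made up, and os.posix_fadvise is only used where the
platform provides it.

```python
import os
import tempfile


def give_hints_to_kernel():
    """Pass a scheduling query and a page-cache hint; the kernel still decides."""
    # An increment of 0 changes nothing and just reports the current niceness.
    niceness = os.nice(0)
    advised = False
    with tempfile.TemporaryFile() as f:
        f.write(b"x" * 4096)
        f.flush()
        if hasattr(os, "posix_fadvise"):
            # Hint that the file will be read sequentially (affects read-ahead).
            os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_SEQUENTIAL)
            advised = True
    return niceness, advised
```

In both cases user-space only supplies input; what actually runs or stays in
memory is a policy decision made inside the kernel.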

But the two big ones are perfect examples of subsystems that have proven very
hard to get *right*, and have been substantially re-written more than once.
In each case, the difficulty wasn't how to perform the work, it was the
choice of what work to perform. It probably also involved getting different
sorts of information about the current state.

That perspective leaves me very sceptical of any design that involves making
policy decisions in the kernel. It is too easy to get wrong, then too hard
to change.
Admittedly the power subsystem does seem to make policy decisions in the
kernel, via the various governors. Though I don't know much about how these
work, it seems significant that there is a pluggable infrastructure with
multiple governors, and one of them leaves the decisions to user-space.

So that is what I see as wrong with the Android API: it doesn't bring unity
by simply leveraging existing infrastructure, and it makes policy decisions
in the kernel.

>
> > It is a pity that this extra requirement was not clear from your introduction
> > to the "Opportunistic suspend support" patch.
>
> I think that the main problem was that _all_ the requirements were
> not communicated well. That caused everybody to think that their
> solution would be a better fit. You are not alone.
>
> > If that be the case, I'll stop bothering you with suggestions that can never
> > work.
> > Thanks for your time,
> > NeilBrown
>
> Don't be frustrated. What should Arve be? :)
>

Sometimes appearing frustrated elicits a different style of response than
appearing polite and constructive, and that can be helpful.
And yes: I fully understand that Arve would be frustrated. There seems to
be a big disconnect in perceptions of what problem is being solved, and
thus disconnects in what the solution should look like, and I suspect that
would be very frustrating all around.

NeilBrown
