Re: [patch] PID namespace design bug, workaround

From: Ingo Molnar
Date: Sat Nov 03 2007 - 16:13:33 EST



* Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Fri, 2 Nov 2007, Dave Hansen wrote:
> >
> > There are certainly more of these, but here is one. In the futex
> > userspace address, we install the current pid's vnr into a userspace
> > address.
>
> Now, realistically, why not just say "you can't use these things
> across namespaces"? Does anybody really care? After all, somebody who
> screws this up only screws himself, not anybody else.

i see two main categories of problems:

- one problem is that this condition is 'invisible'. If two namespaces
happen to access the same robust futex (say a yum update from two
PID namespaces sharing the same read-mostly filesystem) there's silent
breakage and data corruption due to PID overlap. The other
namespaces have no such problems. I think the "dont do that" answer is
lame because most apps _will_ work across PID namespaces because
things like fcntl-based locking do work. And there's no valid
technical excuse why futexes shouldnt work: it's all controlled by the
same native kernel, there's no untrusted network separating the nodes,
etc.

- so via this we isolate an important category of syscalls from
cross-namespace use perhaps forever. Pick just about any other kernel
resource and they can be shared between namespaces. But not futexes -
which happen to be the most scalable locking primitive and people will
almost certainly want to use them across namespaces. A
completely new breed of futexes has to be introduced and trickled
through userspace and all the architectures to make it work again
across namespaces. Who will do that work? Generally the people who
introduce a new concept are the ones who should do that. But in this
case they are apparently not interested in making it generic enough
(they are concentrating on their 'isolate it all' aspect) so
nobody else will do it and we are stuck with an incomplete concept.
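
A minimal sketch of the PID overlap from the first point, as a pure
userspace simulation rather than kernel code. The three constants follow
the futex word layout in <linux/futex.h>; the Task class, try_lock() and
owner_matches() are hypothetical names made up for illustration, and the
PID value 1234 is arbitrary. It only shows why a namespace-local number
stored in a shared lock word cannot identify the owner uniquely:

```python
# Futex word layout, as defined in <linux/futex.h>.
FUTEX_WAITERS = 0x80000000
FUTEX_OWNER_DIED = 0x40000000
FUTEX_TID_MASK = 0x3FFFFFFF


class Task:
    """Hypothetical stand-in for a task: its namespace and the PID it sees there."""

    def __init__(self, namespace, vnr):
        self.namespace = namespace
        self.vnr = vnr


def try_lock(word, task):
    """Return the new lock word if the lock was free, else None."""
    if (word & FUTEX_TID_MASK) == 0:
        # Keep the waiters bit and store the namespace-local PID as owner.
        return (word & FUTEX_WAITERS) | task.vnr
    return None


def owner_matches(word, task):
    """The only ownership check the word allows: compare the stored number."""
    return (word & FUTEX_TID_MASK) == task.vnr


# Two distinct tasks that happen to have the same number in their
# respective PID namespaces (assumption: both map the same lock word).
task_a = Task("ns-A", 1234)
task_b = Task("ns-B", 1234)

word = try_lock(0, task_a)          # task_a takes the lock

a_is_owner = owner_matches(word, task_a)   # correct
b_is_owner = owner_matches(word, task_b)   # wrong, but indistinguishable
```

Both checks succeed even though only task_a ever took the lock, which is
the silent misidentification that robust-futex cleanup would then act on.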

The answer of user-space/apps is predictable: they'll gravitate towards
the path of least resistance, and that will be "dont use futexes". PID
namespaces basically single out an important API category and use the
natural pressure of the other 300 syscalls and tens of thousands of apps
against this category. Linux is basically used against itself. The
counter-force is relatively weak and there's no solution available _at
all_ presently so it's not even the fight of patches against each other,
it's the sheer lack of a feature which has an obvious end-result.

We've already got way too many incomplete concepts and APIs in the
kernel. Maybe i'm over-worrying, but i fear we'll end up as we did with
capabilities or sendfile - code merged too soon and never completed for
many years - perhaps never completed at all. VMS and WNT did those
things a bit better i think - their API frameworks were/are pervasive
and complete, even in the corner cases.

Whether it's the right approach to force reasonable perfection of
frameworks like this from the get go is another question - but in
practice even for relatively popular new APIs like epoll we see movement
towards the 'completion of the API' that is way too slow, and that hinders
adoption of new APIs very much. (With splice being a notable exception -
there the central concept was so strong that it quickly pushed itself to
total completion - combined with a capable maintainer of the API.) But
it's not that easy for futexes and we put another roadblock in the path
of futexes.

Ingo