Re: [patch] PID namespaces

From: Ingo Molnar
Date: Sun Nov 04 2007 - 05:39:24 EST



(changed the Subject line)

* Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Sat, 3 Nov 2007, Ingo Molnar wrote:
> >
> > - one problem is that this condition is 'invisible'. If two
> > namespaces happen to access the same robust futex (say a yum
> > update from two PID namespaces sharing the same read-mostly
> > filesystem) there's silent breakage and data corruption due to PID
> > overlap.
>
> .. and this is in *no* way different from thousands of applications
> that write their pid to lock-files, and others decide that it's
> "stale" because using "kill(pid, 0)" returns that the pid doesn't
> exist any more.
>
> The solution? You can't do that kind of locking over NFS, or across
> pid namespaces. Nobody blames NFS or pid namespaces for it.

the difference to NFS is that for PID namespaces we do have a single
trusted kernel that fully controls all the domains so there's no obvious
"hard barrier of trust" that people could perceive as a showstopper.

We've got a global kernel and unlike other namespaces there's (almost)
no "directed allocation" done of specific PIDs (unlike files, socket
addresses or fds). So the PID is a cookie that is 99.9% shaped _by the
kernel already_. [there are a few exceptions but those are much less
problematic than the lack of global PIDs is] So we might as well shape
the cookies in a way that keeps them global. What is the technological
reason for not keeping PIDs globally unique? We've cited a good number
of reasons why it's desirable - it's a pretty damn useful cookie for
identifying tasks. (it's also very scalable - PID -> task lookup is
completely lockless.)

I.e. keep the namespace functionality but use a modulo 1.000.000 base
for the PIDs so that it all looks nicer to the user. Minimal visibility
difference but maximum compatibility. (The resulting limits are
reasonable: 1 million tasks per container and 4 million containers on a
single 32-bit box.) We could still restrict cross-namespace API use but
all the cases where a global PID is desirable would still all work. I
might be missing something obvious though.

The reason why i bring this up now is because 2.6.24 is an
all-or-nothing flag day for this detail. Once it's out there we wont
realistically be able to change any of these details. (And in general
i'm very supportive of the containers concept - a year ago at the KS i
was one of the very few proponents of quickly merging containers into
the kernel.)

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/