Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace
From: Ian Kent
Date: Mon Feb 23 2015 - 19:50:49 EST
On Mon, 2015-02-23 at 09:52 -0500, J. Bruce Fields wrote:
> On Sat, Feb 21, 2015 at 11:58:58AM +0800, Ian Kent wrote:
> > On Fri, 2015-02-20 at 14:05 -0500, J. Bruce Fields wrote:
> > > On Fri, Feb 20, 2015 at 12:07:15PM -0600, Eric W. Biederman wrote:
> > > > "J. Bruce Fields" <bfields@xxxxxxxxxxxx> writes:
> > > >
> > > > > On Fri, Feb 20, 2015 at 05:33:25PM +0800, Ian Kent wrote:
> > > >
> > > > >> The case of nfsd state-recovery might be similar but you'll need to help
> > > > >> me out a bit with that too.
> > > > >
> > > > > Each network namespace can have its own virtual nfs server. Servers can
> > > > > be started and stopped independently per network namespace. We decide
> > > > > which server should handle an incoming rpc by looking at the network
> > > > > namespace associated with the socket that it arrived over.
> > > > >
> > > > > A server is started by the rpc.nfsd command writing a value into a magic
> > > > > file somewhere.
> > > >
> > > > nit. Unless I am completely turned around that file is on the nfsd
> > > > filesystem, that lives in fs/nfsd/nfs.c.
> > > >
> > > > So I believe this really is a case of figuring out what we want the
> > > > semantics to be for mount and propagating the information down from
> > > > mount to where we call the user mode helpers.
> > >
> > > Oops, I agree. So when I said:
> > >
> > > The upcalls need to happen consistently in one context for a
> > > given virtual nfs server, and that context should probably be
> > > derived from rpc.nfsd's somehow.
> > >
> > > Instead of "rpc.nfsd's", I think I should have said "the mounter of
> > > the nfsd filesystem".
> > >
> > > Which is already how we choose a net namespace: nfsd_mount and
> > > nfsd_fill_super store the current net namespace in s_fs_info. (And then
> > > grep for "netns" to see the places where that's used.)
> >
> > This is going to be mostly a restatement of what's already been said,
> > partly for me to refer back to later and partly to clarify and confirm
> > what I need to do, so prepare to be bored.
> >
> > As a result of Oleg's recommendations and comments, the next version of
> > the series will take a reference to an nsproxy and a user namespace
> > (from the init process of the calling task, while it's still a child of
> > that task), it won't carry around task structs. There are still a couple
> > of questions with this so it's not quite there yet.
> >
> > We'll have to wait and see if what I've done is enough to remedy Oleg's
> > concerns too. LOL, and then there's the question of how much I'll need
> > to do to get it to actually work.
> >
> > The other difference is obtaining the context (now nsproxy and user
> > namespace) has been taken entirely within the usermode helper. I think
> > that's a good thing from the calling process isolation requirement. That
> > may need to change again based on the discussion here.
> >
> > Now we're starting to look at actual usage it's worth keeping in mind
> > that how to execute within required namespaces has to be sound before we
> > tackle use cases that have requirements over this fundamental
> > functionality.
> >
> > There are a couple of things to think about.
> >
> > One thing that's needed is how to work out if the UMH_USE_NS is needed
> > and another is how to provide persistent usage of particular
> > namespaces across containers. The latter probably will relate to the
> > origin of the file system (which looks like it will be identified at
> > mount time).
> >
> > The first case is when the mount originates in the root init namespace
> > and most of the time (if not all the time) the UMH_USE_NS doesn't need
> > to be set and the helper should run in the root init namespace.
>
> The helper always runs in the original mount's container. Sometimes
> that container is the init container, yes, but I don't see what value
> there is in setting a flag in that one case.
Yep, that's pretty much what I meant.
>
> > That
> > should work for mount propagation as well with mounts bound into a
> > container.
> >
> > Is this also true for automounted mounts at mount point crossing? Or
> > perhaps I should ask, should automounted NFS mounts inherit the property
> > from their parent mount?
>
> Yes. If we run separate helpers in each container, then the superblocks
> should also be separate (so that one container can't poison cached
> values used by another). So the containers would all end up with
> entirely separate superblocks for the submounts.
That's almost what I was thinking.
The question relates to a mount for which the namespace proxy would have
been set at mount time in a container and then bound into another
container (in Docker by using the "--volumes-from <name>"). I believe
the namespace information from the original mount should always be used
when calling a usermode helper. This might not be a sensible question
now but I think it needs to be considered.
>
> That seems inefficient at least, and I don't think it's what an admin
> would expect as the default behavior.
LOL, but the best way to manage this is to set the namespace information
at mount time (as Eric mentioned long ago) and use that everywhere. It's
consistent and it provides a way for a process with appropriate
privilege to specify the namespace information.
>
> > The second case is when the mount originates in another namespace,
> > possibly a container. TBH I haven't thought too much about mounts that
> > originate from namespaces created by unshare(1) or other sources yet. I'm
> > hoping that will just work once this is done, ;)
>
> So, one container mounts and spawns a "subcontainer" which continues to
> use that filesystem? Yes, I think helpers should continue to run in the
> container of the original mount, I don't see any tricky exception here.
That's what I think should happen too.
>
> > The last time I tried binding NFS mounts from one container into another
> > it didn't work,
>
> I'm not sure what you mean by "binding NFS mounts from one container
> into another". What exactly didn't work?
It's the volumes-from Docker option I'm thinking of.
I'm not sure now if my statement is accurate. I'll need to test it
again. I thought I had but what didn't work with the volumes-from might
have been autofs not NFS mounts.
Anyway, I'm going to need to provide a way for clients to say "calculate
the namespace information and give me an identifier so it can be used
everywhere for this mount" which amounts to maintaining a list of the
namespace objects.
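
A minimal user-space sketch of that idea, in C, might look like the
following. Every name here (ns_info, ns_info_register, ns_info_lookup,
ns_info_release) is hypothetical, the nsproxy and user namespace are
modelled as opaque pointers, and there is no locking or reference
counting, so this only illustrates the handle-keyed list and not how the
kernel side would actually be written.

```c
/* User-space model only: hypothetical names, no locking, not kernel code. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the saved namespace information of one mount. */
struct ns_info {
	int handle;            /* identifier handed back to the caller */
	const void *nsproxy;   /* opaque stand-in for a struct nsproxy reference */
	const void *user_ns;   /* opaque stand-in for a user namespace reference */
	struct ns_info *next;  /* singly linked list of all saved entries */
};

static struct ns_info *ns_info_list;
static int ns_info_next_handle = 1;

/* Save the namespace information once; returns a handle > 0, or -1. */
int ns_info_register(const void *nsproxy, const void *user_ns)
{
	struct ns_info *info;

	/* Caller contract in this model: both objects must be supplied. */
	assert(nsproxy != NULL && user_ns != NULL);

	info = malloc(sizeof(*info));
	if (!info)
		return -1;
	info->handle = ns_info_next_handle++;
	info->nsproxy = nsproxy;
	info->user_ns = user_ns;
	info->next = ns_info_list;
	ns_info_list = info;
	return info->handle;
}

/* Find the saved entry for a handle; NULL if the handle is unknown. */
const struct ns_info *ns_info_lookup(int handle)
{
	const struct ns_info *info;

	for (info = ns_info_list; info; info = info->next) {
		if (info->handle == handle)
			return info;
	}
	return NULL;
}

/* Unlink and free the entry for a handle; 0 on success, -1 if unknown. */
int ns_info_release(int handle)
{
	struct ns_info **p;

	for (p = &ns_info_list; *p; p = &(*p)->next) {
		if ((*p)->handle == handle) {
			struct ns_info *info = *p;

			*p = info->next;
			free(info);
			return 0;
		}
	}
	return -1;
}
```

In this model a mount would call ns_info_register() once at mount time
and keep the returned handle, pass it to ns_info_lookup() whenever a
helper has to be run, and call ns_info_release() when the mount is freed.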
I'm not sure yet if I should undo some of what I've done recently or
leave it for users who need a straight "execute me now within the
current namespace".
>
> --b.
>
> > but if we assume that will work at some point then, as
> > Bruce points out, we need to provide the ability to record the
> > namespaces to be used for subsequent "in namespace" execution while
> > maintaining caller isolation (i.e. derived from the caller's init
> > process).
> >
> > I've been aware of the need for persistence for a while now and I've
> > been thinking about how to do it but I don't have a clear plan quite
> > yet. Bruce, having noticed this, has described details about the
> > environment I have to work with so that's a start. I need the thoughts
> > of others on this too.
> >
> > As a result I'm not sure yet if this persistence can be integrated into
> > the current implementation or if additional calls will be needed to set
> > and clear the namespace information while maintaining the needed
> > isolation.
> >
> > As Bruce says, perhaps the namespace information should be saved as
> > properties of a mount or perhaps it should be a list keyed by some
> > handle, the handle being the saved property. I'm not sure yet but the
> > latter might be unnecessary complication and overhead. The cleanup of the
> > namespace information upon summary termination of processes could be a
> > bit difficult, but perhaps it will be as simple as making it a function
> > of freeing of the object it's stored in (in the cases we have so far
> > that would be the mount).
> >
> > So, yes, I've still got a fair way to go yet, ;)
> >
> > Ian