Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

From: Djalal Harouni
Date: Fri Feb 17 2017 - 03:39:35 EST


On Fri, Feb 17, 2017 at 2:57 AM, Eric W. Biederman
<ebiederm@xxxxxxxxxxxx> wrote:
> James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> writes:
>
>> On Thu, 2017-02-16 at 11:42 -0500, Vivek Goyal wrote:
>>> On Thu, Feb 16, 2017 at 07:51:58AM -0800, James Bottomley wrote:
>>>
>>> [..]
>>> > > Two levels of checks will simplify this a bit. Top level inode
>>> > > will belong to the user namespace of caller and checks should
>>> > > pass. And mounter's creds will have ownership over the real inode
>>> > > so no additional namespace shifting required there.
>>> >
>>> > That's the problem: for a marked mount, they don't.
>>>
>>> In this new model it does not fit directly.
>>>
>>> I was playing with a slightly different approach and modified patches
>>> so that real root still does the mounting and takes an mount option
>>> which specifies which user namespace we want to shift into. Thanks to
>>> Eric for the idea.
>>>
>>> mount -t shiftfs -o userns_fd=<fd> source shifted-fs
>
>> This is a non-starter because it doesn't work for the unprivileged use
>> case, which is what I'm really interested in.
>
> But I believe it does. It just requires a bit more work for in the
> shiftfs filesystem above. It should be perfectly possible with the help
> of newuidmap to create a user namespace with the desired mappings.
>
> My understanding is that Vivek started with requiring root to mount the
> filesystem only as a simplification during development, and that the
> plan is to get the basic use case working and then allow unprivileged
> mounting.
>
>> For fully unprivileged
>> containers you don't have an orchestration system to ask to build the
>> container. You can get init scripts to set stuff up for you, like the
>> marks, but ideally it should just work even without that (so an inode
>> flag following project semantics seems really appealing), but after
>> that the unprivileged user should be able to build their own
>> containers.
>>
>> As you saw from the reply to Eric, this approach (which I have tried)
>> also opens up a whole can of worms for non-FS_USERNS_MOUNT filesystems.
>>
>
> From what I can see we have two cases we care about.
> A) A non-default mapping from the filesystem to the rest of the system
> and roughly s_user_ns provides that but we need a review of the
> filesystems to make certain something has not been forgotten.
>
> B) A filesystem image sitting around in a directory somewhere that
> we want to map differently into different user namespaces while
> using the same files as backing store.
>
> For the second case what is interesting technically is that we want
> multiple mappings. A user namespace appears adequate to specify those
> extra mappings (effectively from kuids to kuids).
>
> So we need something to associate the additional mapping with a
> directory tree. A stackable filesystem with it's own s_user_ns field
> appears a very straight forward way to do that. Especially if it can
> figure out how to assert that the underlying filesystem image is
> read-only (doesn't overlayfs require that?). Making the entire stack
> read-only.
>
> I don't see a problem with that for unprivileged use (except possibly
> the read-only enforcement on the unerlying directory tree).
>
> What Vivek is talking about appears to be perfectly correct. Performing
> the underlying filesystem permission checks as the possibly unprivileged
> user who mounted shiftfs. After performing a set of permission checks
> (at the shiftfs level) as the user who is accessing the files.
>
>
> . . .
>
>
> I think I am missing something but I completely do not understand that
> subthread that says use file marks and perform the work in the vfs.
> The problem is that fundamentally we need multiple mappings and I don't
> see a mark on a file (even an inherited mark) providing the mapping so I
> don't see the point.
>
> Which leaves two possible places to store the extra mapping. In the
> struct mount. Or in a stacked filesystem super_block. For a stacked
> filesystem I can see where to store the extra translation. In the upper
> filesystems upper inode. And we can perform the practical permission
> check on the upper inode as well.
>
> For a vfs level solution it looks like we would have to change all of
> the permission checking code in the kernel to have a special case for
> this kind of mapping. Which does not sound maintainable.

Facts: for basic permissions: 3 files changed, 19 insertions(+), 6
deletions(-) https://lkml.org/lkml/2016/5/4/417

That made permissions work for basically *all* filesystems. However
yes it does not handle xattr acls...

> So at the moment I don't think a vfs level solution makes any sense.
>
The permissions change was already done when userns were merged. What
you may need is VFS accessors, instead of working directly on
inode->i_uid ask the VFS to give you the right i_uid (which can also
be the case of projectid proposed by Christoph iff I got it right...)
you need it for both ways: to report to userspace and the other way to
pass it to the underlying filesystem for writes/quota which Dave
Chinner pointed out.

Any way seems the ship has settled, so my thoughts at that time were
to follow the change made for i_uid_read(), i_gid_read() helpers where
userns were merged. The code comment says: "Helper functions so that
in most cases filesystems will not need to deal directly with kuid_t
and kgid_t" so the start was from there: VFS should be the one to
handle everything using accessors for both directions. Now if you guys
think that having multiple user namespaces contexts for every
container, mount namepsace user namespace, s_user_ns and shiftfs user
ns ... or multiple APIs that will just add confusion, me I can see
this directly with orchestration/containers developers they just don't
understand what's happening... ? they want something like bind mounts!
A new filesystem is a new filesystem.

Maybe Eric you will find something useful from these comments.

Thanks!


--
tixxdz