Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

From: Djalal Harouni
Date: Thu May 05 2016 - 18:24:59 EST


On Thu, May 05, 2016 at 10:23:14AM +1000, Dave Chinner wrote:
> On Wed, May 04, 2016 at 04:26:46PM +0200, Djalal Harouni wrote:
> > This is version 2 of the VFS:userns support portable root filesystems
> > RFC. Changes since version 1:
> >
> > * Update documentation and remove some ambiguity about the feature.
> > Based on Josh Triplett comments.
> > * Use a new email address to send the RFC :-)
> >
> >
> > This RFC tries to explore how to support filesystem operations inside
> > user namespace using only VFS and a per mount namespace solution. This
> > allows to take advantage of user namespace separations without
> > introducing any change at the filesystems level. All this is handled
> > with the virtual view of mount namespaces.
>
> [...]
>
> > As an example if the mapping 0:65535 inside mount namespace and outside
> > is 1000000:1065536, then 0:65535 will be the range that we use to
> > construct UIDs/GIDs mapping into init_user_ns and use it for on-disk
> > data. They represent the persistent values that we want to write to the
> > disk. Therefore, we don't keep track of any UID/GID shift that was applied
> > before, it gives portability and allows to use the previous mapping
> > which was freed for another root filesystem...
>
> So let me get this straight. Two /isolated/ containers, different
> UID/GID mappings, sharing the same files and directories. Create a
> new file in a writeable directory in container 1, namespace
> information gets stripped from on-disk uid/gid representation.
>
> Container 2 then reads that shared directory, finds the file written
> by container 1. As there is no namespace component to the uid:gid
> stored in the inode, we apply the current namespace shift to the VFS
> inode uid/gid and so it maps to root in container 2 and we are
> allowed to read it?

Only if container 2 has the CLONE_MNTNS_SHIFT_UIDGID flag set in its own
mount namespace, which only root can set (or which was already set in
the parent), and has access to the shared directory, which the container
manager should also have configured beforehand. Even then, if the shift
flag is not set in container 2, there is no mapping and things work as
they do now. But yes, this setup is flawed! The two containers should
not share a root filesystem; maybe in rare cases some user data, that's
it.
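For concreteness, here is a sketch of how a container manager could establish the mapping being discussed, using the existing user-namespace uid_map/gid_map interface. Note that the CLONE_MNTNS_SHIFT_UIDGID flag itself is introduced by this RFC and is not in mainline, and the pid below is made up for illustration:

```shell
# Hypothetical container init pid; writing the map files needs privilege
# over the target process (here they are shown commented out).
child_pid=1234
# Format is "inside outside count": inside uids 0..65535 map to
# outside uids 1000000..1065535.
map="0 1000000 65536"
# echo "$map" > /proc/$child_pid/uid_map
# echo "$map" > /proc/$child_pid/gid_map

# With that mapping in place, an inside uid U is seen outside as
# 1000000 + U:
inside_uid=0
echo "outside view: $((1000000 + inside_uid))"
```

With the RFC's shift flag set on the mount namespace, only the inside value would be what ends up on disk.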


> Unless I've misunderstood something in this crazy mapping scheme,
> isn't this just a vector for unintentional containment breaches?
>
> [...]
>
> > Simple demo overlayfs, and btrfs mounted with vfs_shift_uids and
> > vfs_shift_gids. The overlayfs mounts will share the same upperdir. We
> > create two user namespaces, every one with its own mapping and where
> > container-uid-2000000 will pull changes from container-uid-1000000
> > upperdir automatically.
>
> Ok, forget I asked - it's clearly intentional. This is beyond
> crazy, IMO.

This setup is flawed! That example was only meant to show that files
show up with the right mapping under two different user namespaces. As
Andy noted, they should have a backing device...

Anyway, what I meant in the previous paragraph is that when the
container terminates it releases the UID shift range, which can later be
re-used on another filesystem, or on the same one as before... whatever.
Now if the range is already in use, userspace should grab a new range
and give it to a new filesystem, or to a previous one which doesn't need
to be shared, and everything should continue to work...


A simple example with loop devices follows; ideally the image should
carry a GPT (GUID partition table) or an MBR...

$ sudo dd if=/dev/zero of=/var/lib/machines/fedora-newtree.raw bs=10M count=100
$ sudo mkfs.ext4 /var/lib/machines/fedora-newtree.raw
...
$ sudo mount -t ext4 -o loop,rw,sync /var/lib/machines/fedora-newtree.raw /mnt/fedora-tree
$ sudo yum -y --releasever=23 --installroot=/mnt/fedora-tree --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim
$ sudo umount /mnt/fedora-tree
$ sudo mount -t ext4 -o loop,vfs_shift_uids,vfs_shift_gids /var/lib/machines/fedora-newtree.raw /mnt/fedora-tree
$ sudo ~/container --uidmap {1000000:1065536 or
                             2000000:2065536 or
                             3000000:3065536 ...}
(That's the mapping outside of the container)
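To make explicit what the demo claims, under the proposed scheme the on-disk uid is the inside (namespace-relative) value, and each mount namespace only re-applies its own offset when presenting it. A sketch of the arithmetic as I understand the proposal, not kernel code:

```shell
# Container root (inside uid 0) creates a file: the shift is stripped
# and the filesystem stores the plain inside value.
disk_uid=0

# Each container's mount namespace re-applies its own offset on read,
# so the same on-disk file appears owned by that container's root:
for base in 1000000 2000000 3000000; do
    echo "host-side view: $((base + disk_uid))"
done
```

This is why the range freed by one container can be handed to another filesystem later: nothing range-specific is ever persisted.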



> > 3) ROADMAP:
> > ===========
> > * Confirm current design, and make sure that the mapping is done
> > correctly.
>
> How are you going to ensure that all filesystems behave the same,
> and it doesn't get broken by people who really don't care about this
> sort of crazy?

By trying to make this a VFS mount namespace parameter. So if the
shift is not set on the mount namespace then we just fall back to the
current behaviour! No shift is performed.

Later, of course, I'll run xfstests and several other tests...

Does this answer your question?


> FWIW, having the VFS convert things to "on-disk format" is an
> oxymoron - the "V" in VFS means "virtual" and has nothing to do with
> disks or persistent storage formats. Indeed, let's convert the UID
> to "on-disk" format for a network filesystem client....

Hehe! Are you sure it's not already done somewhere? The wording can be
changed to "to-fs"!


> .....
> > * Add XFS support.
>
> What is the problem here?

Yep, sorry, just lack of time on my part! XFS is currently a bit aware
of the kuid/kgid mapping on its own, and I just didn't have the
appropriate time! I will try to address it in the next version.

> Next question: how does this work with uid/gid based quotas?

If you do a shift you should know that you will share quota on disk. In
all cases, to activate the behaviour you also have to set the options
during mount... but it will be documented, and it will be recommended to
use different device nodes, loop devices, MBR or GPT partitions, block
devices, LVM volumes or anything else, mounted with XFS or any other
filesystem that supports this shift, with the flags set at mount time.
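In other words, each container should get its own backing image, so quota accounting stays per-container rather than shared. A hedged sketch of what the manager-side setup could look like (paths and sizes are illustrative; vfs_shift_uids/vfs_shift_gids are the mount options proposed by this RFC, and the privileged steps are shown commented out):

```shell
# One image per container: quota state then lives in each image.
for n in 1 2; do
    img=/tmp/container-$n.raw
    # Allocate a small backing file for this container's filesystem.
    dd if=/dev/zero of="$img" bs=1M count=64 status=none
    # Privileged steps, commented out in this sketch:
    # mkfs.ext4 -q "$img"
    # mkdir -p /mnt/container-$n
    # mount -t ext4 -o loop,vfs_shift_uids,vfs_shift_gids "$img" /mnt/container-$n
    echo "prepared $img"
done
```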


> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx


Thank you!

--
Djalal Harouni
http://opendz.org