Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

From: James Bottomley
Date: Sat May 14 2016 - 09:47:16 EST


On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:
> On Thu, May 12, 2016 at 03:24:12PM -0700, James Bottomley wrote:
> > On Thu, 2016-05-12 at 20:55 +0100, Djalal Harouni wrote:
> > > On Wed, May 11, 2016 at 11:33:38AM -0700, James Bottomley wrote:
> > > > On Wed, 2016-05-11 at 17:42 +0100, Djalal Harouni wrote:
> [...]
>
> > > >
> > > > The credentials are per thread, so it's a standard way of doing
> > > > credential shifting and no other threads of execution in the
> > > > same
> > > > task get access. As long as you bound the
> > > > override_creds()/revert_c
> > > > reds() pairs within the kernel, you're safe.
> > >
> > > No, and here sorry I mean shifted.
> > >
> > > current_fsuid() is global through all fs operations which means
> > > it
> > > crosses user namespaces... it was safe the days of only
> > > init_user_ns,
> > > not anymore... You give a mapping inside containers to fsuid
> > > where
> > > they don't want to have it... this allows to operate on inodes
> > > inside
> > > other containers... update current_fsuid() even if we want that
> > > user
> > > to be nobody inside the container... and later it can access the
> > > inodes of the shifted fs... and by same current of course...
> >
> > OK, I still don't understand what you're getting at. There are
> > three
> > per-thread uids: uid, euid and fsuid (real, effective and
> > filesystem).
> > They're all either settable via syscall or inherited on fork.
> > They're
> > all kernel side, meaning they're kuid_t. Their values stay
> > invariant
> > as you move through namespaces. They change (and get mapped
> > according
> > to the current user namespace setting) when you call set[fe]uid()
> > So
> > when I enter a user namespace with mapping
> >
> > 0 100000 1000
> >
> > and call setuid(0) (which sets all three). they all pick up the
> > kuid_t
> > of 100000. This means that writing a file inside the user
> > namespace
> > after calling setuid(0) appears as real uid 100000 on the medium
> > even
> > though if I call getuid() from the namespace, I get back 0. What
> > shiftfs does is hijack temporarily the kernel fsuid/fsgid for
> > permission checks, so you can remap to any old uid on the medium
> > (although usually you'd pass in uidmap=0:100000:1000") it maps back
> > from kuid_t 100000 to kuid_t 0, which is why the container can now
> > read
> > and write the underlying medium at on-media id 0 even through root
> > inside the container has kuid_t 100000. There's no permanent
> > change of
> > fsuid and it stays at its invariant value for the thread except as
> > a
> > temporary measure to do the permission checks on the underlying of
> > the
> > shifted filesystem.
> >
> > > > > The worst thing is that current_fsuid() does not follow now
> > > > > the
> > > > > /proc/self/uid_map interface! this is a serious vulnerability
> > > > > and
> > > > > a mix of the current semantics... it's updated but using
> > > > > other
> > > > > rules...?
> > > >
> > > > current_fsuid() is aready mapped via the userns; it's already a
> > > > kuid_t at its final value. Shifting that is what you want to
> > > > remap
> > > > underlying volume uid/gid's. The uidmap/gidmap inputs to this
> > > > are
> > > > shifts on the final underlying uid/gids.
> > >
> > > => some points:
> > > Changing setfsuid() its interfaces and rules... or an indrect way
> > > to
> > > break another syscall...
> >
> > There is no change to setfsuid().
> >
> > > The userns used for *mapping* is totatly different and not
> > > standard..
> > > . losing "init_user_ns and its decendents userns *semantics*...",
> > > a
> > > yet a totatly unlinked mapping...
> >
> > There is no user namespace mapping at all. This is a simple shift,
> > kernel side, of uids and gids at their kuid_t values.
> >
> > > Breaking current_uid(),current_euid(),current_fsuid() which are
> > > mapped but in *different* user namespaces... hence different
> > > values
> > > inside namespaces... you can change your userns mapping but that
> > > current_fsuid specific one will always be remapped to some other
> > > value inside even if you don't want it... It crosses user
> > > namespaces... uid and euid are remapped according to
> > > /proc/self/uid_
> > > map, fsuid is remapped according to this new interface...
> > >
> > > Hard coding the mapping, nested containers/apps may *share* fsuid
> > > and
> > > can't get rid of it even if they change the inside userns mapping
> > > to
> > > disable, split, reduce mapped users or offer better isolation
> > > they
> > > can't... no way to make private inodes inside containers if they
> > > share the final fsuid, inside container mapping is ignored...
> > > ...
> >
> > OK, I think there's a misunderstanding about how credential
> > overrides
> > work. They're not permanent changes to the credentials, they're
> > temporary ones to get stuff done within the kernel at a temporary
> > privilege. You can make credentials permanent if you go through
> > prepare_creds()/commit_creds(), but for making them temporary you
> > do
> > prepare_creds()/override_creds() and then revert_creds() once
> > you're
> > done using them.
> >
> > If you want to see a current use of this, try fs/open.c:faccessat.
> > What it's doing is temporarily overriding fsuid with the real uid
> > to
> > check the permissions before reverting the credentials and
> > returning to
> > the user.
>
> Thank you for explaining things, but I think you should take the time
> to read this RFC and understand some problems. This is a quick dump
> of some problems that it avoids...:

I did. The problem is how to get the userns to read and write files at
the interior not the exterior id. Your solution is to thread the
mapping through the VFS and even on to the filesystems themselves to
get the mount option. I already commented that this is a bit ugly and
couldn't it be encapsulated in a filesystem. The way I approached the
problem is from the base that I do have build container roots with
shifted uid/gids because I installed them that way. So, if it already
works, one possible solution is to have a filesystem which does the
shift and mounts the shifted root somewhere in the mount tree for the
namespace to access. The point about doing it this way is that the
filesystem that does it needs no user namespace knowledge. All it does
is remap from one on disk id to another using a map function. How it
gets the map was left up to the admin in the implementation.

> In this series we don't hijack setfsuid() in an indirect way,
> setfsuid maps UIDs into current userns according to rules set by
> parent. Changing current_fsuid() to some other mapping is a way to
> allow processes to bypass that and use it to access other inodes...
> This should not change and fsuid should continue to follow these
> rules...

Both solutions do this

> A cred->fsuid solution is safe or used to be safe only inside
> init_user_ns where there is always a mapping or in context of current
> user namespace. In an other user namespace with 0:1000:1 mapping,
> you can't set it to arbitrary mapping like 0:4000:1... It will give
> confined processes access to inodes that satisfy the kuid_t 4000
> mapping and which the app/container wants to deny, they only want
> 0:1000:1. ..

OK, so both solutions are safe here too. Your safety comes from only
remapping in the userns; mine comes from the normal filesystem acl
rules: either the userns for different users all have disjoint ids
regulated by /etc/subuidmap or they're all using the same one (like
docker 1.10) in either case, you could regulate by having the mount
under a directory which is accessible only to the userns owner.

> We don't cross user namespaces, we don't use different mappings for
> cred->uid, cred->fsuid... A clean solution is to shift inodes
> UID/GID and not change fsuid to cross namespaces. Not to mention how
> it may interact with capabilities...

This is a subjective question on what constitutes "clean". I think we
both think the other solution isn't clean, so that's for others to
adjudicate.

> We follow user namespace rules and we keep "the parent defines a
> range that the children can't escape" semantics. There is a clear
> relation between user namespaces that should not be broken.

OK, so I separated the problem into a userns one, which remaps for the
processes in user space, and a vfs one which remaps the on-disk id.
However, they could be combined by allowing the userns to mount
shiftfs but only on designated filesystems and setting the uidmappings
to the same ones as the userns.

> We explicitly don't define a new user namespace mapping nor add a new
> interface for the simple reason it's: *too complicated*. We can do
> that, but no thanks! May be in future if there is a real need or
> things are clear... The current user namespace interface is getting
> standard and stable, so we just keep it that way and make it
> consistant inside VFS.

I don't accept the too complicated point. For fully unprivileged
containers, the host admin already has to set up the subuid/subgid map
files which is most of the complexity. Once that's done, the same maps
can be used to shift mount. Once it's all set up, no further
intervention is required.

> We give VFS control of that, and we make mount namespaces the central
> part of this whole logic.

Right, that's what causes the logic to thread throughout the entire vfs
and into the fs layer. The fundamental point of difference is that I'd
like a solution which encapsulates the problem rather than exposing it
to the vfs.

> We make admins life easier where they can pull container images, root
> filesystems from containers/apps hubs... verify the signature and
> start them with different mappings according to host resources... We
> don't want them to do anything. The design was planned to make it
> easier for users, it should work out of the box, and it can be used
> to handle complex stuff too, since it's flexible.

Either works easily for users. Setting stuff up is always the job of
the admin in both solutions.

> Able to support most filesystems including on-disk filesystems
> natively.

Shiftfs does this. More importantly it supports subtrees, so I can
unpack an image root on to an existing filesystem and remap it into a
container.

> Able to support disk quota according to the shifted UID/GID on-disk
> values. Especially during inode creation...

Quota can be shifted, I just wasn't sure it was necessary. If the
usual use case is for unpacked roots, chances are you want the
remapping to use the group quota of the userns owner, which they'd get
naturally so, while it's possible to remap projid, I didn't think it
needed to be done.

> Able to support ACL if requested.

Both do this.

> The user namespace mapping is kept a runtime configure option, we
> don't pin a special mapping at any time, and of course parent creator
> of user namespace is the one that can manipulate it, at the same time
> the mapping is restricted according to grandpa rules and so on...
>
> It allows unprivileged to use the VFS UID/GID shift without the
> intervention of a privileged process each time. The real privileged
> process sets the filesystem and the mount namespace the first time,
> then it should work for all nested namespaces and containers. It does
> not need the intervation of init_user_ns root to set the mapping and
> make it work, you don't have to go in and go out to setup the thing,
> etc.

Both solutions work like this. When I use this for shifted roots of
emulation containers, it's set up once at start of day. I then build
the containers unprivileged using newsubuid/newsubgid as I'm using
them. Once the shifts are done at start of day, no other admin support
is required.

James


> We don't do this on behalf of filesystems, they should explicitly
> support it. procfs and other host resource virtual filesystems are
> safe and currently they don't need shifting.
>
> We try to fix the problem where it should be fixed, and not hide
> it...