Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

From: James Bottomley
Date: Thu May 12 2016 - 18:24:21 EST


On Thu, 2016-05-12 at 20:55 +0100, Djalal Harouni wrote:
> On Wed, May 11, 2016 at 11:33:38AM -0700, James Bottomley wrote:
> > On Wed, 2016-05-11 at 17:42 +0100, Djalal Harouni wrote:
> > > On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
> [...]
> > > Hmm anyway you are mounting this on behalf of filesystems, so if
> > > you
> > > add the recursive thing, you will just probably make everything
> > > worse, by making any /proc, /sys dentry that's under that path
> > > shiftable, and unprivileged users can just create user namespaces
> > > and
> > > read /proc/* and all the other stuff that doesn't have capable()
> > > related to the init_user_ns host...
> >
> > That's up to the admin who does the shifting. Recursive would be
> > an
> > option if added.
>
> Hmm, not sure if you get my point... you just made it an admin
> problem where admins want to mount an image downloaded verify it and
> use it for their container with /proc...! that's another problem!

You can't allow unprivileged containers to shift uids on arbitrary
filesystems, so the admin always has to do something for the initial
setup.

> > > what if you have paths like /filesystem0/uidshiftedY/dir,
> > > /filesystem0/uidshiftedX/dir , /filesystem0/notshifted/dir
> > > where some of them are also bind mounts that point to same dentry
> > > ?
> >
> > Without recursive semantics, you see the underlying inode. With
> > them, you see the upper vfsmnts. Shiftfs isn't idempotent, so you
> > would need to be careful about nesting. However, that's an admin
> > problem.
> >
> > > Also, you create a totally new user namespace interface here! by
> > > making your own new interface we just lose the notion of
> > > init_user_ns and its children and mapping ?
> >
> > I don't quite understand this; the only use of the init_user_ns is
> > the capable(CAP_SYS_ADMIN) in fill_super which is how only the real
> > admin can mount at a shifted uid/gid. Otherwise, there's no need
> > to see into the userns because filesystems see the kuid_t/kgid_t
> > which is what I'm shifting.
> >
> > > I'm not sure of the implication of all this... your user
> > > namespace mapping is not related at all to init_user_ns! it seems
> > > that it has its own init_user_ns ? does a capable() check now
> > > on a shifted filesystem relates to that and hence to your mapping
> > > or to the real init_user_ns ?
> >
> > capable(CAP_SYS_ADMIN) == ns_capable(&init_user_ns, CAP_SYS_ADMIN)
> >
> > Or is there a misunderstanding here about how user namespaces work
> > inside the kernel? The design is that the ID shift is done as you
> > cross the kernel boundary, so a filesystem, being usually all in
> > -kernel operating via the VFS interfaces, ideally never needs to
> > make any from_kuid/make_kuid calls. However, there are ways
> > filesystems can send data across the kernel/user bounary outside of
> > the usual vfs interfaces (ioctls being the most usual one) so in
> > that specific code, they have to do the kuid_t to uid_t changes
> > themselves. Shiftfs never sends data to the user outside of the
> > VFS so it never needs to do this and can operate entirely on
> > kuid_ts.
> >
> > > > There's a bit of an open question of whether it should have vfs
> > > > changes: the way the struct file f_inode and f_ops are hijacked
> > > > is a bit nasty and perhaps d_select_inode() could be made a bit
> > > > cleverer to help us here instead.
> > >
> > > I'm not sure if this PoC works... but you sure you didn't
> > > introduce a serious vulnerability here ? you use a new mapping
> > > and you update current_fsuid() creds up, which is global on any
> > > fs operation, so may be: lets operate on any inode, update our
> > > current_fsuid()... and access the rest of *unshifted filesystems*
> > > ... !?
> >
> > The credentials are per thread, so it's a standard way of doing
> > credential shifting and no other threads of execution in the same
> > task get access. As long as you bound the override_creds()/revert_c
> > reds() pairs within the kernel, you're safe.
>
> No, and here sorry I mean shifted.
>
> current_fsuid() is global through all fs operations which means it
> crosses user namespaces... it was safe the days of only init_user_ns,
> not anymore... You give a mapping inside containers to fsuid where
> they don't want to have it... this allows to operate on inodes inside
> other containers... update current_fsuid() even if we want that user
> to be nobody inside the container... and later it can access the
> inodes of the shifted fs... and by same current of course...

OK, I still don't understand what you're getting at. There are three
per-thread uids: uid, euid and fsuid (real, effective and filesystem).
They're all either settable via syscall or inherited on fork. They're
all kernel side, meaning they're kuid_t. Their values stay invariant
as you move through namespaces. They change (and get mapped according
to the current user namespace setting) when you call set[fe]uid() So
when I enter a user namespace with mapping

0 100000 1000

and call setuid(0) (which sets all three). they all pick up the kuid_t
of 100000. This means that writing a file inside the user namespace
after calling setuid(0) appears as real uid 100000 on the medium even
though if I call getuid() from the namespace, I get back 0. What
shiftfs does is hijack temporarily the kernel fsuid/fsgid for
permission checks, so you can remap to any old uid on the medium
(although usually you'd pass in uidmap=0:100000:1000") it maps back
from kuid_t 100000 to kuid_t 0, which is why the container can now read
and write the underlying medium at on-media id 0 even through root
inside the container has kuid_t 100000. There's no permanent change of
fsuid and it stays at its invariant value for the thread except as a
temporary measure to do the permission checks on the underlying of the
shifted filesystem.

> > > The worst thing is that current_fsuid() does not follow now the
> > > /proc/self/uid_map interface! this is a serious vulnerability and
> > > a mix of the current semantics... it's updated but using other
> > > rules...?
> >
> > current_fsuid() is aready mapped via the userns; it's already a
> > kuid_t at its final value. Shifting that is what you want to remap
> > underlying volume uid/gid's. The uidmap/gidmap inputs to this are
> > shifts on the final underlying uid/gids.
>
> => some points:
> Changing setfsuid() its interfaces and rules... or an indrect way to
> break another syscall...

There is no change to setfsuid().

> The userns used for *mapping* is totatly different and not standard..
> . losing "init_user_ns and its decendents userns *semantics*...", a
> yet a totatly unlinked mapping...

There is no user namespace mapping at all. This is a simple shift,
kernel side, of uids and gids at their kuid_t values.

> Breaking current_uid(),current_euid(),current_fsuid() which are
> mapped but in *different* user namespaces... hence different values
> inside namespaces... you can change your userns mapping but that
> current_fsuid specific one will always be remapped to some other
> value inside even if you don't want it... It crosses user
> namespaces... uid and euid are remapped according to /proc/self/uid_
> map, fsuid is remapped according to this new interface...
>
> Hard coding the mapping, nested containers/apps may *share* fsuid and
> can't get rid of it even if they change the inside userns mapping to
> disable, split, reduce mapped users or offer better isolation they
> can't... no way to make private inodes inside containers if they
> share the final fsuid, inside container mapping is ignored...
> ...

OK, I think there's a misunderstanding about how credential overrides
work. They're not permanent changes to the credentials, they're
temporary ones to get stuff done within the kernel at a temporary
privilege. You can make credentials permanent if you go through
prepare_creds()/commit_creds(), but for making them temporary you do
prepare_creds()/override_creds() and then revert_creds() once you're
done using them.

If you want to see a current use of this, try fs/open.c:faccessat.
What it's doing is temporarily overriding fsuid with the real uid to
check the permissions before reverting the credentials and returning to
the user.

James

> > the privileged ids down to 100000, but I have a volume which still
> > has realids, I can mount that volume using shiftfs with
> > uidmap=0:100000:1000 and it will allow this userns to read and
> > write the volume through its remapped ids.
> >
> > > For overlayfs I did write an expriment but for me it's not an
> > > overlayfs or another new filesystem problem... we are
> > > manipulating UID/GID identities...
> > >
> > > It would have been better if you did send this as a separate
> > > thread. It was a vfs:userns RFC fix which if we continue we turn
> > > it into a complicated thing! implement another new light
> > > filesystem with userns... (overlayfs...)
> > >
> > > Will follow up if the appropriate thread is created, not here, I
> > > guess it's ok ?
> >
> > Well, I can resend the patch as a separate thread when I've fixed
> > some of the problems viro pointed out.
> >
> > James
> >
> > > > James
> > > >
> > >
> > > Thank you for your feedback!
> > >
> > >
> >
>
> Thanks!
>