Re: ioctl CAP_LINUX_IMMUTABLE is checked in the wrong namespace

From: Marian Marinov
Date: Tue Apr 29 2014 - 20:11:42 EST


On 04/30/2014 03:01 AM, Stéphane Graber wrote:
On Tue, Apr 29, 2014 at 04:51:54PM -0700, Andy Lutomirski wrote:
On Tue, Apr 29, 2014 at 4:47 PM, Stéphane Graber <stgraber@xxxxxxxxxx> wrote:
On Tue, Apr 29, 2014 at 04:22:55PM -0700, Andy Lutomirski wrote:
On Tue, Apr 29, 2014 at 4:20 PM, Marian Marinov <mm@xxxxxx> wrote:
On 04/30/2014 01:45 AM, Andy Lutomirski wrote:

On 04/29/2014 03:29 PM, Serge Hallyn wrote:

Quoting Marian Marinov (mm-108MBtLGafw@xxxxxxxxxxxxxxxx):

On 04/30/2014 01:02 AM, Serge Hallyn wrote:

Quoting Marian Marinov (mm-108MBtLGafw@xxxxxxxxxxxxxxxx):

On 04/29/2014 09:52 PM, Serge Hallyn wrote:

Quoting Theodore Ts'o (tytso-3s7WtUTddSA@xxxxxxxxxxxxxxxx):

On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:


I'm proposing a fix to this, by replacing the
capable(CAP_LINUX_IMMUTABLE)
check with ns_capable(current_cred()->user_ns,
CAP_LINUX_IMMUTABLE).


Um, wouldn't it be better to simply fix the capable() function?

/**
* capable - Determine if the current task has a superior
capability in effect
* @cap: The capability to be tested for
*
* Return true if the current task has the given superior
capability currently
* available for use, false if not.
*
* This sets PF_SUPERPRIV on the task if the capability is
available on the
* assumption that it's about to be used.
*/
bool capable(int cap)
{
return ns_capable(&init_user_ns, cap);
}
EXPORT_SYMBOL(capable);

The documentation states that it is for "the current task", and I
can't imagine any use case, where user namespaces are in effect,
where
using init_user_ns would ever make sense.


the init_user_ns represents the user_ns owning the object, not the
subject.

The patch by Marian is wrong. Anyone can do 'clone(CLONE_NEWUSER)',
setuid(0), execve, and end up satisfying
'ns_capable(current_cred()->userns,
CAP_SYS_IMMUTABLE)' by definition.

So NACK to that particular patch. I'm not sure, but IIUC it should
be
safe to check against the userns owning the inode?


So what you are proposing is to replace
'ns_capable(current_cred()->userns, CAP_SYS_IMMUTABLE)' with
'inode_capable(inode, CAP_SYS_IMMUTABLE)' ?

I agree that this is more sane.


Right, and I think the two operations you're looking at seem sane
to allow.


If you are ok with this patch, I will fix all file systems and send
patches.


Sounds good, thanks.

Signed-off-by: Marian Marinov <mm-NV7Lj0SOnH0@xxxxxxxxxxxxxxxx>


Acked-by: Serge E. Hallyn
<serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@xxxxxxxxxxxxxxxx>


Wait, what?

Inodes aren't owned by user namespaces; they're owned by users. And any
user can arrange to have a user namespace in which they pass an
inode_capable check on any inode that they own.

Presumably there's a reason that CAP_SYS_IMMUTABLE is needed. If this
gets merged, then it would be better to just drop CAP_SYS_IMMUTABLE
entirely.


The problem I'm trying to solve is this:

container with its own user namespace and CAP_SYS_IMMUTABLE should be able
to use chattr on all files witch this container has access to.

Unfortunately with the capable(CAP_SYS_IMMUTABLE) check this is not working.

With the proposed two fixes CAP_SYS_IMMUTABLE started working in the
container.

The first solution got its user namespace from the currently running process
and the second gets its user namespace from the currently opened inode.

So what would be the best solution in this case?

I'd suggest adding a mount option like fs_owner_uid that names a uid
that owns, in the sense of having unlimited access to, a filesystem.
Then anyone with caps on a namespace owned by that uid could do
whatever.

Eric?

--Andy

The most obvious problem I can think of with "do whatever" is that this
will likely include mknod of char and block devices which you can then
chown/chmod as you wish and use to access any devices on the system from
an unprivileged container.
This can however be mitigated by using the devices cgroup controller.

Or 'nodev'. setuid/setgid may have the same problem, too.

Implementing something like this would also make CAP_DAC_READ_SEARCH
and CAP_DAC_OVERRIDE work.

Arguably it should be impossible to mount such a thing in the first
place without global privilege.


You also probably wouldn't want any unprivileged user from the host to
find a way to access that mounted filesytem but so long as you do the
mount in a separate mountns and don't share uids between the host and
the container, that should be fine too.

This part should be a nonissue -- an unprivileged user who has the
right uid owns the namespace anyway, so this is the least of your
worries.

--Andy

It should be a nonissue so long as we make sure that a file owned by a
uid outside the scope of the container may not be changed even though
fs_owner_uid is set. Otherwise, it's just a matter of chmod +S on say
a shell and anyone who can see the fs from the host will be getting a
root shell (assuming said file is owned by the host's uid 0).

So that's restricting slightly what "do whatever" would do in this case.


In my case I give an LVM volume to each container and limit the container to only this block device using the devices cgroup.
So the inode_capable() fix worked like a charm for me.
The container can not see any filesystem other then its own.
And I have another patch for my kernel that prohibits setns from cgroup other then / which prevents programs from the container from getting out. clone() can be used to create new namespaces but can not be used to attach to already created namespaces.

Marian


--
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: hackman@xxxxxxxxxx
ICQ: 7556201
Mobile: +359 886 660 270
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/