Re: [PATCH] User namespace: don't allow sysctl in non-init user ns

From: Serge E. Hallyn
Date: Wed Sep 21 2011 - 09:15:22 EST


Quoting Miquel van Smoorenburg (mikevs@xxxxxxxxxx):
> On Thu, 2011-09-15 at 14:48 -0500, Serge E. Hallyn wrote:
> > sysctl.c has its own custom uid check, which is not user namespace
> > aware. As discovered by Richard, that allows root in a container
> > privileged access to set all sysctls.
> >
> > To fix that, just refuse access if current is not in init_user_ns. We
> > may at some point want to relax that check so that some sysctls are
> > allowed - for instance dmesg_restrict when syslog is containerized.
> >
> > Signed-off-by: Serge Hallyn <serge.hallyn@xxxxxxxxxxxxx>
> > Cc: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
> > Cc: Vasiliy Kulikov <segoon@xxxxxxxxxxxx>
> > Cc: richard@xxxxxx
> > ---
> > kernel/sysctl.c | 2 ++
> > 1 files changed, 2 insertions(+), 0 deletions(-)
> >
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index 11d65b5..f2b42e2 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -1697,6 +1697,8 @@ void register_sysctl_root(struct ctl_table_root *root)
> >
> > static int test_perm(int mode, int op)
> > {
> > + if (current_user_ns() != &init_user_ns)
> > + return -EACCES;
> > if (!current_euid())
> > mode >>= 6;
> > else if (in_egroup_p(0))
>
> I haven't tested it, but it looks like this denies access to /proc/sys
> completely, right ?

True. Good point.

> Wouldn't it be better to make access read-only ?

It'd be better, yes. For the moment we're trying to focus on making sure
there are no leaks out of the user namespace, so that then we can start
relaxing. But I think you're right, this is easy enough to do right
from the start.

> For example glibc reads from several /proc/sys/... files for sysconf()
> and pathconf(). That would fail with this patch I think ?
>
> Something like
>
> --- a/kernel/sysctl.c.orig 2011-06-24 00:24:26.000000000 +0200
> +++ b/kernel/sysctl.c 2011-09-21 11:40:42.961291629 +0200
> @@ -1892,10 +1892,15 @@
>
> static int test_perm(int mode, int op)
> {
> - if (!current_euid())
> - mode >>= 6;
> - else if (in_egroup_p(0))
> - mode >>= 3;
> + if (current_user_ns() != &init_user_ns) {
> + if (op & MAY_WRITE)
> + return -EACCES;
> + } else {
> + if (!current_euid())
> + mode >>= 6;
> + else if (in_egroup_p(0))
> + mode >>= 3;
> + }
> if ((op & ~mode & (MAY_READ|MAY_WRITE|MAY_EXEC)) == 0)
> return 0;
> return -EACCES;

How about the following, slightly changed? It tries to more precisely
follow the rule that access from another (non-ancestor) user-ns is
checked as access from the overflow userid (-1).

if (current_user_ns() == &init_user_ns) {
if (!current_euid())
mode >>= 6;
else if (in_egroup_p(0))
mode >>= 3;
}
if ((op & ~mode & (MAY_READ|MAY_WRITE|MAY_EXEC)) == 0)
return 0;
return -EACCES;

Does that make sense? Yes it does mean that any world-writeable
files are writeable from a container, but those are the rules
we'd earlier defined. (File creation (directory write) has to be
a special exception only because we don't have a valid owner to
assign to the new file).

Thanks for pointing this out!

-serge
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/