Re: [kernel-hardening] Re: [PATCH] move RLIMIT_NPROC check from set_user() to do_execve_common()

From: Solar Designer
Date: Thu Jul 21 2011 - 08:48:52 EST

On Thu, Jul 21, 2011 at 02:09:36PM +1000, NeilBrown wrote:
> On Fri, 15 Jul 2011 15:54:43 -0400 Stephen Smalley <sds@xxxxxxxxxxxxx> wrote:
> > On Fri, 2011-07-15 at 19:26 +0400, Vasiliy Kulikov wrote:
> > > On Fri, Jul 15, 2011 at 09:58 -0400, Stephen Smalley wrote:
> > > > Does this have implications for Android's zygote model? There you have
> > > > a long running uid 0 / all caps process (the zygote), which forks itself
> > > > upon receiving a request to spawn an app and then calls
> > >
> > > > setgroups();
> > > > setrlimit(); setgid(); setuid();
> > >
> > > Is RLIMIT_NPROC forced in your model and setuid() is expected to fail
> > > because of NPROC exceeding? If no, then it is not touched at all.
> >
> > I don't know what their intent is. But it is an example of a case where
> > moving the enforcement from setuid() to a subsequent execve() causes the
> > check to never get applied. As to whether or not they care, I don't
> > know. An app that calls fork() repeatedly will still be stopped, but an
> > app that repeatedly connects to the zygote and asks to spawn another
> > instance of itself would be unlimited.
> >
> > OTOH, the current RLIMIT_NPROC check and lack of setuid() error checking
> > has been a repeated issue for Android.
> So where does this leave us? Between a rock and a hard place?

Maybe. I just took a look at Android's Zygote code, as found via Google
Code Search. This appears to be the relevant place:

As we can see, Stephen's description of the sequence of calls is indeed
correct. Also, the return value from setuid() is checked here. :-)
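
For illustration only, here is a minimal user-space sketch of that call
sequence with every return value checked. It is not taken from the Zygote
source: the helper name drop_identity() and its parameters are hypothetical,
and passing NULL to skip a step is a convenience of this sketch.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <grp.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from Zygote): performs
 *   setgroups(); setrlimit(); setgid(); setuid();
 * in that order and fails as soon as any step fails.
 * A NULL groups or nproc pointer skips the corresponding step.
 * Returns 0 on success, -1 on failure with errno set.
 */
static int drop_identity(const gid_t *groups, size_t ngroups,
                         const struct rlimit *nproc,
                         gid_t gid, uid_t uid)
{
    if (groups != NULL && setgroups(ngroups, groups) != 0)
        return -1;
    if (nproc != NULL && setrlimit(RLIMIT_NPROC, nproc) != 0)
        return -1;
    if (setgid(gid) != 0)
        return -1;
    /* This is the call whose failure is so often ignored. */
    if (setuid(uid) != 0)
        return -1;
    /* Double-check that the IDs really are what was requested. */
    if (getuid() != uid || geteuid() != uid ||
        getgid() != gid || getegid() != gid) {
        errno = EPERM;
        return -1;
    }
    return 0;
}
```

A forked child would call this before doing any work on behalf of the app
and _exit() on a non-zero return, rather than continuing with its old
credentials.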

As to which rlimits are actually set, I don't know. This appears to be
configured at a much higher level:

--rlimit=r,c,m tuple of values for setrlimit() call.

I have no idea whether this --rlimit argument is ever supplied and with
what settings. The intent of supporting it could well be other than
making use of RLIMIT_NPROC specifically.

> It says to me that moving the check from set_user to execve is simply
> Wrong(TM). It may be convenient and do TheRightThing in certain common
> cases, but it also can do the Wrong thing in other cases and I don't think
> that is an acceptable trade off.

I disagree. There's nothing wrong in having the check on execve(),
especially not if we combine it with a check of your proposed flag set
on setuid() (only fail execve() when the flag is set and RLIMIT_NPROC is
still or again exceeded).

> Having setuid succeed when it should fail is simply incorrect.

As far as I'm aware, no standard says that setuid() should fail if it
would exceed RLIMIT_NPROC for the target user. There's a notion of
"appropriate privileges", but what these are is implementation-defined
and it was hardly meant to include rlimits.

> The problem - as we all know - is that user space doesn't always check error
> status properly. If we were to look for precedent I would point to SIGPIPE.
> The only reason for that to exist is because programs don't always check that
> a "write" succeeds so we have a mechanism to kill off processes that don't
> check the error status and keep sending.
> I would really like to apply that to this problem ... but that has already
> been suggested (years ago) and found wanting. Maybe we can revisit it?
> The value of the SIGPIPE approach (whether it is SIGPIPE or SIGXCPU or
> SIGVEC ... if only there were a SIGXNPROC) is that the program remains in
> complete control. It can find out if the NPROC limit has been exceeded at
> the "right" time.

I don't mind having setuid() signal (and by default actually kill) a
process if it would exceed RLIMIT_NPROC. However, I doubt that others
will agree. BTW, as I told Vasiliy on the kernel-hardening list, I
think we should revisit the "can't happen" memory allocation failure in
set_user() _after_ we have dealt with the RLIMIT_NPROC issue. I would
support killing the process with SIGKILL or SIGSEGV (as opposed to
returning -EAGAIN) on that "can't happen" condition (which might become
possible in a further revision of the code, so better safe than sorry).
Let's actually revisit this later, after having the most important fix
applied.

> The only other solution that I can think of which isn't "Wrong(TM)" is my
> first patch which introduced PF_SETUSER_FAILED.
> With this patch setuid() still fails if it should, so the calling process
> still remains in control. But if it fails to exercise that control, the
> kernel steps in.
> Vasiliy didn't like that because it allows a process to ignore the setuid
> failure, perform some in-process activity as root when expecting it to be as
> some-other-user, and only fails when execve is attempted - possibly too late.

I am with Vasiliy on this.

> Against this I ask: what exactly is our goal here?
> Is it to stop all possible abuses? I think not. That is impossible.
> Is it to stop certain known and commonly repeated errors? I think so. That
> is clearly valuable.

Not checking the return value from setuid() and proceeding to do other
work is a known and commonly repeated error. As to whether it is also
common for that other work to involve risky syscalls other than execve(),
I expect that it is, although I did not research this.

> We know (Thanks to Solar Designer's list) that unchecked setuid followed by
> execve is a commonly repeated error, so trapping that can be justified.
> Do we know that accessing the filesystem first is a commonly repeated error?
> If not, there is no clear motive to deal with that problem.
> If, however, it is then maybe a check for PF_SETUSER_FAILED in
> inode_permission() would be the right thing.
> Or maybe we just set PF_SETUSER_FAILED and leave it up to some security
> module to decide what to disable or report when that is set?

I feel that we'd be inventing something more complicated yet worse than
simply moving the check would be.

Here's my current proposal:

1. Apply Vasiliy's patch to move the RLIMIT_NPROC check from setuid() to
execve(), optionally enhanced with setting PF_SETUSER_FAILED on
would-be-failed setuid() and checking this flag in execve() (in addition
to repeating the RLIMIT_NPROC check).

2. With a separate patch, add a prctl() to read the PF_SETUSER_FAILED flag.
Android will be able to use this if it wants to.
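
To make the intended semantics concrete, here is a toy user-space model of
the two steps above. It is not kernel code and not Vasiliy's or Neil's
patch: the struct layouts, the model_* function names and the flag value are
made up for illustration, and the exemptions the real check has for root are
left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Flag name from the thread; the value here is arbitrary. */
#define PF_SETUSER_FAILED 0x1u

struct model_user {
    unsigned long processes;    /* processes currently owned by this user */
};

struct model_task {
    struct model_user *user;    /* current owner */
    unsigned long rlimit_nproc; /* this task's RLIMIT_NPROC */
    unsigned int flags;
};

/* setuid() never fails on RLIMIT_NPROC; it only records the overrun. */
static int model_setuid(struct model_task *t, struct model_user *new_user)
{
    t->user->processes--;
    new_user->processes++;
    t->user = new_user;
    if (new_user->processes > t->rlimit_nproc)
        t->flags |= PF_SETUSER_FAILED;
    else
        t->flags &= ~PF_SETUSER_FAILED;
    return 0;
}

/* execve() fails only if the flag is set AND the limit is still exceeded. */
static int model_execve(struct model_task *t)
{
    if ((t->flags & PF_SETUSER_FAILED) &&
        t->user->processes > t->rlimit_nproc)
        return -EAGAIN;
    t->flags &= ~PF_SETUSER_FAILED;
    return 0;
}

/* Stand-in for the proposed prctl(): lets the process read the flag. */
static bool model_setuser_failed(const struct model_task *t)
{
    return (t->flags & PF_SETUSER_FAILED) != 0;
}
```

In this model a process that never calls execve() (the Zygote case) sees no
failure from setuid(), but it can query the flag right after setuid() and
bail out on its own.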

Yes, this might break RLIMIT_NPROC for Android (I wrote "might" because
I have no idea if it actually sets that specific limit or not) until it
learns to use the new prctl(). But I think that's fine, and it is not a
reason for us to introduce more complexity into the kernel, yet make our
security hardening change more limited. There was never a guarantee (in
any standard or piece of documentation) that setuid() would fail on
exceeding RLIMIT_NPROC, and the Android/Zygote code might not actually
rely on that anyway (there's no clear indication that it does;
RLIMIT_NPROC is not in the code, nor is it mentioned in any comment).

> In short: I don't think there can be a solution that is both completely
> correct and completely safe. I would go for "as correct as possible" with
> "closes common vulnerabilities".

Maybe, and if so I think that the one I proposed above falls into this
category as well, but it closes more vulnerabilities (and/or does so
more fully).

