Re: [PATCH 6/7] exec: Move most of setup_new_exec into flush_old_exec

From: Eric W. Biederman
Date: Thu May 07 2020 - 17:54:58 EST


Kees Cook <keescook@xxxxxxxxxxxx> writes:

> On Tue, May 05, 2020 at 02:45:33PM -0500, Eric W. Biederman wrote:
>>
>> The current idiom for the callers is:
>>
>> flush_old_exec(bprm);
>> set_personality(...);
>> setup_new_exec(bprm);
>>
>> In 2010 Linus split flush_old_exec into flush_old_exec and
>> setup_new_exec. With the intention that setup_new_exec be what is
>> called after the processes new personality is set.
>>
>> Move the code that doesn't depend upon the personality from
>> setup_new_exec into flush_old_exec. This is to facilitate future
>> changes by having as much code together in one function as possible.
>
> Er, I *think* this is okay, but I have some questions below which
> maybe you already investigated (and should perhaps get called out in
> the changelog).

I intend to the following text to the changelog. At this point I
believe I have read through everything and nothing raises any concerns
for me:

--- text begin ---

To see why it is safe to move this code please note that effectively
this change moves the personality setting in the binfmt and the following
three lines of code after everything except unlocking the mutexes:
arch_pick_mmap_layout
arch_setup_new_exec
mm->task_size = TASK_SIZE

The function arch_pick_mmap_layout at most sets:
mm->get_unmapped_area
mm->mmap_base
mm->mmap_legacy_base
mm->mmap_compat_base
mm->mmap_compat_legacy_base
which nothing in flush_old_exec or setup_new_exec depends on.

The function arch_setup_new_exec only sets architecture specific
state and the rest of the functions only deal in state that applies
to all architectures.

The last line just sets mm->task_size and again nothing in flush_old_exec
or setup_new_exec depend on task_size.

--- text end ---

>>
>> Ref: 221af7f87b97 ("Split 'flush_old_exec' into two functions")
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
>> ---
>> fs/exec.c | 85 ++++++++++++++++++++++++++++---------------------------
>> 1 file changed, 44 insertions(+), 41 deletions(-)
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 8c3abafb9bb1..0eff20558735 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1359,39 +1359,7 @@ int flush_old_exec(struct linux_binprm * bprm)
>> * undergoing exec(2).
>> */
>> do_close_on_exec(me->files);
>> - return 0;
>> -
>> -out_unlock:
>> - mutex_unlock(&me->signal->exec_update_mutex);
>> -out:
>> - return retval;
>> -}
>> -EXPORT_SYMBOL(flush_old_exec);
>> -
>> -void would_dump(struct linux_binprm *bprm, struct file *file)
>> -{
>> - struct inode *inode = file_inode(file);
>> - if (inode_permission(inode, MAY_READ) < 0) {
>> - struct user_namespace *old, *user_ns;
>> - bprm->interp_flags |= BINPRM_FLAGS_ENFORCE_NONDUMP;
>> -
>> - /* Ensure mm->user_ns contains the executable */
>> - user_ns = old = bprm->mm->user_ns;
>> - while ((user_ns != &init_user_ns) &&
>> - !privileged_wrt_inode_uidgid(user_ns, inode))
>> - user_ns = user_ns->parent;
>>
>> - if (old != user_ns) {
>> - bprm->mm->user_ns = get_user_ns(user_ns);
>> - put_user_ns(old);
>> - }
>> - }
>> -}
>> -EXPORT_SYMBOL(would_dump);
>> -
>> -void setup_new_exec(struct linux_binprm * bprm)
>> -{
>> - struct task_struct *me = current;
>> /*
>> * Once here, prepare_binrpm() will not be called any more, so
>> * the final state of setuid/setgid/fscaps can be merged into the
>> @@ -1414,8 +1382,6 @@ void setup_new_exec(struct linux_binprm * bprm)
>> bprm->rlim_stack.rlim_cur = _STK_LIM;
>> }
>>
>> - arch_pick_mmap_layout(me->mm, &bprm->rlim_stack);
>> -
>> me->sas_ss_sp = me->sas_ss_size = 0;
>>
>> /*
>> @@ -1430,16 +1396,9 @@ void setup_new_exec(struct linux_binprm * bprm)
>> else
>> set_dumpable(current->mm, SUID_DUMP_USER);
>>
>> - arch_setup_new_exec();
>> perf_event_exec();
>
> What is perf expecting to be able to examine at this point? Does it want
> a view of things after arch_setup_new_exec()? (i.e. "final" TIF flags,
> mmap layout, etc.) From what I can, the answer is "no, it's just
> resetting counters", so I think this is fine. Maybe double-check with
> Steve?

I can't find anything in the perf code that depends on
arch_pick_mmap_layout or mm->task_size. So I don't have any concerns.
I have grepped through both kernel/events/ and arch/x86/events/ and
include/trace to double check and have nothing turned up.

I can't see the policy of where things will be allocated in the
memory map making any difference to perf.

Depending on what events actually are I can imagine then firing and
having issues as I can imagine an event to be just about anything
but I don't see a way to prevent that.

I do see perf disabling events that are based on addresses. I further
see perf enabling/disabling events that have already been computed. I
see perf treating exec effectively as a process scheduling out and in.

Then finally I see perf shutting itself down on suid exec, and
generating some final perf events. I have some concerns that is
a bit late, and that the test might not be quite right but nothing
particular to this change.

>> __set_task_comm(me, kbasename(bprm->filename), true);
>>
>> - /* Set the new mm task size. We have to do that late because it may
>> - * depend on TIF_32BIT which is only updated in flush_thread() on
>> - * some architectures like powerpc
>> - */
>> - me->mm->task_size = TASK_SIZE;
>> -
>> /* An exec changes our domain. We are no longer part of the thread
>> group */
>> WRITE_ONCE(me->self_exec_id, me->self_exec_id + 1);
>> @@ -1467,6 +1426,50 @@ void setup_new_exec(struct linux_binprm * bprm)
>> * credentials; any time after this it may be unlocked.
>> */
>> security_bprm_committed_creds(bprm);
>
> Similarly for the LSM hook: is it expecting a post-arch-setup view? I
> don't see anything looking at task_size, TIF flags, or anything else;
> they seem to be just cleaning up from the old process being replaced, so
> against, I think this is okay.

Nothing at all with the mm. The LSM hooks close files and
muck with rlimits and signals, and tidy up their lsm state.

There are only 3 implementations apparmor, tomoyo and selinux
so it isn't too hard to read through them.

> Not visible in this patch, the following things how happen earlier,
> which I feel should maybe get called out in the changelog, with,
> perhaps, better justification than what I've got here:
>
> bprm->secureexec set/check (looks safe, since it depends on
> prepare_binprm()'s security_bprm_set_creds().
>
> rlim_stack.rlim_cur setting (safe, just needs to happen before
> arch_pick_mmap_layout())
>
> dumpable() check (looks safe, BINPRM_FLAGS_ENFORCE_NONDUMP depends on
> much earlier would_dump(), and uid/gid depend on earlier calls to
> prepare_binprm()'s bprm_fill_uid())
>
> __set_task_comm (looks safe, just dealing with the task name...)
>
> self_exec_id bump (looks safe, but I think -- it's still after uid
> setting)
>
> flush_signal_handlers() (looks safe -- nothing appears to depend on mm
> nor personality)

Agreed.

>> + return 0;
>> +
>> +out_unlock:
>> + mutex_unlock(&me->signal->exec_update_mutex);
>> +out:
>> + return retval;
>> +}
>> +EXPORT_SYMBOL(flush_old_exec);
>> +
>> +void would_dump(struct linux_binprm *bprm, struct file *file)
>> +{
>> + struct inode *inode = file_inode(file);
>> + if (inode_permission(inode, MAY_READ) < 0) {
>> + struct user_namespace *old, *user_ns;
>> + bprm->interp_flags |= BINPRM_FLAGS_ENFORCE_NONDUMP;
>> +
>> + /* Ensure mm->user_ns contains the executable */
>> + user_ns = old = bprm->mm->user_ns;
>> + while ((user_ns != &init_user_ns) &&
>> + !privileged_wrt_inode_uidgid(user_ns, inode))
>> + user_ns = user_ns->parent;
>> +
>> + if (old != user_ns) {
>> + bprm->mm->user_ns = get_user_ns(user_ns);
>> + put_user_ns(old);
>> + }
>> + }
>> +}
>> +EXPORT_SYMBOL(would_dump);
>
> The diff helpfully decided this moved would_dump(). ;) Is it worth
> maybe just moviing it explicitly above flush_old_exec() to avoid this
> churn? I dunno.

Given the amount of review and testing that has been put in at
this point I don't think so.

>> +
>> +void setup_new_exec(struct linux_binprm * bprm)
>> +{
>> + /* Setup things that can depend upon the personality */
>
> Should this comment be above the function instead?

My experience has been that comments above functions unless they are in
full linuxdoc tend to be less well maintained than comments within the
function itself. So I don't think it is worth moving.x

>> + struct task_struct *me = current;
>> +
>> + arch_pick_mmap_layout(me->mm, &bprm->rlim_stack);
>> +
>> + arch_setup_new_exec();
>> +
>> + /* Set the new mm task size. We have to do that late because it may
>> + * depend on TIF_32BIT which is only updated in flush_thread() on
>> + * some architectures like powerpc
>> + */
>> + me->mm->task_size = TASK_SIZE;
>> mutex_unlock(&me->signal->exec_update_mutex);
>> mutex_unlock(&me->signal->cred_guard_mutex);
>> }
>> --
>> 2.20.1
>>
>
> So, as I say, I *think* this is okay, but I always get suspicious about
> reordering things in execve(). ;)
>
> So, with a bit larger changelog discussing what's moving "earlier",
> I think this looks good:

Please see above.

Eric