Re: [PATCH] [RFC] syscalls,x86: Add execveat() system call (v2)

From: Al Viro
Date: Thu Aug 02 2012 - 06:30:27 EST


On Thu, Aug 02, 2012 at 10:14:53AM +0100, Meredydd Luff wrote:
> On Thu, Aug 2, 2012 at 7:55 AM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> >> This means you need an x32 version of the function -- execve
> >> unfortunately is one of the few system calls which require a special x32
> >> version (although it's a simple wrapper around sys32_execve). See
> >> sys_x32_execve.
> >
> > I *really* strongly object to doing that thing before we sanitize the
> > situation with sys_execve().
>
> "That thing" = "creating an x32 entry stub", or "merging execveat() at all"?
>
> (snip)
> > The thing is, there's essentially no reason to have more than one
> > implementation. What they are (badly) doing is "we need to find
> > pt_regs to pass to do_execve(), the thing we are after has to be near
> > our stack frame, so let's try to get to it that way".
>
> Hang on...it's not just sys_execve that fits that description, is it?
> You seem to be describing every call that needs a pt_regs parameter,
> which at a glance is anything with a stub_ or PTREGSCALL in
> arch/x86/kernel/entry_{32,64}.S. That's: clone, fork, vfork,
> sigaltstack, iopl, execve, sigreturn, rt_sigreturn, vm86, vm86old.
> Most of those are handled by a common PTREGSCALL macro, but there are
> a few that get special treatment (different set on each arch - on
> x86-64 it's execve and rt_sigreturn ; on i386 it's just clone).
>
> Is there's something special about execve in particular, or do you
> want to overhaul all the ptregscalls?

You are looking at that from the wrong side; it's not that this
set of syscalls on x86 has the same kind of wrapper - it's that
on different architectures the kludges are seriously different
and fairly brittle. sigaltstack(), BTW, doesn't need to be
arch-specific at all - if you check what its pt_regs argument
is used for, we just need something like current_user_stack_pointer()
and that's it. It's a different patch series, anyway.

There's a difference between "here's the syscall implementation,
here's its hookup for x86, let other architectures update their
syscall tables" and "here's the implementation in arch/x86 and its
hookup for x86; let other architectures port it in whatever way
they need". The latter is a recipe for breakage - we'd been
through that kind of story quite a few times.

Out of that set, vm86/vm86old/iopl are genuine x86-only syscalls.
sigreturn and rt_sigreturn are weird, subtle and, sadly,
unmergable - syscall restart prevention logics is different
enough to make it not feasible at the moment. It's a serious
source of PITA, BTW - quite a few of those had the same bugs,
years after they'd been discovered and fixed on a subset of
architectures. Been there, fixed some of those, the latest
batch this April... fork/vfork/clone is an interesting
question - IIRC, there are seriously subtle issues of some
sort on sparc, but I don't remember details right now.

Really, go and grep for do_execve; most of the callers will
be in execve(2) implementations. Look at them carefully -
while some get pt_regs from some wrapper, there's a bunch
that does that manually in C. "Ugly" doesn't even begin
to describe what's being done...

FWIW, I've just pushed (completely untested) arm and alpha
parts of what I described into signal.git#execve2; x86 is
next. Note that after that sys_execve() is identical on
converted architectures and can be merged; ditto for
kernel_execve(). After I do x86 counterpart, I'll
take those guys to fs/exec.c under ifdef for new __ARCH_HAS_...
(and define it on already converted ones, obviously).
Then your patch goes there, except that implementation
gets put into fs/exec.c, under the same ifdef. And with
current_pt_regs() used instead of the extra argument,
of course. From that point on it can be used on any converted
architecture. And conversion consists of
* providing ret_from_kernel_execve() that would
put pt_regs where they belong and return to userland.
* providing current_pt_regs(), with default being
just task_pt_regs(current).
* defining that new __ARCH_HAS_...
* removing sys_execve()/kernel_execve() implementations;
the ones in fs/exec.c will work just fine.
Can be done at leisure, architecture by architecture. It's
obviously the next cycle fodder - we have only a couple of
days left of this merge window, anyway.

I really think that this pair of primitives is the right
way to deal with execve mess.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/