Re: [PATCH 08/12] add trace events for each syscall entry/exit

From: Frederic Weisbecker
Date: Tue Aug 25 2009 - 13:31:21 EST


On Tue, Aug 25, 2009 at 06:59:14PM +0200, Frederic Weisbecker wrote:
> On Tue, Aug 25, 2009 at 12:20:04PM -0400, Mathieu Desnoyers wrote:
> > * Hendrik Brueckner (brueckner@xxxxxxxxxxxxxxxxxx) wrote:
> > > On Tue, Aug 25, 2009 at 04:15:49PM +0200, Frederic Weisbecker wrote:
> > > > On Tue, Aug 25, 2009 at 02:50:27PM +0200, Hendrik Brueckner wrote:
> > > > > There are at least two scenarios where syscall_get_nr() can return -1:
> > > > >
> > > > > 1. For example, ptrace stores an invalid syscall number, and thus,
> > > > > tracing code resets it.
> > > > > (see do_syscall_trace_enter in arch/s390/kernel/ptrace.c)
> > > > >
> > > > > 2. The syscall_regfunc() (kernel/tracepoint.c) sets the TIF_SYSCALL_FTRACE
> > > > > (now: TIF_SYSCALL_TRACEPOINT) flag for all threads which includes
> > > > > kernel threads.
> > > > > However, the ftrace selftest triggers a kernel oops when testing syscall
> > > > > trace points:
> > > > > - The kernel thread is started as ususal (do_fork()),
> > > > > - tracing code sets TIF_SYSCALL_FTRACE,
> > > > > - the ret_from_fork() function is triggered and starts
> > > > > ftrace_syscall_exit() with an invalid syscall number.
> > > >
> > > >
> > > >
> > > > I wonder if there is any way to identify such situation...?
> > > For the second case, it might be an option to avoid setting the
> > > TIF_SYSCALL_FTRACE flag for kernel threads.
> > >
> > > Kernel threads have task_struct->mm set to NULL.
> > > (Thanks to Heiko for that hint ;-)
> > >
> > > The idea is then to check the mm field in syscall_regfunc() and
> > > set the flag accordingly.
> > >
> > > However, I think the patch is an optional add-on becase checking
> > > the syscall number is still required for case 1).
> > >
> > > ---
> > > kernel/tracepoint.c | 4 +++-
> > > 1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > --- a/kernel/tracepoint.c
> > > +++ b/kernel/tracepoint.c
> > > @@ -593,7 +593,9 @@ void syscall_regfunc(void)
> > > if (!sys_tracepoint_refcount) {
> > > read_lock_irqsave(&tasklist_lock, flags);
> > > do_each_thread(g, t) {
> > > - set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> > > + /* Skip kernel threads. */
> > > + if (t->mm)
> > > + set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> >
> > Uh ? kernel threads can invoke a system call. There are rare places
> > where kernel code actually invoke system calls. I don't see why we
> > should not deal with them.
>
>
>
> Yeah they do, but they don't use the sysenter path, they call the
> syscall helpers directly, such as do_fork() or things like that.
>
> The syscall tracepoints are set in the sysenter/sysexit path, then
> it's no use to trace the kernel threads, it doesn't have any effect,
> except random results in case of fork() calls, because we take
> the ret_from_fork() path that also ends up to trace_sys_exit()
> if the TIF_SYSCALL_TRACEPOINT thing is set, leading to such
> asymetric tracing.
>
> Kernel threads use syscalls toward wrappers such as create_thread().
> So instead, statically defined tracepoints in create_thread() and such
> other syscall wrappers for kernel threads seem more valuable, hmm?
>
>
> > Moreover, the problem you face is more general: if we set the
> > TIF_SYSCALL_FTRACE flag of a standard thread right in the middle of its
> > system call, x86_64 will cause the syscall exit to execute by re-reading
> > the thread flags and run a syscall trace exit.
>
>
> Well, I don't think that's the problem. The issue here, if I understand
> correctly, is that kernel threads don't take the sysenter path, then never hit
> the trace_sys_enter() call. And usually they won't ever hit any
> trace_sys_exit() calls except in the fork() case, because we take
> the ret_from_fork() path, which lead to syscall exit tracing due
> to the TIF flags set.
>
> At this stage, the syscall number is supposed to be stored in orig_eax,
> but because the kernel thread hasn't called fork() through a syscall and
> has called do_fork() directly, the regs values have nothing that look
> like syscall parameters.



I mean, I don't know how look like orig_eax at this stage.

Looking at arch/x86/kernel/process_32.c:copy_thread()

childregs = task_pt_regs(p);
*childregs = *regs;
childregs->ax = 0;
childregs->sp = sp;

p->thread.sp = (unsigned long) childregs;
p->thread.sp0 = (unsigned long) (childregs+1);

p->thread.ip = (unsigned long) ret_from_fork;


sp will be the struct pt_regs * passed to syscall_trace_leave()
later.

ax has the result of the fork syscall -> 0 for the child.
What about orig_eax which has the syscall nr? It depends on the pareent
and I don't know what it has at this stage.

I haven't seen crashes in x86 with kernel threads tracing, may be because
orig_eax is set to a valid syscall nr (may be even fork nr).

Perhaps it's not the case in s390 ?

Anyway, tracing kernel threads syscalls only gives us the fork return,
so it's something me may want to drop and trace higher level kernel
thread syscall wrappers instead.

Moreover every kernel threads is created through a kthreadd fork if
I'm not wrong, then it wouldn't be an accurate thing for us to trace the
fork calls in kernel thread. Tracing higher level kernel thread managment
sounds more interesting, we would then know who really created the thread,
etc...


>
> I guess we don't need to take the sys_enter tracing path to have a sane
> orig_eax in the sys_exit tracing path (for non kernel threads).
> Though I'm not sure about that, I should check to be sure.
>
> > We could simply initialize the "saved system calls id" number to
> > something like -1, so that if we happen to return from a syscall that
> > did not get its id recorded at syscall entry, we know it because it's
> > not initialized.
> >
> > We would need to carefully put back the -1 value after clearing the
> > thread flag when we stop tracing too (while still holding a mutex).
> >
> > Mathieu
> >
> > > } while_each_thread(g, t);
> > > read_unlock_irqrestore(&tasklist_lock, flags);
> > > }
> > >
> >
> > --
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/