Re: [PATCH RFC] sched: add notifier for process migration

From: Jason Baron
Date: Wed Oct 14 2009 - 10:43:39 EST


On Wed, Oct 14, 2009 at 11:26:10AM +0200, Peter Zijlstra wrote:
> On Wed, 2009-10-14 at 09:05 +0200, Ingo Molnar wrote:
> > * Jeremy Fitzhardinge <jeremy@xxxxxxxx> wrote:
> >
> > > @@ -1981,6 +1989,12 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
> > > #endif
> > > perf_swcounter_event(PERF_COUNT_SW_CPU_MIGRATIONS,
> > > 1, 1, NULL, 0);
> > > +
> > > + tmn.task = p;
> > > + tmn.from_cpu = old_cpu;
> > > + tmn.to_cpu = new_cpu;
> > > +
> > > + atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
> >
> > We already have one event notifier there - look at the
> > perf_swcounter_event() callback. Why add a second one for essentially
> > the same thing?
> >
> > We should only put a single callback there - a tracepoint defined via
> > TRACE_EVENT() - and any secondary users can register a callback to the
> > tracepoint itself.
> >
> > There's many similar places in the kernel - with notifier chains and
> > also with a need to get tracepoints there. The fastest (and most
> > consistent) solution is to add just a single event callback facility.
>
> But that would basically mandate tracepoints to be always enabled, do we
> want to go there?
>
> I don't think the overhead of tracepoints is understood well enough,
> Jason you poked at that, do you have anything solid on that?
>

Currently, the cost of a disabled tracepoint is a global memory read, a
compare, and then a jump. On the x86 systems I've tested, this averages
anywhere between 40 and 100 cycles per tracepoint. On top of that, there
is the icache overhead of the extra instructions that we skip over. I'm
not sure how to measure that beyond looking at their size.
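
To make the read/compare/jump sequence concrete, here is a minimal
userspace sketch of what a tracepoint call site boils down to. All of
the names (demo_tracepoint, demo_trace_migrate, and so on) are
hypothetical stand-ins for illustration, not the kernel's actual
tracepoint structures or macros.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a tracepoint: an enable flag that lives in
 * global memory plus one probe callback. Not the kernel's real layout. */
struct demo_tracepoint {
    int enabled;                          /* loaded at every call site */
    void (*probe)(int from_cpu, int to_cpu);
};

static int demo_hits;                     /* number of probe invocations */

static void demo_count_probe(int from_cpu, int to_cpu)
{
    (void)from_cpu;
    (void)to_cpu;
    demo_hits++;
}

static struct demo_tracepoint demo_migrate_tp = { 0, demo_count_probe };

/* Call site: load the flag, compare it with zero, and jump over the
 * probe call when the tracepoint is off. */
static void demo_trace_migrate(struct demo_tracepoint *tp,
                               int from_cpu, int to_cpu)
{
    assert(tp != NULL);
    if (tp->enabled)                      /* memory read + compare + jump */
        tp->probe(from_cpu, to_cpu);
}
```

Even with the flag clear, every pass through demo_trace_migrate() still
pays for the load and the branch, and the skipped call sequence still
occupies space in the instruction stream.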

I've proposed a 'jump label' set of patches, which essentially hard-codes
a jump around the disabled code (avoiding the memory reference).
However, this introduces a high 'write' cost, in that we have to
code-patch the jmp to a 'jmp 0' to enable the code.

Along with this optimization I'm also looking into a method for moving
the disabled text to a 'cold' text section, to reduce the icache
overhead. Using these techniques we can reduce the disabled case to
essentially a couple of cycles per tracepoint.
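
As a rough userspace analogy for the cold-section idea (not the
mechanism in my patches), GCC and Clang let you mark the rarely taken
path as cold and hint the branch as unlikely. The names below are
hypothetical, and whether the cold function actually ends up in a
separate section such as .text.unlikely depends on the compiler and
optimization level.

```c
#include <assert.h>

static int cold_demo_enabled;        /* hypothetical enable flag */
static int cold_demo_slow_calls;     /* how often the slow path ran */

/* cold: hint that this function is rarely executed, so the compiler may
 * keep it away from hot code. noinline: keep its body out of the caller. */
__attribute__((cold, noinline))
static void cold_demo_slow_path(int from_cpu, int to_cpu)
{
    assert(from_cpu != to_cpu);      /* a migration changes the CPU */
    cold_demo_slow_calls++;
}

/* Hot path: only the flag test stays inline; the tracing work itself
 * lives in the out-of-line cold function. */
static inline void cold_demo_trace(int from_cpu, int to_cpu)
{
    if (__builtin_expect(cold_demo_enabled, 0))   /* expected to be off */
        cold_demo_slow_path(from_cpu, to_cpu);
}
```

Note that this sketch still reads the flag from memory on every call. It
only illustrates moving the disabled text out of the hot instruction
stream, which is the icache half of the problem.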

In this case, where the tracepoint is always on, we wouldn't want to
move the tracepoint text to a cold section. Thus, I could introduce a
default enabled/disabled bias to the tracepoint.

However, in introducing such a feature, we are essentially forcing an
always-on or always-off usage pattern, since the switch cost is high.
So I want to be careful not to limit the usefulness of tracepoints with
such an optimization.

thanks,

-Jason