Re: [RFC] perf: perf record sets inherit by default

From: Stephane Eranian
Date: Mon May 17 2010 - 10:25:38 EST


On Tue, May 11, 2010 at 4:48 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Tue, 2010-05-11 at 16:04 +0200, Stephane Eranian wrote:
>> Hi,
>>
>>
>> I am confused by the inheritance cmd line option of perf record:
>>
>> $ perf record -h
>>  usage: perf record [<options>] [<command>]
>>     or: perf record [<options>] -- <command> [<options>]
>>
>>     -e, --event <event>   event selector. use 'perf list' to list available events
>>         --filter <filter>
>>                           event filter
>>     -p, --pid <n>         record events on existing process id
>>     -t, --tid <n>         record events on existing thread id
>>     -r, --realtime <n>    collect data with this RT SCHED_FIFO priority
>>     -R, --raw-samples     collect raw sample records from all opened counters
>>     -a, --all-cpus        system-wide collection from all CPUs
>>     -A, --append          append to the output file to do incremental profiling
>>     -C, --profile_cpu <n>
>>                           CPU to profile on
>>     -f, --force           overwrite existing data file (deprecated)
>>     -c, --count           event period to sample
>>     -o, --output <file>   output file name
>>     -i, --inherit         child tasks inherit counters
>>
>> This leads me to believe that, by default, inheritance in children is off.
>>
>> However, builtin-record.c says:
>>
>> static bool inherit = true;
>>
>> If that's the case, what's the point of the -i option?
>
> Right, I think we should invert that, does --no-inherit work?
>
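To make the point about the redundant option concrete, here is a minimal sketch of a flag that defaults to true. It is not perf's option parser: parse_inherit() is a hypothetical helper, and it simply assumes that a --no-inherit spelling is accepted, which is exactly the open question above.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper for illustration; not perf's real parse-options code. */
static bool parse_inherit(int argc, const char **argv)
{
	bool inherit = true;	/* same default as in builtin-record.c */

	for (int i = 1; i < argc; i++) {
		if (!strcmp(argv[i], "-i") || !strcmp(argv[i], "--inherit"))
			inherit = true;		/* no effect: the default is already true */
		else if (!strcmp(argv[i], "--no-inherit"))
			inherit = false;	/* assumed negated form, the only way to turn it off */
	}
	return inherit;
}
```

With this layout, passing -i and passing nothing give the same result, so only a negated form can change the behavior.
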
>> Another side effect of inheritance is that in per-thread mode,
>> perf creates as many "sessions" as you have CPUs. So
>> on a 16-way processor, sampling on cycles, perf creates
>> 16 events and 16 x 2-page sampling buffers. That's a lot of
>> resources consumed if I am just interested in monitoring
>> a single-threaded workload.
>
> Right, but I think the default of inherit is right, and once you do that
> you basically have to do the per-task-per-cpu thing, otherwise your
> fancy 16-way will start spending most of its time in cacheline bounces.
>
In that case, don't you think you should also ensure that each buffer is
allocated on the NUMA node of the CPU that its per-thread-per-cpu event is
bound to? I don't think that is the case today.
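
For reference, the resource cost discussed above can be estimated with a rough sketch like the one below. The helper names are hypothetical and nothing here is taken from perf. It assumes one sampled event, one event per CPU when inheritance is on, a single event when it is off, and it counts only the sampling pages mentioned in the thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helpers for illustration; not taken from perf. */

/* Events opened for one sampled event on one monitored thread. */
static size_t nr_events(size_t nr_cpus, bool inherit)
{
	/* Assumption: inherit forces one event per CPU, otherwise one suffices. */
	return inherit ? nr_cpus : 1;
}

/* Total bytes of sampling buffer across all opened events. */
static size_t buffer_bytes(size_t nr_cpus, bool inherit,
			   size_t pages_per_buf, size_t page_size)
{
	return nr_events(nr_cpus, inherit) * pages_per_buf * page_size;
}
```

On a 16-way machine with 2-page buffers and an assumed 4096-byte page size, buffer_bytes(16, true, 2, 4096) is 131072 bytes (128 KiB) spread over 16 events, while buffer_bytes(16, false, 2, 4096) is 8192 bytes for a single event.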