Re: [RFC] perf: perf record sets inherit by default

From: Stephane Eranian
Date: Mon May 17 2010 - 10:25:38 EST

On Tue, May 11, 2010 at 4:48 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Tue, 2010-05-11 at 16:04 +0200, Stephane Eranian wrote:
>> Hi,
>> I am confused by the inheritance cmd line option of perf record:
>> $ perf record -h
>>  usage: perf record [<options>] [<command>]
>>     or: perf record [<options>] -- <command> [<options>]
>>     -e, --event <event>   event selector. use 'perf list' to list
>>                           available events
>>         --filter <filter>
>>                           event filter
>>     -p, --pid <n>         record events on existing process id
>>     -t, --tid <n>         record events on existing thread id
>>     -r, --realtime <n>    collect data with this RT SCHED_FIFO priority
>>     -R, --raw-samples     collect raw sample records from all opened counters
>>     -a, --all-cpus        system-wide collection from all CPUs
>>     -A, --append          append to the output file to do incremental profiling
>>     -C, --profile_cpu <n>
>>                           CPU to profile on
>>     -f, --force           overwrite existing data file (deprecated)
>>     -c, --count           event period to sample
>>     -o, --output <file>   output file name
>>     -i, --inherit         child tasks inherit counters
>> This leads one to believe that, by default, inheritance in children is off.
>> However, builtin-record.c says:
>> static bool inherit = true;
>> If that's the case, what's the point of the -i option?
> Right, I think we should invert that, does --no-inherit work?
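To make the point concrete, here is a minimal sketch of a flag whose default is already true. It uses plain getopt_long(), not perf's own option parser, and parse_inherit() is a hypothetical helper, so treat it as an illustration only. With that default, -i/--inherit cannot change anything, and only a --no-inherit form has an effect:

```c
#include <assert.h>
#include <getopt.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not perf's real option parser: returns the effective
 * inherit setting after scanning argv. The default is true, matching
 * "static bool inherit = true;" in builtin-record.c.
 */
bool parse_inherit(int argc, char **argv)
{
	int inherit = 1;	/* default on, as in builtin-record.c */
	int c;
	const struct option longopts[] = {
		{ "inherit",    no_argument, &inherit, 1 },	/* no-op with this default */
		{ "no-inherit", no_argument, &inherit, 0 },	/* the only form that changes anything */
		{ 0, 0, 0, 0 }
	};

	assert(argc >= 1 && argv != NULL);
	optind = 1;		/* restart scanning so the helper can be called again */
	while ((c = getopt_long(argc, argv, "i", longopts, NULL)) != -1) {
		if (c == 'i')	/* short form: sets what is already set */
			inherit = 1;
	}
	return inherit != 0;
}
```

Called with no options or with -i, the helper returns true either way; only --no-inherit yields false.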
>> Another side effect of inheritance is that in per-thread mode,
>> perf creates as many "sessions" as you have CPUs. So
>> on a 16-way processor, sampling on cycles, perf creates
>> 16 events and 16 x 2-page sampling buffers. That's a lot of
>> resources consumed if I am just interested in monitoring
>> a single-threaded workload.
> Right, but I think the default of inherit is right, and once you do that
> you basically have to do the per-task-per-cpu thing, otherwise your
> fancy 16-way will start spending most of its time in cacheline bounces.
In that case, don't you think you should also ensure that each buffer is
allocated on the NUMA node of the CPU that its per-thread-per-cpu event is
bound to? I don't think that is the case today.
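
For reference, the per-thread cost discussed above can be tallied with a few lines of C. This is only a sketch: per_thread_cost() and struct session_cost are hypothetical names, and the non-inherit branch assumes a single event and buffer per requested event, which is my reading of the discussion rather than something I have checked in the code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tally of what a per-thread session costs; names are illustrative. */
struct session_cost {
	size_t events;	/* events opened */
	size_t pages;	/* sampling-buffer pages mapped in total */
};

struct session_cost per_thread_cost(size_t ncpus, size_t nevents,
				    size_t pages_per_buffer, int inherit)
{
	struct session_cost cost;

	assert(nevents > 0 && pages_per_buffer > 0);
	assert(!inherit || ncpus > 0);
	/*
	 * With inherit: one event and one buffer per CPU for each requested event.
	 * Without inherit (assumption): one event and one buffer per requested event.
	 */
	cost.events = inherit ? ncpus * nevents : nevents;
	cost.pages = cost.events * pages_per_buffer;
	return cost;
}
```

With the numbers from my example (16 CPUs, one cycles event, 2-page buffers), per_thread_cost(16, 1, 2, 1) gives 16 events and 32 pages, against 1 event and 2 pages under the non-inherit assumption.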