Re: [RFC] perf_events: support for uncore a.k.a. nest units

From: Peter Zijlstra
Date: Thu Jan 28 2010 - 05:58:17 EST


On Wed, 2010-01-27 at 11:50 -0800, Corey Ashford wrote:
> On 1/27/2010 2:28 AM, Ingo Molnar wrote:
> >
> > * Corey Ashford <cjashfor@xxxxxxxxxxxxxxxxxx> wrote:
> >
> >> On 1/21/2010 11:13 AM, Corey Ashford wrote:
> >>>
> >>>
> >>> On 1/20/2010 11:21 PM, Ingo Molnar wrote:
> >>>>
> >>>> * Corey Ashford <cjashfor@xxxxxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>>> I really think we need some sort of data structure which is passed
> >>>>> from the kernel to user space to represent the topology of the
> >>>>> system, and give useful information to be able to identify each PMU
> >>>>> node. Whether this is done with a sysfs-style tree, a table in a
> >>>>> file, XML, etc.; it doesn't really matter much, but it needs to be
> >>>>> something that can be parsed relatively easily and *contains just
> >>>>> enough information* for the user to be able to correctly choose
> >>>>> PMUs, and for the kernel to be able to relate that back to actual
> >>>>> PMU hardware.
> >>>>
> >>>> The right way would be to extend the current event description under
> >>>> /debug/tracing/events with hardware descriptors, and (maybe) to
> >>>> formalise this into a separate /proc/events/ or into a separate
> >>>> filesystem.
> >>>>
> >>>> The advantage of this is that in the grand scheme of things we
> >>>> _really_ don't want to limit performance events to 'hardware'
> >>>> hierarchies, or to devices/sysfs, some existing /proc scheme, or any
> >>>> other arbitrary (and fundamentally limiting) object enumeration.
> >>>>
> >>>> We want a unified, logical enumeration of all events and objects that
> >>>> we care about from a performance monitoring and analysis point of
> >>>> view, shaped for the purpose of, and parsed by, perf user-space. And
> >>>> since the current event descriptors are already rather rich, as they
> >>>> enumerate all sorts of things:
> >>>>
> >>>> - tracepoints
> >>>> - hw-breakpoints
> >>>> - dynamic probes
> >>>>
> >>>> etc., and are well used by tooling, we should expand those with real
> >>>> hardware structure.
> >>>
> >>> This is intriguing; I like the idea of generalizing all of this info
> >>> into one structure.
> >>>
> >>> So you think that this structure should contain event info as well? If
> >>> these structures are created by the kernel, I think that would
> >>> necessitate placing large event tables into the kernel, which is
> >>> something I think we'd prefer to avoid because of the amount of memory
> >>> it would take. Keep in mind that we need not only event names, but event
> >>> descriptions, encodings, attributes (e.g. unit masks), attribute
> >>> descriptions, etc. I suppose the kernel could read a file from the file
> >>> system and then add this info to the tree, but that just seems bad. Are
> >>> there existing places where the kernel reads a user space file in order
> >>> to populate a pseudo filesystem?
> >>>
> >>> I think keeping event naming in user space and PMU naming in kernel
> >>> space might be a better idea: the kernel exposes the available PMUs to
> >>> user space via some structure, and a user space library tries to
> >>> recognize the exposed PMUs and provide event lists and other needed
> >>> info. The perf tool would use this library to list available events to
> >>> users.
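> >>>
> >>> Roughly, I'm picturing a library interface along these lines (all of
> >>> these names are invented, just to illustrate the split between kernel
> >>> and user space):
> >>>
> >>> #include <stddef.h>
> >>> #include <stdint.h>
> >>>
> >>> /* Hypothetical user-space library: the kernel exposes only PMU names
> >>>  * and topology; the library maps each name to an event table that it
> >>>  * ships, keeping the large tables out of kernel memory. */
> >>> struct pmu_event {
> >>>         const char *name;        /* e.g. "read_misses" */
> >>>         const char *description; /* human-readable text */
> >>>         uint64_t    encoding;    /* raw config value for the PMU */
> >>> };
> >>>
> >>> struct pmu_event_table {
> >>>         const char             *pmu_name; /* as exposed by the kernel */
> >>>         const struct pmu_event *events;
> >>>         size_t                  nr_events;
> >>> };
> >>>
> >>> /* Looks up the event table for a PMU name read from the kernel's
> >>>  * enumeration; returns NULL for an unrecognized PMU. */
> >>> const struct pmu_event_table *pmu_events_lookup(const char *pmu_name);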
> >>>
> >>
> >> Perhaps another way of handling this would be to have the kernel
> >> dynamically load a specific "PMU kernel module" once it has detected a
> >> particular PMU in the hardware. The module would consist only of a data
> >> structure and a simple API to access the event data. This way, only the
> >> PMUs that actually exist in the hardware would need to be loaded into
> >> memory, and perhaps then only temporarily (just long enough to create the
> >> pseudo fs nodes).
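> >>
> >> As a rough sketch (struct, helper, and PMU names here are all invented
> >> and untested), such a module would be nothing but a table plus one
> >> registration call:
> >>
> >> #include <linux/kernel.h>
> >> #include <linux/module.h>
> >> #include <linux/init.h>
> >>
> >> /* Hypothetical data-only module for one PMU type; pmu_event_desc and
> >>  * pmu_register_event_table() do not exist today. */
> >> static const struct pmu_event_desc p7_nest_events[] = {
> >>         { .name = "read_hits",   .encoding = 0x01, .desc = "..." },
> >>         { .name = "read_misses", .encoding = 0x02, .desc = "..." },
> >> };
> >>
> >> static int __init p7_nest_events_init(void)
> >> {
> >>         /* creates the pseudo fs nodes from the table; after this the
> >>          * module could in principle be unloaded again */
> >>         return pmu_register_event_table("p7_nest", p7_nest_events,
> >>                                         ARRAY_SIZE(p7_nest_events));
> >> }
> >> module_init(p7_nest_events_init);
> >>
> >> MODULE_LICENSE("GPL");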
> >>
> >> Still, though, since it's a pseudo fs, all of that event data would be
> >> taking up kernel memory.
> >>
> >> Another model, perhaps, would be to actually write this data out to a real
> >> file system upon every boot up, so that it wouldn't need to be held in
> >> memory. That seems rather ugly and time consuming, though.
> >
> > I don't think memory consumption is a problem at all. The structure of the
> > monitored hardware/software state is information we _want_ the kernel to
> > provide, mainly because there's no unified repository for user-space to get
> > this info from.
> >
> > If someone doesn't want it on some ultra-embedded box, then sure, a .config
> > switch can be provided to allow it to be turned off.
> >
> > Ingo
>
> Ok, just so that we quantify things a bit, let's say I have 20 different types
> of PMUs totalling 2000 different events, each of which has a name and text
> description averaging 300 characters. Along with that, let's say there are 4
> 64-bit words of metadata per event describing the encoding, which attributes
> apply to the event, and any other needed info. I don't know how much memory
> each pseudo fs node takes up; let me guess and say 128 bytes for each event
> node (the amount taken by the PMU nodes would be negligible compared with the
> event nodes).
>
> So that's 2000 * (300 + 32 + 128) bytes ~= 920KB of memory.
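>
> For concreteness, the per-event record I have in mind would look something
> like this (purely illustrative, sized to match the numbers above):
>
> #include <linux/types.h>
>
> struct event_record {
>         char name_and_desc[300]; /* event name + description text */
>         u64  encoding;           /* raw event code */
>         u64  attr_mask;          /* which attributes apply */
>         u64  flags;              /* any other needed info */
>         u64  reserved;
> };      /* ~332 bytes each, plus ~128 bytes of pseudo fs node overhead */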
>
> Let's assume that the correct event module can be loaded dynamically, so that
> a kernel built for a particular arch doesn't need to carry every possible
> event set.
>
> Any opinions on whether allocating this amount of kernel memory would be
> acceptable? It seems like a lot of kernel memory to me, but I come from an
> embedded systems background. Granted, most systems are going to use a fraction
> of that amount of memory (<100KB) due to having far fewer PMUs and therefore
> fewer distinct event types.
>
> There's at least one more dimension to this. Let's say I have 16 uncore PMUs,
> all of the same type, each of which has, for example, 8 events. As a very
> crude pseudo fs, let's say we have a structure like this:
>
> /sys/devices/pmus/
>     uncore_pmu0/
>         event0/    (path name to here is the name of the pmu and event)
>             description              (file)
>             applicable_attributes    (file)
>         event1/
>             description
>             applicable_attributes
>         event2/
>         ...
>         event7/
>             ...
>     uncore_pmu1/
>         event0/
>             description
>             applicable_attributes
>         ...
>     ...
>     uncore_pmu15/
>         ...

I really don't like this. The cpu->uncore map is fixed by the
topology of the machine, which is already available in /sys some place.

Let's simply use the cpu->node mapping and use PERF_TYPE_NODE{,_RAW} or
something like that. We can start with 2 generic events for that type,
local/remote memory accesses, and take it from there.
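
On the tool side that would look something like the below; note that
PERF_TYPE_NODE and the two event ids are of course made up at this
point:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* made-up values, just to illustrate the proposed interface */
#define PERF_TYPE_NODE             6 /* hypothetical new attr.type   */
#define PERF_COUNT_NODE_LOCAL_MEM  0 /* local memory accesses        */
#define PERF_COUNT_NODE_REMOTE_MEM 1 /* remote memory accesses       */

int main(void)
{
	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size   = sizeof(attr);
	attr.type   = PERF_TYPE_NODE;
	attr.config = PERF_COUNT_NODE_LOCAL_MEM;

	/* count local memory accesses of the calling task on any cpu;
	 * the cpu->node map ties the count to the right uncore unit */
	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	/* ... run the workload, then read() the counter value ... */
	close(fd);
	return 0;
}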


