Re: [PATCH 2/3] introduce intel_rapl driver

From: Zhang Rui
Date: Fri May 27 2011 - 04:28:45 EST


Hi, Peter,

On Thu, 2011-05-26 at 17:43 +0800, Peter Zijlstra wrote:
> On Thu, 2011-05-26 at 16:34 +0800, Zhang Rui wrote:
> > Introduce Intel RAPL driver.
> >
> > RAPL (running average power limit) is a new feature which provides mechanisms
> > to enforce power consumption limit, on some new processors.
> >
> > RAPL provides MSRs reporting the total amount of energy consumed
> > by the package/core/uncore/dram.
> > Further more, by using RAPL, OS can set a power bugdet in a certain time window,
> > and let Hardware to throttle the processor P/T-state to meet this enery limitation.
> >
> > Currently, we don't have the plan to support the RAPL power control,
> > but we do want to export the package/core/uncore/dram power consumption
> > information via perf tool first.
>
> Do note that perf is not the right API for those control bits. If you
> never plan to expose those, that's fine. If you do, you'll likely need a
> parallel API (your own device) for accessing that.

Agree.
I was thinking of registering RAPL as a platform device and set the
power limit via sysfs nodes.

> Please consider if
> using separate APIs for reading/writing this resource is what you want
> and mention these considerations in your future changelog.
>
okay. I'll do that.

> > Signed-off-by: Zhang Rui <rui.zhang@xxxxxxxxx>
> > ---
> > drivers/platform/x86/Kconfig | 8
> > drivers/platform/x86/Makefile | 1
> > drivers/platform/x86/intel_rapl.c | 368 ++++++++++++++++++++++++++++++++++++++
> > include/linux/perf_event.h | 4
> > 4 files changed, 381 insertions(+)
> >
> > Index: linux-2.6/drivers/platform/x86/Kconfig
> > ===================================================================
> > --- linux-2.6.orig/drivers/platform/x86/Kconfig
> > +++ linux-2.6/drivers/platform/x86/Kconfig
> > @@ -753,4 +753,12 @@ config SAMSUNG_LAPTOP
> > To compile this driver as a module, choose M here: the module
> > will be called samsung-laptop.
> >
> > +config INTEL_RAPL
> > + tristate "Intel RAPL Support"
> > + depends on X86
>
> Also very much depends on perf being there.
>
Agree.

> > + default y
> > + ---help---
> > + RAPL, AKA, Running Average Power Limit provides mechanisms to enforce
> > + power consumption limit.
>
> The enforce part seems dubious, perf is purely about observing state it
> doesn't enforce anything. Also this help text could do with expanding in
> general.
>
This help text is just a description of RAPL interface.
But you're right, I should be more specific about the CURRENT intel_rapl
driver status.

> > endif # X86_PLATFORM_DEVICES
> > Index: linux-2.6/drivers/platform/x86/Makefile
> > ===================================================================
> > --- linux-2.6.orig/drivers/platform/x86/Makefile
> > +++ linux-2.6/drivers/platform/x86/Makefile
> > @@ -42,3 +42,4 @@ obj-$(CONFIG_XO15_EBOOK) += xo15-ebook.o
> > obj-$(CONFIG_IBM_RTL) += ibm_rtl.o
> > obj-$(CONFIG_SAMSUNG_LAPTOP) += samsung-laptop.o
> > obj-$(CONFIG_INTEL_MFLD_THERMAL) += intel_mid_thermal.o
> > +obj-$(CONFIG_INTEL_RAPL) += intel_rapl.o
> > Index: linux-2.6/include/linux/perf_event.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/perf_event.h
> > +++ linux-2.6/include/linux/perf_event.h
> > @@ -107,6 +107,10 @@ enum perf_sw_ids {
> > PERF_COUNT_SW_PAGE_FAULTS_MAJ = 6,
> > PERF_COUNT_SW_ALIGNMENT_FAULTS = 7,
> > PERF_COUNT_SW_EMULATION_FAULTS = 8,
> > + PERF_COUNT_SW_PKG_ENERGY = 9,
> > + PERF_COUNT_SW_CORE_ENERGY = 10,
> > + PERF_COUNT_SW_UNCORE_ENERGY = 11,
> > + PERF_COUNT_SW_DRAM_ENERGY = 12,
>
> Not going to happen, RAPL registers its own pmu (wrongly, see below),
> with that it (should) get its own perf_event_attr::type and thus should
> have its own ::config space, you do not get to pollute the
> PERF_TYPE_SOFTWARE config space.

> Currently there isn't a way to expose the events in sysfs, but we do
> want that, its mostly a matter of getting all involved parties to agree
> on a format and implementing it.
>
I talked with Lin Ming just now, and he said that it should work in this
way:
First, only one pmu for RAPL interfaces, with four different kinds of
events, pkg/core/uncore/dram,
and the sysfs I/F is:
/sys/bus/event_source/devices/rapl/---|---type
|---pkg
|---core
|---uncore
|---dram

to use it, users can issue something like:
perf stat -P rapl -e pkg/core/uncore/dram foo
so that event->attr.type equals rapl_pmu.type and event->attr.config
equals one of the rapl_domain_id.

This sounds good. I can rewrite the code to work in this way, but it
doesn't work for now, until both sysfs I/F and perf tool being ready,
right?

> > PERF_COUNT_SW_MAX, /* non-ABI */
> > };
> > Index: linux-2.6/drivers/platform/x86/intel_rapl.c
> > ===================================================================
> > --- /dev/null
> > +++ linux-2.6/drivers/platform/x86/intel_rapl.c
>
> > +#define MSR_RAPL_POWER_UNIT 0x606
> > +
> > +/*
> > + * Platform specific RAPL Domains.
> > + * Note that PP1 RAPL Domain is supported on 062A only
> > + * And DRAM RAPL Domain is supported on 062D only
> > + */
>
> 0x62[AD] is useless, please use proper names.

> > +/* Package RAPL Domain */
> > +#define MSR_PKG_RAPL_POWER_LIMIT 0x610
> > +#define MSR_PKG_ENERGY_STATUS 0x611
> > +#define MSR_PKG_PERF_STATUS 0x613
> > +#define MSR_PKG_POWER_INFO 0x614
> > +
> > +/* PP0 RAPL Domain */
> > +#define MSR_PP0_POWER_LIMIT 0x638
> > +#define MSR_PP0_ENERGY_STATUS 0x639
> > +#define MSR_PP0_POLICY 0x63A
> > +#define MSR_PP0_PERF_STATUS 0x63B
> > +
> > +/* PP1 RAPL Domain, may reflect to uncore devices */
> > +#define MSR_PP1_POWER_LIMIT 0x640
> > +#define MSR_PP1_ENERGY_STATUS 0x641
> > +#define MSR_PP1_POLICY 0x642
> > +
> > +/* DRAM RAPL Domain */
> > +#define MSR_DRAM_POWER_LIMIT 0x618
> > +#define MSR_DRAM_ENERGY_STATUS 0x619
> > +#define MSR_DRAM_PERF_STATUS 0x61B
> > +#define MSR_DRAM_POWER_INFO 0x61C
> > +
> > +/* RAPL UNIT BITMASK */
> > +#define POWER_UNIT_OFFSET 0
> > +#define POWER_UNIT_MASK 0x0F
> > +
> > +#define ENERGY_UNIT_OFFSET 0x08
> > +#define ENERGY_UNIT_MASK 0x1F00
> > +
> > +#define TIME_UNIT_OFFSET 0x10
> > +#define TIME_UNIT_MASK 0xF000
>
> Are you sure? (x & TIME_UNIT_MASK) >> TIME_UNIT_OFFSET == 0.
> You either want a mask of 0xF0000, or an offset of 0x0c.
>
oops. It's 0xF0000. sorry about that.

> > +static int rapl_pmu_pkg_event_init(struct perf_event *event);
> > +static int rapl_pmu_core_event_init(struct perf_event *event);
> > +static int rapl_pmu_uncore_event_init(struct perf_event *event);
> > +static int rapl_pmu_dram_event_init(struct perf_event *event);
> > +static void rapl_event_start(struct perf_event *event, int flags);
> > +static void rapl_event_stop(struct perf_event *event, int flags);
> > +static int rapl_event_add(struct perf_event *event, int flags);
> > +static void rapl_event_del(struct perf_event *event, int flags);
> > +static void rapl_event_read(struct perf_event *event);
> > +
> > +enum rapl_domain_id {
> > + RAPL_DOMAIN_PKG,
> > + RAPL_DOMAIN_PP0,
> > + RAPL_DOMAIN_PP1,
> > + RAPL_DOMAIN_DRAM,
> > + RAPL_DOMAIN_MAX
> > +};
> > +
> > +struct rapl_domain_msr {
> > + int limit;
> > + int status;
> > +};
> > +
> > +struct rapl_domain {
> > + enum rapl_domain_id domain_id;
> > + struct rapl_domain_msr msrs;
> > + struct pmu pmu;
> > + enum perf_sw_ids event_id;
> > + int valid;
> > +};
>
> You could use the rapl_domain_id as your ::config space.
>
>
> > +static unsigned int power_unit_divisor;
> > +static unsigned int energy_unit_divisor;
> > +static unsigned int time_unit_divisor;
> > +
> > +enum unit_type {
> > + POWER_UNIT,
> > + ENERGY_UNIT,
> > + TIME_UNIT
> > +};
> > +static u64 rapl_unit_xlate(enum unit_type type, u64 value, int action)
> > +{
> > + u64 divisor;
> > +
> > + switch (type) {
> > + case POWER_UNIT:
> > + divisor = power_unit_divisor;
> > + break;
> > + case ENERGY_UNIT:
> > + divisor = energy_unit_divisor;
> > + break;
> > + case TIME_UNIT:
> > + divisor = time_unit_divisor;
> > + break;
> > + default:
> > + return 0;
> > + };
> > +
> > + if (action)
> > + return value * divisor; /* value is from users */
> > + else
> > + return div64_u64(value, divisor); /* value is from MSR */
> > +}
>
> Please see the comment down by rapl_check_unit(), this is just too wrong
> to live.
>
> > +/* show the energy status, in Jelous */
> > +static int rapl_read_energy(struct rapl_domain *domain)
> > +{
> > + u64 value;
> > + u32 msr = domain->msrs.status;
> > +
> > + rdmsrl(msr, value);
> > + return rapl_unit_xlate(ENERGY_UNIT, value, 0);
> > +}
> > +
> > +static void rapl_event_update(struct perf_event *event)
> > +{
> > + s64 prev;
> > + u64 now;
> > + struct rapl_domain *domain = to_rapl_domain(event->pmu);
> > +
> > + now = rapl_read_energy(domain);
>
> So I had to get the Intel SDM because your driver lacks all useful
> information, and I learned that the RAPL status MSRs contain 32 bits.
>
> So you get those 32 bits, divide them by some number,
>
> > + prev = local64_xchg(&event->hw.prev_count, now);
> > + local64_add(now - prev, &event->count);
>
> And then expect that to work?
>
rapl_read_energy first reads energy status from MSR and then invokes
rapl_unit_xlate to translate it into Joules.
For example, on the laptop I tested, the energy unit bits is 0x10, which
means that the energy unit is 1/65536 Joule.
So I need to divide the value read from MSR by 65536 to calculate how
many Joules of energy are cost.

But this reveals a problem. If the task is scheduled out with energy
consumption less than 1 Joule, we failed to record it.

IMO, a new callback should be introduced so that I can save the MSR
value first and translate it to Joule when the task exits. Or just do
the translation in user space.

what do you think?

> I don't think so..
>
> > +}
> > +
> > +static void rapl_event_start(struct perf_event *event, int flags)
> > +{
> > + struct rapl_domain *domain = to_rapl_domain(event->pmu);
> > +
> > + local64_set(&event->hw.prev_count, rapl_read_energy(domain));
> > + perf_swevent_start_hrtimer(event);
> > +}
> > +
> > +static void rapl_event_stop(struct perf_event *event, int flags)
> > +{
> > + perf_swevent_cancel_hrtimer(event);
> > + rapl_event_update(event);
> > +}
>
> > +static int rapl_pmu_event_init(struct perf_event *event,
> > + enum rapl_domain_id id)
> > +{
> > + struct rapl_domain *domain = &(rapl_domains[id]);
> > +
> > + if (event->attr.type != PERF_TYPE_SOFTWARE)
> > + return -ENOENT;
> > +
> > + if (event->attr.config != domain->event_id)
> > + return -ENOENT;
> > +
> > + /* Do periodecal update every second */
> > + event->attr.freq = 1;
> > + event->attr.sample_period = 1;
> > +
> > + perf_swevent_init_hrtimer(event);
> > +
> > + return 0;
> > +}
>
> That's just wrong.. the reason you're wanting to have this timer is to
> avoid the RAPL MSRs from overflowing and you loosing offsets, right?
>
> But the above is actually forcing the event to create samples on a
> totally unrelated time base.
>
> RAPL should fail to create a sampling event since it doesn't have the
> capability to trigger overflow interrupts based on its events.
>
> If you want a timer, add one, but don't do this.
>
> If you expect you actually want to sample, use this event as part of a
> group and add a sampling event in there and use PERF_FORMAT_GROUP, Matt
> was working on patches to make perf-record capable of this.
>
perf stat doesn't support -g parameter.

BTW, as I need a per task hrtimer, can I make use of the
hw_perf_event.hrtimer in intel_rapl driver, without touching the perf
hrtimer interfaces?

> > +static int rapl_check_unit(void)
>
> Shouldn't that be called: rapl_init_unit()? You're not actually
> verifying anything, you're setting-up state.
>
Agree.

thanks,
rui
> > +{
> > + u64 output;
> > + u32 value;
> > +
> > + rdmsrl(MSR_RAPL_POWER_UNIT, output);
> > +
> > + /* energy unit: 1/enery_unit_divisor Joules */
> > + value = (output & ENERGY_UNIT_MASK) >> ENERGY_UNIT_OFFSET;
> > + energy_unit_divisor = 1 << value;
> > +
> > + /* power unit: 1/power_unit_divisor Watts */
> > + value = (output & POWER_UNIT_MASK) >> POWER_UNIT_OFFSET;
> > + power_unit_divisor = 1 << value;
> > +
> > + /* time unit: 1/time_unit_divisor Seconds */
> > + value =(output & TIME_UNIT_MASK) >> TIME_UNIT_OFFSET;
> > + time_unit_divisor = 1 << value;
>
> So you're saying these factors are powers-of-two, please look at
> rapl_unit_xlate and try again.
>
> +
> > + return 0;
> > +}
> > +
> > +static int __init intel_rapl_init(void)
> > +{
> > + enum rapl_domain_id id;
> > +
> > + /*
> > + * RAPL features are only supported on processors have a CPUID
> > + * signature with DisplayFamily_DisplayModel of 06_2AH, 06_2DH
> > + */
> > + if (boot_cpu_data.x86 != 0x06)
> > + return -ENODEV;
> > +
> > + if (boot_cpu_data.x86_model == 0x2A)
> > + rapl_domains[RAPL_DOMAIN_PP1].valid = 1;
> > + else if (boot_cpu_data.x86_model == 0x2D)
> > + rapl_domains[RAPL_DOMAIN_DRAM].valid = 1;
> > + else
> > + return -ENODEV;
>
> Names please, again 06_2[AD] is useless we could have surmised that by
> reading the code, nobody knows which part that is.
>
> a += 4; /* increment by 4 */
>
> quality comments here.
>
> > + if (rapl_check_unit())
> > + return -ENODEV;
> > +
> > + for(id = 0; id < RAPL_DOMAIN_MAX; id++)
> > + if (rapl_domains[id].valid)
> > + perf_pmu_register(&(rapl_domains[id].pmu), rapl_domains[id].pmu.name, PERF_TYPE_SOFTWARE);
>
> Uhm, hell no!, you get to use type = -1.
>
> > + return 0;
> > +}
> > +




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/