Re: [PATCH v1 1/2] perf/core: Use sysctl to turn on/off dropping leaked kernel samples

From: Jin, Yao
Date: Mon Jun 18 2018 - 02:55:42 EST

On 6/15/2018 7:36 PM, Mark Rutland wrote:
On Fri, Jun 15, 2018 at 06:03:22PM +0800, Jin Yao wrote:
When doing sampling, for example:

perf record -e cycles:u ...

On workloads that do a lot of kernel entries/exits we see kernel
samples even though :u is specified. This is caused by interrupt skid.

This might be a security issue because it can leak kernel addresses even
though kernel sampling support is disabled.
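
For context, "cycles:u" corresponds to a perf_event_attr with
exclude_kernel set. A minimal userspace sketch (error handling and the
workload elided; the helper name is only for illustration):

/* Sketch: what "cycles:u" amounts to at the perf_event_open() level. */
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_cycles_user_only(pid_t pid)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = 100003;
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID;
	attr.exclude_kernel = 1;	/* the ":u" modifier */
	attr.exclude_hv = 1;

	/*
	 * Despite exclude_kernel, the PERF_SAMPLE_IP in the resulting
	 * records can hold a kernel address when the overflow interrupt
	 * is taken after the task has already entered the kernel (skid).
	 */
	return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
}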

A patch, "perf/core: Drop kernel samples even though :u is specified",
was posted last year but was reverted because it introduced a
regression that broke rr, which uses sampling events to receive a
signal on overflow. These signals are critical to the correct
operation of rr.

See '6a8a75f32357 ("Revert "perf/core: Drop kernel samples even
though :u is specified"")' for details.

Now the idea is to use sysctl to control the dropping of leaked
kernel samples.

/sys/devices/cpu/perf_allow_sample_leakage:

0 - default, drop the leaked kernel samples.
1 - don't drop the leaked kernel samples.
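
To make that concrete, the knob plus the check it gates could be wired
up roughly as sketched below, against kernel/events/core.c. The
attribute wiring and the variable name are assumptions for
illustration, not the posted patch:

/* Sketch only -- wiring and names are assumptions, not the posted patch. */
static int perf_allow_sample_leakage __read_mostly;	/* 0: drop leaked samples */

static ssize_t perf_allow_sample_leakage_show(struct device *dev,
					      struct device_attribute *attr,
					      char *buf)
{
	return sprintf(buf, "%d\n", READ_ONCE(perf_allow_sample_leakage));
}

static ssize_t perf_allow_sample_leakage_store(struct device *dev,
					       struct device_attribute *attr,
					       const char *buf, size_t count)
{
	unsigned long val;

	if (kstrtoul(buf, 0, &val) || val > 1)
		return -EINVAL;

	WRITE_ONCE(perf_allow_sample_leakage, val);
	return count;
}
static DEVICE_ATTR_RW(perf_allow_sample_leakage);
/* dev_attr_perf_allow_sample_leakage.attr would then go into the cpu PMU's attribute group. */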

Does this need to be conditional at all?

At least for sampling the GPRs, we could do something like below
unconditionally, which seems sufficient for my test cases.

Mark.

---->8----
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 67612ce359ad..79a21531d57c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6359,6 +6359,24 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 	return callchain ?: &__empty_callchain;
 }
 
+static struct pt_regs *perf_get_sample_regs(struct perf_event *event, struct pt_regs *regs)
+{
+	/*
+	 * Due to interrupt latency (AKA "skid"), we may enter the kernel
+	 * before taking an overflow, even if the PMU is only counting user
+	 * events.
+	 *
+	 * If we're not counting kernel events, always use the user regs when
+	 * sampling.
+	 *
+	 * TODO: how does this interact with guest sampling?
+	 */
+	if (event->attr.exclude_kernel && !user_mode(regs))
+		return task_pt_regs(current);
+
+	return regs;
+}
+
 void perf_prepare_sample(struct perf_event_header *header,
 			 struct perf_sample_data *data,
 			 struct perf_event *event,
@@ -6366,6 +6384,8 @@ void perf_prepare_sample(struct perf_event_header *header,
 {
 	u64 sample_type = event->attr.sample_type;
 
+	regs = perf_get_sample_regs(event, regs);
+
 	header->type = PERF_RECORD_SAMPLE;
 	header->size = sizeof(*header) + event->header_size;


Hi Mark,

Thanks for providing the patch. I understand this approach.

In my opinion, the skid window runs from counter overflow to interrupt delivery. If the skid window is so *big* that it crosses the user -> kernel boundary, the sample is not very useful, so personally I'd prefer to drop such samples.
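
Concretely, "drop" would amount to a check like the sketch below at the
point where the sample record is emitted, consulting the
perf_allow_sample_leakage knob sketched earlier. This is an assumption
on my side rather than the posted patch, and where exactly the check
sits determines whether signal-on-overflow users such as rr still see
the overflow:

/* Sketch: skip the record for a leaked sample when the knob is 0. */
static bool perf_skip_leaked_sample(struct perf_event *event,
				    struct pt_regs *regs)
{
	if (!event->attr.exclude_kernel || user_mode(regs))
		return false;

	return !READ_ONCE(perf_allow_sample_leakage);
}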

Thanks
Jin Yao