Re: [PATCH v2] perf: Allow restricted kernel breakpoints on user addresses

From: Marco Elver
Date: Mon Jan 30 2023 - 02:00:14 EST


On Fri, 27 Jan 2023 at 19:14, Mark Rutland <mark.rutland@xxxxxxx> wrote:
>
> Hi Marco,
>
> Apologies for having not replies on v1...
>
> On Fri, Jan 27, 2023 at 05:24:09PM +0100, Marco Elver wrote:
> > Allow the creation of restricted breakpoint perf events that also fire
> > in the kernel (perf_event_attr::exclude_kernel=0), if:
> >
> > 1. No sample information is requested; samples may contain IPs,
> > registers, or other information that may disclose kernel addresses.
> >
> > 2. The breakpoint (viz. data watchpoint) is on a user address.
>
> I think there's a potential problem here w.r.t. what constitutes a "user
> address". Below, the patch assumes that any address which access_ok() is happy
> with is a user address, but that's not always the case, and it's not
> necessarily always safe to allow watchpoints on such addresses.

Isn't that a deficiency with access_ok()?

https://www.kernel.org/doc/html/latest/core-api/mm-api.html#c.access_ok
"Checks if a pointer to a block of memory in user space is valid. [...]"

> For example, UEFI runtime services may live in low adddresses below
> TASK_SIZE_MAX, and there are times when we run code in an idmap (or other
> low-half mapping) when we cannot safely take an exception for things like idle,
> suspend, kexec, pagetable rewriting on arm64, etc.
>
> So I think this may introduce functional issues (e.g. a mechanism to crash the
> kernel) in addition to any potential information disclosure, and I would not
> want this to be generally available to unprivileged users.
>
> Most of those happen in kernel threads, but they can also happen in the context
> of user threads (e.g. if triggering suspend/idle via sysfs), so special care
> will be needed, as above.

These are good points.

> > The rules constrain the allowable perf events such that no sensitive
> > kernel information can be disclosed.
> >
> > Despite no explicit kernel information disclosure, the following
> > questions may need answers:
> >
> > 1. Q: Is obtaining information that the kernel accessed a particular
> > user's known memory location revealing new information?
> >
> > A: Given the kernel's user space ABI, there should be no "surprise
> > accesses" to user space memory in the first place.
>
> I think that may be true for userspace, but not true for other transient
> mappings in the low half of the address space. Ignoring the functional concern
> above, for idmap'd code this would at least provide a mechanism to probe for
> the phyiscal address of that code (and by extension, reveal the phyiscal
> location of the entire kernel).

This again feels like a deficiency with access_ok(). Is there a better
primitive than access_ok(), or can we have something that gives us the
guarantee that whatever it says is "ok" is a userspace address?

> > 2. Q: Does causing breakpoints on user memory accesses by the kernel
> > potentially impact timing in a sensitive way?
> >
> > A: Since hardware breakpoints trigger regardless of the state of
> > perf_event_attr::exclude_kernel, but are filtered in the perf
> > subsystem, this possibility already exists independent of the
> > proposed change.
>
> Hmm... arm64's HW breakpoints and watchpoints have HW privilege filters, so I'm
> not sure the above statement is generally/necessarily true.

Right, I can see this being a valid concern on those architectures
that do support HW privilege filters.

> > Motivation: Data breakpoints on user addresses that also fire in the
> > kernel provide complete coverage to track and debug accesses, not just
> > in user space but also through the kernel. For example, tracking where
> > user space invokes syscalls with pointers to specific memory.
> >
> > Breakpoints can be used for more complex dynamic analysis, such as race
> > detection, memory-safety error detection, or data-flow analysis. Larger
> > deployment by linking such dynamic analysis into binaries in production
> > only becomes possible when no additional capabilities are required by
> > unprivileged users. To improve coverage, it should then also be possible
> > to enable breakpoints on user addresses that fire in the kernel with no
> > additional capabilities.
>
> I can understand the argument for watchpoints (modulo my concerns above), but
> there's no need to support instruction breakpoints, right? i.e. there's no
> legitimate reason for a user to want to monitor a given user address
> system-wide, regardless of what's running?
>
> IIUC this only makes sense for watchpoints, and only in the context of a given
> task.

Right, there shouldn't be a need for instruction breakpoints, the
kernel shouldn't be executing user code.

Thanks,
-- Marco