Re: [PATCH] mm: mmap_lock: fix use-after-free race and css ref leak in tracepoints

From: Axel Rasmussen
Date: Fri Dec 04 2020 - 12:48:08 EST


On Fri, Dec 4, 2020 at 8:36 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>
> On 12/2/20 2:11 AM, Shakeel Butt wrote:
> > On Tue, Dec 1, 2020 at 5:07 PM Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
> >>
> >> On Tue, 1 Dec 2020 16:36:32 -0800
> >> Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
> >>
> >> > SGTM but note that usually Andrew squash all the patches into one
> >> > before sending to Linus. If you plan to replace the path buffer with
> >> > integer IDs then no need to spend time fixing buffer related bug.
> >>
> >> I don't think Andrew squashes all the patches. I believe he sends Linus
> >> a patch series.
> >
> > I am talking about the patch and the following fixes to that patch.
> > Those are usually squashed into one patch.
>
> Yeah, if there's a way forward that doesn't need to construct full path on each
> event and the associated complexity and just use an ID, let's convert to the ID
> and squash it, for less churn. Especially if there are other existing
> tracepoints that use the ID.

The thing I worry about is that it being inconvenient will prevent
folks from using it to send us data on how mmap_lock behaves under
their workloads. Anecdotally, I talked to a few teams within Google,
and it seems those using containers use paths instead of IDs to refer
to them, and they don't have existing infrastructure to keep track of
the mapping. So to collect data from those workloads, we'd have to
wait for them to build such a thing, vs. just asking them to run a
bash one-liner. I haven't done a wider survey, but I suspect the same
is true for users of e.g. Docker or Kubernetes.

>
> If there's further (somewhat orthogonal) work to make the IDs easier for
> userspace, it can be added on top later, but really shouldn't need to add the
> current complex solution only to remove it later?

I'm on board with Shakeel's idea to export this on cgroup
creation/deletion instead, since it lets us keep the complexity in
exactly one place. I think such a thing would use the same code I'm
proposing now, though, so we wouldn't just be deleting it if we merge
it. Also, before doing that I think it's worth identifying a second
user within the kernel (maybe the writeback tracepoint mentioned
earlier in the thread?), so doing it incrementally seems reasonable to
me.

This plan is also in-line with existing stuff we provide. Histogram
triggers provide utilities like ".execname" (PID -> execname) or
".sym" (address -> symbol) for convenience. Sure, userspace could
figure these things out for itself, but it's convenient to provide
them directly, and it's not so bad since we have exactly one place in
the tracing infrastructure that knows how to do these translations.

Actually, maybe a ".cgpath" variant would be better than a separate
tracepoint. I haven't looked at what either approach would require in
detail; maybe another reason to do this iteratively. :)

>
> Thanks,
> Vlastimil