Re: [PATCH v3 1/2] exec: add PR_HIDE_SELF_EXE prctl

From: Christian Brauner
Date: Mon Jan 30 2023 - 05:06:18 EST


On Mon, Jan 30, 2023 at 10:53:31AM +0100, Christian Brauner wrote:
> On Sun, Jan 29, 2023 at 01:12:45PM -0500, Colin Walters wrote:
> >
> >
> > On Sun, Jan 29, 2023, at 11:58 AM, Christian Brauner wrote:
> > > On Sun, Jan 29, 2023 at 08:59:32AM -0500, Colin Walters wrote:
> > >>
> > >>
> > >> On Wed, Jan 25, 2023, at 11:30 AM, Giuseppe Scrivano wrote:
> > >> >
> > >> > After reading some comments on the LWN.net article, I wonder if
> > >> > PR_HIDE_SELF_EXE should apply to CAP_SYS_ADMIN in the initial user
> > >> > namespace or if in this case root should keep the privilege to inspect
> > >> > the binary of a process. If a container runs with that many privileges
> > >> > then it has already other ways to damage the host anyway.
> > >>
> > >> Right, that's what I was trying to express with the "make it work the same as map_files". Hiding the entry entirely even for initial-namespace-root (real root) seems like it's going to potentially confuse profiling/tracing/debugging tools for no good reason.
> > >
> > > If this can be circumvented via CAP_SYS_ADMIN
> >
> > To be clear, I'm proposing CAP_SYS_ADMIN in the current user namespace at the time of the prctl(). (Or if keeping around a reference just for this is too problematic, perhaps hardcoding to the init ns)
>
> Oh no, I fully understand. The point was that the userspace fix protects
> even against attackers with CAP_SYS_ADMIN in init_user_ns. And that was
> important back then and is still relevant today for some workloads.
>
> For unprivileged containers where host and container are separate by a
> meaningful user namespace boundary this whole mitigation is irrelevant
> as the binary can't be overwritten.
>
> >
> > A process with CAP_SYS_ADMIN in a child namespace would still not be able to read the binary.
> >
> > > then this mitigation
> > > becomes immediately way less interesting because the userspace
> > > mitigation we came up with protects against CAP_SYS_ADMIN as well
> > > without any regression risk.
> >
> > The userspace mitigation here being "clone self to memfd"? But that's a sufficiently ugly workaround that it's created new problems; see https://lwn.net/Articles/918106/
>
> But this is a problem with the memfd api not with the fix. Following the
> thread the ability to create executable memfds will stay around. As it
> should be given how long this has been supported. And they have backward
> compatibility in mind which is great.

Following up from yesterday's promise to check with the criu org I'm
part of: this is going to break criu unfortunately as it dumps (and
restores) /proc/self/exe. Even with an escape hatch we'd still risk
breaking it. Whereas again, the memfd solution doesn't cause those
issues.

Don't get me wrong, it's pretty obvious that I was pretty supportive of
this fix, especially because it looked rather simple, but this is turning
out to be less simple than we thought. I don't think that this is worth
it given the functioning fixes we already have.

The good thing is that, even if it will take longer, Aleksa's patchset
will provide a more general solution by making it possible for
runc/crun/lxc to open the target binary with a restricted upgrade mask,
making it impossible to reopen the binary read-write. This won't break
criu, will fix this issue, and is generally useful.
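
For reference, a minimal sketch of the "clone self to memfd" userspace
mitigation discussed in this thread. This is an illustration under
stated assumptions, not the code any particular runtime ships: the
helper name clone_self_to_memfd() and the memfd name "self-exe-clone"
are made up for this example. It copies /proc/self/exe into a memfd and
seals it so the copy can no longer be modified.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the running binary into a sealed memfd.
 * Returns the memfd on success, -1 on failure.
 */
static int clone_self_to_memfd(void)
{
	struct stat st;
	off_t off = 0;
	int src, memfd;

	src = open("/proc/self/exe", O_RDONLY | O_CLOEXEC);
	if (src < 0)
		return -1;
	if (fstat(src, &st) < 0)
		goto err_src;

	/* MFD_ALLOW_SEALING is required, otherwise F_ADD_SEALS fails. */
	memfd = memfd_create("self-exe-clone", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (memfd < 0)
		goto err_src;

	/* Copy the whole binary; sendfile() advances 'off' for us. */
	while (off < st.st_size) {
		ssize_t n = sendfile(memfd, src, &off, (size_t)(st.st_size - off));
		if (n < 0 && errno == EINTR)
			continue;
		if (n <= 0)
			goto err_memfd;
	}

	/* Forbid writes, resizing, and any later change to the seal set. */
	if (fcntl(memfd, F_ADD_SEALS, F_SEAL_WRITE | F_SEAL_SHRINK |
				      F_SEAL_GROW | F_SEAL_SEAL) < 0)
		goto err_memfd;

	close(src);
	return memfd;

err_memfd:
	close(memfd);
err_src:
	close(src);
	return -1;
}
```

A caller would then typically re-execute itself from the sealed copy,
for example with fexecve(fd, argv, environ), so that /proc/self/exe of
the resulting process refers to the memfd rather than the on-disk
binary.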