Re: [PATCH v3 1/2] exec: add PR_HIDE_SELF_EXE prctl

From: Andrei Vagin
Date: Fri Feb 24 2023 - 19:27:27 EST


On Mon, Jan 30, 2023 at 11:06:02AM +0100, Christian Brauner wrote:
> On Mon, Jan 30, 2023 at 10:53:31AM +0100, Christian Brauner wrote:
> > On Sun, Jan 29, 2023 at 01:12:45PM -0500, Colin Walters wrote:
> > >
> > >
> > > On Sun, Jan 29, 2023, at 11:58 AM, Christian Brauner wrote:
> > > > On Sun, Jan 29, 2023 at 08:59:32AM -0500, Colin Walters wrote:
> > > >>
> > > >>
> > > >> On Wed, Jan 25, 2023, at 11:30 AM, Giuseppe Scrivano wrote:
> > > >> >
> > > >> > After reading some comments on the LWN.net article, I wonder if
> > > >> > PR_HIDE_SELF_EXE should apply to CAP_SYS_ADMIN in the initial user
> > > >> > namespace or if in this case root should keep the privilege to inspect
> > > >> > the binary of a process. If a container runs with that many privileges
> > > >> > then it has already other ways to damage the host anyway.
> > > >>
> > > >> Right, that's what I was trying to express with the "make it work the same as map_files". Hiding the entry entirely even for initial-namespace-root (real root) seems like it's going to potentially confuse profiling/tracing/debugging tools for no good reason.
> > > >
> > > > If this can be circumvented via CAP_SYS_ADMIN
> > >
> > > To be clear, I'm proposing CAP_SYS_ADMIN in the current user namespace at the time of the prctl(). (Or if keeping around a reference just for this is too problematic, perhaps hardcoding to the init ns)
> >
> > Oh no, I fully understand. The point was that the userspace fix protects
> > even against attackers with CAP_SYS_ADMIN in init_user_ns. And that was
> > important back then and is still relevant today for some workloads.
> >
> > For unprivileged containers where host and container are separated by a
> > meaningful user namespace boundary this whole mitigation is irrelevant
> > as the binary can't be overwritten.
> >
> > >
> > > A process with CAP_SYS_ADMIN in a child namespace would still not be able to read the binary.
> > >
> > > > then this mitigation
> > > > becomes immediately way less interesting because the userspace
> > > > mitigation we came up with protects against CAP_SYS_ADMIN as well
> > > > without any regression risk.
> > >
> > > The userspace mitigation here being "clone self to memfd"? But that's a sufficiently ugly workaround that it's created new problems; see https://lwn.net/Articles/918106/
> >
> > But this is a problem with the memfd api not with the fix. Following the
> > thread the ability to create executable memfds will stay around. As it
> > should be given how long this has been supported. And they have backward
> > compatibility in mind which is great.
>
> Following up from yesterday's promise to check with the criu org I'm
> part of: this is going to break criu unfortunately as it dumps (and
> restores) /proc/self/exe. Even with an escape hatch we'd still risk
> breaking it. Whereas again, the memfd solution doesn't cause those
> issues.
>
> Don't get me wrong it's pretty obvious that I was pretty supportive of
> this fix especially because it looked rather simple but this is turning
> out to be less simple than we thought. I don't think that this is worth
> it given the functioning fixes we already have.

btw: can we use PR_SET_MM_EXE_FILE or PR_SET_MM_MAP (prctl_map.exe_fd) to
set a dummy exe? Will it have the required effect?

>
> The good thing is that - even if it will take longer - Aleksa's
> patchset will provide a more general solution by making it possible for
> runc/crun/lxc to open the target binary with a restricted upgrade mask
> making it impossible to open the binary read-write again. This won't
> break criu and will fix this issue and is generally useful.