Re: [PATCH v3 1/2] exec: add PR_HIDE_SELF_EXE prctl

From: Giuseppe Scrivano
Date: Tue Jan 31 2023 - 09:18:14 EST


Christian Brauner <brauner@xxxxxxxxxx> writes:

> On Mon, Jan 30, 2023 at 10:53:31AM +0100, Christian Brauner wrote:
>> On Sun, Jan 29, 2023 at 01:12:45PM -0500, Colin Walters wrote:
>> >
>> >
>> > On Sun, Jan 29, 2023, at 11:58 AM, Christian Brauner wrote:
>> > > On Sun, Jan 29, 2023 at 08:59:32AM -0500, Colin Walters wrote:
>> > >>
>> > >>
>> > >> On Wed, Jan 25, 2023, at 11:30 AM, Giuseppe Scrivano wrote:
>> > >> >
>> > >> > After reading some comments on the LWN.net article, I wonder if
>> > >> > PR_HIDE_SELF_EXE should apply to CAP_SYS_ADMIN in the initial user
>> > >> > namespace or if in this case root should keep the privilege to inspect
>> > >> > the binary of a process. If a container runs with that many privileges
>> > >> > then it has already other ways to damage the host anyway.
>> > >>
>> > >> Right, that's what I was trying to express with the "make it
>> > >> work the same as map_files". Hiding the entry entirely even
>> > >> for initial-namespace-root (real root) seems like it's going to
>> > >> potentially confuse profiling/tracing/debugging tools for no
>> > >> good reason.
>> > >
>> > > If this can be circumvented via CAP_SYS_ADMIN
>> >
>> > To be clear, I'm proposing CAP_SYS_ADMIN in the current user
>> > namespace at the time of the prctl(). (Or if keeping around a
>> > reference just for this is too problematic, perhaps hardcoding to
>> > the init ns)
>>
>> Oh no, I fully understand. The point was that the userspace fix protects
>> even against attackers with CAP_SYS_ADMIN in init_user_ns. And that was
>> important back then and is still relevant today for some workloads.
>>
>> For unprivileged containers where host and container are separated by a
>> meaningful user namespace boundary this whole mitigation is irrelevant
>> as the binary can't be overwritten.
>>
>> >
>> > A process with CAP_SYS_ADMIN in a child namespace would still not be able to read the binary.
>> >
>> > > then this mitigation
>> > > becomes immediately way less interesting because the userspace
>> > > mitigation we came up with protects against CAP_SYS_ADMIN as well
>> > > without any regression risk.
>> >
>> > The userspace mitigation here being "clone self to memfd"? But that's a sufficiently ugly workaround that it's created new problems; see https://lwn.net/Articles/918106/
>>
>> But this is a problem with the memfd api not with the fix. Following the
>> thread the ability to create executable memfds will stay around. As it
>> should be given how long this has been supported. And they have backward
>> compatibility in mind which is great.
>
> Following up from yesterday's promise to check with the criu org I'm
> part of: this is going to break criu unfortunately as it dumps (and
> restores) /proc/self/exe. Even with an escape hatch we'd still risk
> breaking it. Whereas again, the memfd solution doesn't cause those
> issues.
>
> Don't get me wrong it's pretty obvious that I was pretty supportive of
> this fix especially because it looked rather simple but this is turning
> out to be less simple than we thought. I don't think that this is worth
> it given the functioning fixes we already have.
>
> The good thing is that - even if it will take longer - Aleksa's
> patchset will provide a more general solution by making it possible for
> runc/crun/lxc to open the target binary with a restricted upgrade mask
> making it impossible to open the binary read-write again. This won't
> break criu and will fix this issue and is generally useful.

I was not aware that running with CAP_SYS_ADMIN in the initial userns
was considered a use case, but in that case don't we need to protect
/proc/$PID/map_files as well, or do we rely only on randomize_va_space?
It is more difficult to guess the entry names, but we can still exec
these files and grab a reference to them.
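
To make the "guess the name" part concrete: entries under
/proc/$PID/map_files are named after the start and end address of each
file-backed mapping, in hex. A minimal sketch of how such a name is
built follows; map_files_name() is a hypothetical helper made up for
this mail, and the addresses in the usage note are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not an existing API): format the name of a
 * /proc/<pid>/map_files entry for a mapping [start, end).
 * Entries are named "<start>-<end>" in lowercase hex, no "0x" prefix.
 * Returns the snprintf() result, i.e. the length of the full name.
 */
int map_files_name(char *buf, size_t len,
		   unsigned long start, unsigned long end)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%lx-%lx", start, end);
}
```

For example, a mapping at 0x400000-0x452000 would show up as
"400000-452000". With ASLR the attacker has to guess both addresses,
which is the only thing randomize_va_space buys us here.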

The current patch I've proposed is probably too big a hammer for the
small issue we really have:

Other processes from the container are already blocked by PR_SET_DUMPABLE
unless CAP_SYS_PTRACE is granted; but if CAP_SYS_PTRACE is granted then the
setup seems already vulnerable today, since processes from the container
can just read the /proc/PID/map_files files without even requiring the
exec trick.
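
For reference, this is roughly what clearing the dumpable flag looks
like from userspace. It is only a sketch: set_dumpable() is a
hypothetical wrapper made up for this mail, not what any particular
runtime does; the prctl() options themselves are the standard ones.

```c
#include <assert.h>
#include <sys/prctl.h>

/*
 * Hypothetical wrapper: set the dumpable flag of the calling process
 * and return the value the kernel reports afterwards, or -1 on error.
 * Only 0 (not dumpable) and 1 (dumpable) can be set through prctl().
 */
int set_dumpable(int value)
{
	assert(value == 0 || value == 1);
	if (prctl(PR_SET_DUMPABLE, (unsigned long)value, 0UL, 0UL, 0UL) != 0)
		return -1;
	return prctl(PR_GET_DUMPABLE, 0UL, 0UL, 0UL, 0UL);
}
```

A runtime would call set_dumpable(0) before joining the container, so
that processes without CAP_SYS_PTRACE fail the ptrace access checks on
its /proc/PID entries.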

So the only hole left, that I can see, is that the container runtime
can be tricked to exec /proc/self/exe (or /proc/self/map_files/*) and
from there open a reference to the binary.

Could we just restrict the usage to the current thread group? That
won't affect in any way other processes.

The patch won't be much more complicated; we just need to amend it with
the following fix:

diff --git a/fs/proc/base.c b/fs/proc/base.c
index e9127084b82a..2f5c5ed2dae8 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1723,6 +1723,7 @@ static int proc_exe_link(struct dentry *dentry, struct path *exe_path)
 {
 	struct task_struct *task;
 	struct file *exe_file;
+	bool is_same_tgroup;
 	long hide_self_exe;
 
 	task = get_proc_task(d_inode(dentry));
@@ -1730,8 +1731,9 @@ static int proc_exe_link(struct dentry *dentry, struct path *exe_path)
 		return -ENOENT;
 	exe_file = get_task_exe_file(task);
 	hide_self_exe = task_hide_self_exe(task);
+	is_same_tgroup = same_thread_group(current, task);
 	put_task_struct(task);
-	if (hide_self_exe)
+	if (hide_self_exe && is_same_tgroup)
 		return -EPERM;
 	if (exe_file) {
 		*exe_path = exe_file->f_path;

Would that be sufficient or are there other ways to attack it?
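
To spell out the intended semantics, here is a small userspace model
of the amended check. It is a sketch, not kernel code: struct
model_task and exe_link_model() are made up for this mail, and
comparing tgid values is my stand-in for what same_thread_group()
decides in the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the bits of task_struct that matter here. */
struct model_task {
	int tgid;		/* thread group id */
	bool hide_self_exe;	/* set through the proposed prctl */
};

/*
 * Model of the amended proc_exe_link() decision: the link is refused
 * only when the target asked for hiding AND the reader belongs to the
 * target's own thread group.
 */
int exe_link_model(const struct model_task *reader,
		   const struct model_task *target)
{
	assert(reader != NULL && target != NULL);
	if (target->hide_self_exe && reader->tgid == target->tgid)
		return -EPERM;
	return 0;
}
```

So the runtime resolving its own /proc/self/exe gets -EPERM, while a
different thread group (criu, a debugger, a profiler) still resolves
the link as before.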

Given the premise about CAP_SYS_ADMIN (and the even looser
CAP_CHECKPOINT_RESTORE in the *current* user namespace), I think we
probably need a similar protection for /proc/PID/map_files.

Regards,
Giuseppe