Re: another use-after-free in ep_remove_wait_queue()

From: Suren Baghdasaryan
Date: Thu Jan 19 2023 - 16:09:45 EST


On Wed, Jan 18, 2023 at 7:06 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Fri, Jan 13, 2023 at 9:52 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> >
> > On Thu, Jan 12, 2023 at 6:26 PM Munehisa Kamata <kamatam@xxxxxxxxxx> wrote:
> > >
> > > On Thu, 2023-01-12 22:01:24 +0000, Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > > >
> > > > On Mon, Jan 9, 2023 at 7:06 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Mon, Jan 9, 2023 at 5:33 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > On Sun, Jan 8, 2023 at 3:49 PM Hillf Danton <hdanton@xxxxxxxx> wrote:
> > > > > > >
> > > > > > > On 8 Jan 2023 14:25:48 -0800 PM Munehisa Kamata <kamatam@xxxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > That patch survived the repro in my original post, however, the waker
> > > > > > > > (rmdir) was getting stuck until a file descriptor of the epoll instance or
> > > > > > > > the pressure file got closed. So, if the following modified repro runs
> > > > > > > > with the patch, the waker never returns (unless the sleeper gets killed)
> > > > > > > > while holding cgroup_mutex. This doesn't seem to be what you expected to
> > > > > > > > see with the patch, does it? Even wake_up_all() does not appear to empty
> > > > > > > > the queue, but wake_up_pollfree() does.
> > > > > > >
> > > > > > > Thanks for your testing. And the debugging completes.
> > > > > > >
> > > > > > > Mind sending a patch with wake_up_pollfree() folded?
> > > > > >
> > > > > > I finally had some time to look into this issue. I don't think
> > > > > > delaying destruction in psi_trigger_destroy() because there are still
> > > > > > users of the trigger as Hillf suggested is a good way to go. Before
> > > > > > [1] correct trigger destruction was handled using a
> > > > > > psi_trigger.refcount. For some reason I thought it's not needed
> > > > > > anymore when we placed one-trigger-per-file restriction in that patch,
> > > > > > so I removed it. Obviously that was a wrong move, so I think the
> > > > > > cleanest way would be to bring back the refcounting. That way the last
> > > > > > user of the trigger (either psi_trigger_poll() or psi_fop_release())
> > > > > > will free the trigger.
> > > > > > I'll check once more to make sure I did not miss anything and if there
> > > > > > are no objections, will post a fix.
> > > > >
> > > > > Uh, I recalled now why refcounting was not helpful here. I'm making
> > > > > the same mistake of thinking that poll_wait() blocks until the call to
> > > > > wake_up() which is not the case. Let me think if there is anything
> > > > > better than wake_up_pollfree() for this case.
> > > >
> > > > Hi Munehisa,
> > > > Sorry for the delay. I was trying to reproduce the issue but even
> > > > after adding a delay before ep_remove_wait_queue() it did not happen.
> > >
> > > Hi Suren,
> > >
> > > Thank you for your help here.
> > >
> > > Just in case, do you have KASAN enabled in your config? If not, this may
> > > just silently corrupt a certain memory location and not immediately
> > > followed by obvious messages or noticeable event like oops.
> >
> > Yes, KASAN was enabled in my build.
> >
> > >
> > > > One thing about wake_up_pollfree() solution that does not seem right
> > > > to me is this comment at
> > > > https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L253:
> > > >
> > > > `In the very rare cases where a ->poll() implementation uses a
> > > > waitqueue whose lifetime is tied to a task rather than to the 'struct
> > > > file' being polled, this function must be called before the waitqueue
> > > > is freed...`
> > > >
> > > > In our case we free the waitqueue from cgroup_pressure_release(),
> > > > which is the handler for `release` operation on cgroup psi files. The
> > > > other place calling psi_trigger_destroy() is psi_fop_release(), which
> > > > is also tied to the lifetime to the psi files. Therefore the lifetime
> > > > of the trigger's waitqueue is tied to the lifetime of the files and
> > > > IIUC, we should not be required to use wake_up_pollfree().
> > > > Could you please post your .config file? I might be missing some
> > > > configuration which prevents the issue from happening on my side.
> > >
> > > Sure, here is my config.
> > >
> > > https://gist.github.com/kamatam9/a078bdd9f695e7a0767b061c60e48d50
> > >
> > > I confirmed that it's reliably reproducible with v6.2-rc3 as shown below.
> > >
> > > https://gist.github.com/kamatam9/096a79cf59d8ed8785c4267e917b8675
> >
> > Thanks! I'll try to figure out the difference.
>
> Sorry for the slow progress on this issue. I'm multiplexing between
> several tasks ATM but I did not forget about this one.
> Even though I still can't get the kasan UAF report, I clearly see the
> wrong order when tracing these functions and forcing the delay before
> ep_remove_wait_queue(). I don't think that should be happening, so
> something in the release process I think needs fixing. Will update
> once I figure out the root cause, hopefully sometime this week.

Hi Folks,
I spent some more time digging into the details and this is what's
happening. When we call rmdir to delete the cgroup with the pressure
file being epoll'ed, roughly the following call chain happens in the
context of the shell process:

do_rmdir
cgroup_rmdir
kernfs_drain_open_files
cgroup_file_release
cgroup_pressure_release
psi_trigger_destroy

Later on in the context of our reproducer, the last fput() is called
causing wait queue removal:

fput
ep_eventpoll_release
ep_free
ep_remove_wait_queue
remove_wait_queue

By this time psi_trigger_destroy() already destroyed the trigger's
waitqueue head and we hit UAF.
I think the conceptual problem here (or maybe that's by design?) is
that cgroup_file_release() is not really tied to the file's real
lifetime (when the last fput() is issued). Otherwise fput() would call
eventpoll_release() before f_op->release() and the order would be fine
(we would remove the wait queue first in eventpoll_release() and then
f_op->release() would cause trigger's destruction).
Considering these findings, I think we can use the wake_up_pollfree()
without contradicting the comment at
https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L253
because indeed, cgroup_file_release() and therefore
psi_trigger_destroy() are not tied to the file's lifetime.

I'm CC'ing Tejun to check if this makes sense to him and
cgroup_file_release() is working as expected in this case.

Munehisha, if Tejun confirms this is all valid, could you please post
a patch replacing wake_up_interruptible() with wake_up_pollfree()? We
don't need to worry about wake_up_all() because we have a limitation
of one trigger per file descriptor:
https://elixir.bootlin.com/linux/latest/source/kernel/sched/psi.c#L1419,
so there can be only one waiter.
Thanks,
Suren.


>
>
> > Suren.
> >
> > >
> > >
> > > Regards,
> > > Munehisa
> > >
> > >
> > > > Thanks,
> > > > Suren.
> > > >
> > > > >
> > > > >
> > > > > >
> > > > > > [1] https://lore.kernel.org/lkml/20220111232309.1786347-1-surenb@xxxxxxxxxx/
> > > > > >
> > > > > > Thanks,
> > > > > > Suren.
> > > > > >
> > > > > > >
> > > > > > > Hillf
> > > >
> > > >