Re: [PATCH v5 3/3] mm/khugepaged: maintain page cache uptodate flag

From: Matthew Wilcox
Date: Thu Mar 23 2023 - 23:30:48 EST


On Thu, Mar 23, 2023 at 06:56:34PM -0700, Hugh Dickins wrote:
> On Thu, 23 Mar 2023, Song Liu wrote:
> > On Thu, Mar 23, 2023 at 2:56 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 12:07:46PM -0700, Hugh Dickins wrote:
> > > > On an earlier audit, for different reasons, I did also run across
> > > > lib/buildid.c build_id_parse() using find_get_page() without checking
> > > > PageUptodate() - looks as if it might do the wrong thing if it races
> > > > with khugepaged collapsing text to huge, and should probably have a
> > > > similar fix.
> > >
> > > That shouldn't be using find_get_page(). It should probably use
> > > read_cache_folio() which will actually read in the data if it's not
> > > present in the page cache, and return an ERR_PTR if the data couldn't
> > > be read.
> >
> > build_id_parse() can be called from NMI, so I don't think we can let
> > read_cache_folio() read-in the data.
>
> Interesting.
>
> This being the same Layering_Violation_ID which is asking for a home in
> everyone's struct file? (Okay, I'm being disagreeable, no need to answer!)

Yes, that's the one. Part of the BPF "splatter stuff all across the
kernel that we don't really understand" (see, I can be just as
disagreeable!)

> I think even the current find_get_page() is unsafe from NMI: imagine that
> NMI interrupting a sequence (maybe THP collapse or splitting, maybe page
> migration, maybe others) when the page refcount has been frozen to 0:
> you'll just have to reboot the machine?

Correct; if it's been frozen by this CPU, we'll spin forever. I think
the other conditions where we retry are temporary and caused by
something _another_ CPU is doing. For example, if _this_ CPU is in the
middle of modifying the tree when an NMI happens, we won't hit the
xas_retry() case. That can only happen if we've observed something
another CPU did, and then a second write happened from that same other
CPU. The easiest example of that would be that we hit this kind of
race:

CPU 0 CPU 1
load root of tree, points to node
store entry in root of tree
wmb();
store RETRY entry in node
load entry from node, observe RETRY entry

The retry will happen and we'll observe the new state of the tree with
the entry we're looking for in the root.

If CPU 1 takes an NMI and interrupts itself, it will always see a
consistent tree.

> I guess the RCU-safety of find_get_page() implies that its XArray basics
> are safe in NMI; but you need a low-level variant (xas_find()?) which
> does none of the "goto retry"s, and fails immediately if anything is
> wrong - including !PageUptodate.

The Uptodate flag check needs to be done by the caller; the
find_get_page() family return !uptodate pages.

But find_get_page() does not advertise itself as NMI-safe. And I
think it's wrong to try to make it NMI-safe. Most of the kernel is
not NMI-safe. I think it's incumbent on the BPF people to get the
information they need ahead of taking the NMI. NMI handlers are not
supposed to be doing a huge amount of work! I don't really understand
why it needs to do work in NMI context; surely it can note the location of
the fault and queue work to be done later (eg on irq-enable, task-switch
or return-to-user)