Re: [PATCH] maple_tree: Fix sparse reported issues
From: Liam Howlett
Date: Mon Jul 18 2022 - 08:57:29 EST
* Hugh Dickins <hughd@xxxxxxxxxx> [220718 00:28]:
> On Mon, 18 Jul 2022, Liam Howlett wrote:
> > * Hugh Dickins <hughd@xxxxxxxxxx> [220717 16:58]:
> > > On Fri, 15 Jul 2022, Liam Howlett wrote:
> > > >
> > > > Please find attached the last outstanding fix for this series. It
> > > > covers a rare failure scenario that one of the build bots reported.
> > > > Ironically, it changes __vma_adjust() more than the rest of the series,
> > > > but leaves the locking in the existing order.
> > >
> > > Thanks, I folded that in to my testing on next-20220715, along with
> > > other you posted on Friday (mas_dead_leaves() walk fix);
> >
> > Please drop that patch, it needs more testing.
>
> Drop the mas_dead_leaves() walk fix, or the __vma_adjust() changes
> which you attached in that mail to me? please let me know a.s.a.p,
> since I cannot proceed without knowing which.
Drop the mas_dead_leaves() walk fix please. I responded to the patch
that it's not tested enough. I'll respond to the rest of this email
soon.
>
> >
> > > but have not
> > > even glanced at the v11 you posted Saturday, though I see from responses
> > > that v11 has some other changes, including __vma_adjust() again, but I
> > > just have not looked. (I've had my own experiments in __vma_adjust()).
> > >
> > > You'll be wanting my report, I'll give it here rather than in the v11
> > > thread. In short, some progress, but still frustratingly none on the
> > > main crashes.
> > >
> > > 1. swapops.h BUG on !PageLocked migration entry. This is *not* the
> > > most urgent to fix, I'm just listing it first to get it out of the way
> > > here. This is the BUG that terminates every tmpfs swapping load test
> > > on the laptop: only progress was that, just one time, the workstation
> > > hit it before hitting its usual problems: nice to see it there too.
> > > I'll worry about this bug when the rest is running stably. I've only
> > > ever noticed it when it's brk being unmapped, I expect that's a clue.
> >
> > Thanks for pointing me towards a usable reproducer. I should be able to
> > narrow it down from there, especially with the brk hint.
>
> I'm afraid I cannot point you to a good reproducer; but anyway, the BUG
> necessarily appears some time after whatever code is at fault has been
> exercised, so it needs thought rather than any reproducer. It was not
> obvious to me, but I'll think it through again, once the other issues
> are resolved.
>
> >
> > >
> > > 2. Silly in do_mas_mumap():
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -2513,7 +2513,7 @@ int do_mas_munmap(struct ma_state *mas, struct mm_struct *mm,
> > > arch_unmap(mm, start, end);
> > >
> > > /* Find the first overlapping VMA */
> > > - vma = mas_find(mas, end - 1);
> > > + vma = mas_find(mas, start);
> > > if (!vma)
> > > return 0;
> > >
> > > Fixing that does fix some bad behaviors seen - I'd been having crashes in
> > > unmap_vmas() and unlink_anon_vmas(), and put "if (WARN_ON(!vma)) return;"
> > > in unmap_region(); but that no longer seems necessary now do_mas_munmap()
> > > is fixed thus. I had high hopes that it would fix all the rest, but no.
> >
> > This is actually correct. mas_find() is not the same as vma_find().
> > mas_find() uses the maple state index and searches until a limit (end
> > -1 in this case). I haven't seen these crashes, but I think you are
> > hitting the same issue you are discussing in #6 below. I also hadn't
> > realised the potential confusion of those APIs.
>
> You're right, I'm wrong, sorry about that. But then I'm left with the
> conundrum of how a class of crashes went away when I changed that. Did
> I break it all so badly that it couldn't reach the anon_vma bugs I was
> hitting before? Or did making that change coincide with my moving from
> DEBUG_MAPLE off to on, so different crashes came sooner? Or did I fold
> in another fix from you around that time? I don't know (and I don't
> expect you to answer this!), but I've got some backtracking to do.
>
> >
> > >
> > > 3. vma_adjust_trans_huge(). Skip this paragraph, I think there
> > > is actually no problem here, but for safety I have rearranged the
> > > vma_adjust_trans_huge()s inside the rmap locks, because when things
> > > aren't working right, best to rule out some possibilities. Why am
> > > I even mentioning it here? In case I send any code snippets and
> > > you're puzzled by vma_adjust_trans_huge() having moved.
> >
> > Thanks, I will keep this in mind.
> >
> > >
> > > 4. unlink_anon_vmas() in __vma_adjust(). Not urgent to fix (can only
> > > be an issue when GFP_KERNEL preallocation fails, which I think means
> > > when current is being OOM-killed), but whereas vma_expand() has careful
> > > anon_cloned flag, __vma_adjust() does not, and I think could be
> > > unlinking a pre-existing anon_vma. Aside from that, I am nervous about
> > > using unlink_anon_vmas() there like that (and in vma_expand()): IIUC it's
> > > an unnecessary "optimization" for a very unlikely case, that's in danger
> > > of doing more harm than good; and should be removed from them both (then
> > > they can both simply return -ENOMEM when mas_preallocate() fails).
> >
> > I will add a flag to __vma_adjust, but I don't see how it could happen.
> > I guess if importer has an anon_vma already? I added these as an unwind
> > since I don't want to leak - even on the rare preallocation failure. If
> > you don't want to unroll, what would you suggest in these cases? Would
> > a flag be acceptable?
>
> I cannot see what purpose adding a flag to __vma_adjust() would serve.
> If importer had anon_vma already, yes, without checking the code again,
> that was I think what I had in mind. But (correct me if you've observed
> that I'm wrong) there's no question of a leak there: a vma which wasn't
> given an anon_vma before gets linked in to one, and it will all get torn
> down correctly when the process exits (indeed, when OOM kill completes).
>
> It's "nice" to delay setting anon_vma until it's needed, but not worth
> any effort to rewind it (especially on such an unlikely path): and
> normally, once anon_vma has been set, it stays set - I'm not sure of
> the consequence of unsetting it again (racing with rmap lookups: may
> be okay because of how anon_vma is not trusted when page is not mapped;
> but it's easier just to omit the rewind than think all that through).
>
> >
> > >
> > > 5. I was horrified/thrilled to notice last night that mas_store_prealloc()
> > > does a mas_destroy() at the end. So only the first vma_mas_store() in
> > > __vma_adjust() gets to use the carefully preallocated nodes. I thought
> > > that might be responsible for lots of nastiness, but again sadly no.
> > > Maybe it just falls back to GFP_ATOMIC when the preallocateds are gone
> > > (I didn't look) and that often works okay. Whether the carefully
> > > chosen prealloc count allows for __vma_adjust() use would be another
> > > question. (Earlier on I did try doubling its calculation, in case it
> > > was preallocating too few, but that made no difference either.)
> >
> > mas_store_prealloc will allocate for the worst case scenario. Since I
> > cannot guarantee the second store isn't the worst case, and could fail,
> > I cannot allow for more than one allocation per preallocate. I thought
> > I was fine in __vma_adjust since I preallocate after the goto label for
> > a second removal but it turns out I wasn't since the second preallocate
> > could fail, so I've removed the requirement for a second store for 'case
> > 6' by updating the tree once and removing both VMAs in a single pass.
> > There could, potentially be an issue if the caller to __vma_merge()
> > wanted to reduce one limit of the VMA (I guess just the start..) and
> > also remove one or more VMAs, and also be in a situation where
> > allocations could cause issues (fs_reclaim).. and since __vma_adjust()
> > is so complicated, I looked at the callers. Most use vma_merge(), and
> > those seem fine. fs/exec only adjusts one at a time. when splitting,
> > only a single insert happens. Did I miss some situation(s)?
>
> I don't think the fs/exec stack moving will be any worry for this.
> Did you miss any case? Yes, the "insert" cases from __split_vma().
> I have no appreciation of when maple tree actually needs to make an
> allocation, so I don't know whether the adjust_next case ever needs
> to make an allocation, but I'd have thought the insert case might
> need to sometimes.
>
> But I'll admit to skimming your paragraph there, I'm more concerned
> to go on to the following issue than fully grasp your argument above.
>
> >
> > >
> > > 6. The main problem, crashes on the workstation (never seen on the
> > > laptop). I keep hacking around with the debug info (and, please,
> > > %px not %p, %lx not %lu in the debug info: I've changed them all,
> >
> > Okay, so I tried to remove all %px in the debug code so I'll revert
> > those. I use %lu for historic reasons from mt_dump(), I can change
> > those too, The tree uses ranges to store pointers so I print the
> > pointers in %lx and the ranges in %lu.
> >
> >
> > > and a couple of %lds, in my copy of lib/maple_tree.c). I forget
> > > how the typical crashes appeared with the MAPLE_DEBUGs turned off
> > > (the BUG_ON(count != mm->map_count) in exit_mmap() perhaps); I've
> > > turned them on, but usually comment out the mt_validate() and
> > > mt_dump(), which generate far too much noise for non-specialists,
> > > and delay the onset of crash tenfold (but re-enabled to give you
> > > the attached messages.xz).
> > >
> > > Every time, it's the cc1 (9.3.1) doing some munmapping (cc1 is
> > > mostly what's running in this load test, so that's not surprising;
> > > but surprising that even when I switched the laptop to using same
> > > gcc-9, couldn't reproduce the problem there). Typically, that
> > > munmapping has involved splitting a small, say three page, vma
> > > into two pages below and one page above (the "insert", and I've
> > > hacked the debug to dump that too, see "mmap: insert" - ah,
> > > looking at the messages now, the insert is bigger this time).
> > >
> > > And what has happened each time I've studied it (I don't know
> > > if this is evident from the mt dumps in the messages, I hope so)
> > > is that the vma and the insert have been correctly placed in the
> > > tree, except that the vma has *also* been placed several pages
> > > earlier, and a linear search of the tree finds that misplaced
> > > instance first, vm_start not matching mt index.
> > >
> > > The map_count in these cases has always been around 59, 60, 61:
> > > maybe just typical for cc1, or maybe significant for maple tree?
> > >
> > > I won't give up quite yet, but I'm hoping you'll have an idea for
> > > the misplaced tree entry. Something going wrong in rebalancing?
> > >
> > > For a long time I assumed the problem would be at the mm/mmap.c end,
> > > and I've repeatedly tried different things with the vma_mas stuff
> > > in __vma_adjust() (for example, using vma_mas_remove() to remove
> > > vmas before re-adding them, and/or doing mas_reset() in more places);
> > > but none of those attempts actually fixed the issue. So now I'm
> > > thinking the problem is really at the lib/maple_tree.c end.
> > >
> >
> > Do you have the patch
> > "maple_tree-Fix-stale-data-copy-in-mas_wr_node_store.patch"? It sounds
> > like your issue fits this fix exactly. I was seeing the same issue with
> > gcc 9.3.1 20200408 and this bug doesn't happen for me now. The logs
> > you sent also fit the situation. I went through the same exercise
> > (exorcism?) of debugging the various additions and removals of the VMA
> > only to find the issue in the tree itself. The fix also modified the
> > test code to detect the issue - which was actually hit but not detected
> > in the existing test cases from a live capture of VMA activities. It is
> > difficult to spot in the tree dump as well. I am sure I sent this to
> > Andrew as it is included in v11 and did not show up in his diff, but I
> > cannot find it on lore, perhaps I forgot to CC you? I've attached it
> > here for you in case you missed it.
>
> Thanks! No, I never received that patch, nor can I see it on lore
> or marc.info; but I (still) haven't looked at v11, and don't know
> about Andrew's diff. Anyway, sounds exciting, I'm eager to stop
> writing this mail and get to testing with that in - but please
> let me know whether it's the mas_dead_leaves() or the __vma_adjust()
> mods you attached previously, which you want me to leave out.
>
> >
> > You are actually hitting the issue several iterations beyond when it
> > first occurred. It was introduced earlier in the tree and exposed with
> > your other operations by means of a node split or merge.
> >
> > > 7. If you get to do cleanups later, please shrink those double blank
> > > lines to single blank lines. And find a better name for the strange
> > > vma_mas_szero() - vma_mas_erase(), or is erase something different?
> > > I'm not even sure that it's needed: that's a special case for exec's
> > > shift_arg_pages() VM_STACK_INCOMPLETE_SETUP, which uses __vma_adjust()
> > > in a non-standard way. And trace_vma_mas_szero() in vma_mas_remove()
> > > looks wrong.
> >
> > Okay, I'll be sure to only have one blank line. Where do you see this?
> > I would have thought that checkpatch would complain? I did try to,
>
> No, I'm not going to search for those double blank lines now:
> please do your own diff and look through for them. And I don't know
> whether checkpatch objects to them or not, but they're bad for patch
> application, since they increase the likelihood that a patch applies
> in the wrong place.
>
> As to whether this is or is not a good time for such cleanups,
> I just don't know: I see Andrew on the one hand intending to drop
> maple tree for the moment, but Linus on the other hand promising
> an extra week for 5.19. I'll just let others decide what they want.
>
> Hugh
>
> > regretfully, address more checkpatch issues on v11. It added more noise
> > to the differences of v10 + patches to v11 than anything else.
> >
> >
> > Thanks again,
> > Liam