Re: [RFC PATCH 0/4] Fix excessive CPU usage during compaction

From: Mel Gorman
Date: Thu Jan 26 2023 - 04:04:22 EST


On Wed, Jan 25, 2023 at 05:11:59PM -0800, Andrew Morton wrote:
> On Wed, 25 Jan 2023 13:44:30 +0000 Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> > Commit 7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock")
> > fixed a problem where pageblocks found by fast_find_migrateblock() were
> > ignored. Unfortunately there were numerous bug reports complaining about high
> > CPU usage and massive stalls once 6.1 was released. Due to the severity,
> > the patch was reverted by Vlastimil as a short-term fix[1] to -stable and
> > is currently sitting in the Andrew's git branch mm/mm-hotfixes-unstable.
> >
> > The underlying problem for each of the bugs is suspected to be the
> > repeated scanning of the same pageblocks. This series should guarantee
> > forward progress even with commit 7efc3b726103. More information is in
> > the changelog for patch 4.
> >
> > If this series is accepted and merged after the revert of 7efc3b726103
> > then a "revert of the revert" will be needed.
>
> If we drop Vlastimil's reversion and apply this, the whole series
> should be cc:stable and it isn't really designed for that.
>
> So I think either
>
> a) drop Vlastimil's reversion and persuade Mel to send us a minimal
> version of patch #4 for -stable consumption. Patches 1-3 of this
> series come later.
>
> b) go ahead with Vlastimil's revert for -stable, queue up this
> series for 6.3-rc1 and redo the original "fix set skip in
> fast_find_migrateblock" some time in the future.
>
> If we go with b) then the Fixes: tag in "[PATCH 4/4] mm, compaction:
> Finish pageblocks on complete migration failure" is inappropriate -
> fixing a reverted commit which Vlastimil's revert already fixed.
>
> I'll plan on b) for now.

I think plan b) is more appropriate because Vlastimil's revert contains
useful information on this class of bug that may be useful again in the
future. There is no harm preserving that in both the mainline and -stable
git history.

The series worked for me but that was a limited test case and high CPU
usage bugs could crop up again in 6.2. If such bugs show up then reverting
("mm/compaction: fix set skip in fast_find_migrateblock") becomes a
short-term option again while it is analysed.

I disagee that the Fixes tag is inappropriate. Commit 7efc3b726103 is simply
a messenger that fast_find_migrateblock() was broken and while it fixed one
problem, it triggered many more because the fix was incomplete. Vlastimil's
fix was a short-term "fix" to mitigate the fallout until a proper fix was
available. This series is a more comprehensive fix but both are "Fixes"
to commit 7efc3b726103 in terms of what kernel has very obvious problems
that needs to be fixed. Any kernel including 7efc3b726103 should include
either Vlastimil's fix, mine or both if git history is being preserved by the
backports. While there is an older commit that broke fast_find_migrateblock,
most people simply didn't notice where as commit 7efc3b726103 was noticeable
by a number of people very quickly.

However, if you disagree then the appropriate fixes would be

Fixes: 70b44595eafe9 ("mm, compaction: use free lists to quickly locate a migration source")

That would at least indicate to anyone doing the backport that they need
to ensure the backport does not have any hidden dependencies.

--
Mel Gorman
SUSE Labs