Re: [rfc 3/4] mm, page_alloc: avoid expensive reclaim when compaction may not succeed

From: Vlastimil Babka
Date: Thu Sep 05 2019 - 07:22:43 EST

In reply to: Michal Hocko: "Re: [rfc 3/4] mm, page_alloc: avoid expensive reclaim when compaction may not succeed"
Next in thread: Mike Kravetz: "Re: [rfc 3/4] mm, page_alloc: avoid expensive reclaim when compaction may not succeed"

On 9/5/19 11:00 AM, Michal Hocko wrote:
> [Ccing Mike for checking on the hugetlb side of this change]
>
> On Wed 04-09-19 12:54:22, David Rientjes wrote:
>> Memory compaction has a couple significant drawbacks as the allocation
>> order increases, specifically:
>>
>> - isolate_freepages() is responsible for finding free pages to use as
>> migration targets and is implemented as a linear scan of memory
>> starting at the end of a zone,

Note that's no longer entirely true, see fast_isolate_freepages().

>> - failing order-0 watermark checks in memory compaction does not account
>> for how far below the watermarks the zone actually is: to enable
>> migration, there must be *some* free memory available. Per the above,
>> watermarks are not always sufficient if isolate_freepages() cannot
>> find the free memory but it could require hundreds of MBs of reclaim to
>> even reach this threshold (read: potentially very expensive reclaim with
>> no indication compaction can be successful), and

I doubt it's hundreds of MBs for a 2MB hugepage.
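
For a sense of scale, here is a rough sketch of the arithmetic. It assumes
4 KiB base pages and that compact_gap() is defined as 2UL << order; the
helper names are illustrative, not kernel API. Under those assumptions the
gap compaction wants on top of the watermark for order-9 is 1024 pages,
i.e. 4 MiB. How far below the watermark a zone actually sits is a separate,
workload-dependent quantity that this does not estimate.

```c
#include <stdio.h>

/* Illustrative copy of the gap formula: twice the allocation size, in pages.
 * Assumption: this matches the kernel's compact_gap() (2UL << order). */
static unsigned long compact_gap_pages(unsigned int order)
{
	return 2UL << order;
}

/* Convert a page count to KiB, assuming 4 KiB base pages. */
static unsigned long pages_to_kib(unsigned long pages)
{
	return pages * 4;
}

/* Print the gap for a given order in both pages and KiB. */
static void print_gap(unsigned int order)
{
	unsigned long pages = compact_gap_pages(order);

	printf("order-%u gap: %lu pages (%lu KiB)\n",
	       order, pages, pages_to_kib(pages));
}
```

Calling print_gap(9) reports the order-9 (2MB hugepage on x86) case.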

>> - if compaction at this order has failed recently so that it does not even
>> run as a result of deferred compaction, looping through reclaim can often
>> be pointless.

Agreed.

>> For hugepage allocations, these are quite substantial drawbacks because
>> these are very high order allocations (order-9 on x86) and falling back to
>> doing reclaim can potentially be *very* expensive without any indication
>> that compaction would even be successful.

You seem to lump together hugetlbfs and THP here, by saying "hugepage",
but these are very different things - hugetlbfs reservations are
expected to be potentially expensive.

>> Reclaim itself is unlikely to free entire pageblocks and certainly no
>> reliance should be put on it to do so in isolation (recall lumpy reclaim).
>> This means we should avoid reclaim and simply fail hugepage allocation if
>> compaction is deferred.

It is however possible that reclaim frees enough to make even a
previously deferred compaction succeed.

>> It is also not helpful to thrash a zone by doing excessive reclaim if
>> compaction may not be able to access that memory. If order-0 watermarks
>> fail and the allocation order is sufficiently large, it is likely better
>> to fail the allocation rather than thrashing the zone.
>>
>> Signed-off-by: David Rientjes <rientjes@xxxxxxxxxx>
>> ---
>> mm/page_alloc.c | 22 ++++++++++++++++++++++
>> 1 file changed, 22 insertions(+)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4458,6 +4458,28 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>  		if (page)
>>  			goto got_pg;
>>
>> +		if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
>> +			/*
>> +			 * If allocating entire pageblock(s) and compaction
>> +			 * failed because all zones are below low watermarks
>> +			 * or is prohibited because it recently failed at this
>> +			 * order, fail immediately.
>> +			 *
>> +			 * Reclaim is
>> +			 *  - potentially very expensive because zones are far
>> +			 *    below their low watermarks or this is part of very
>> +			 *    bursty high order allocations,
>> +			 *  - not guaranteed to help because isolate_freepages()
>> +			 *    may not iterate over freed pages as part of its
>> +			 *    linear scan, and
>> +			 *  - unlikely to make entire pageblocks free on its
>> +			 *    own.
>> +			 */
>> +			if (compact_result == COMPACT_SKIPPED ||
>> +			    compact_result == COMPACT_DEFERRED)
>> +				goto nopage;

As I said, I expect this will make hugetlbfs reservations fail
prematurely - Mike can probably confirm or disprove that.
I also think it addresses the consequences rather than the primary problem.
I believe the primary problem is that we reclaim something even if
there's enough memory for compaction. This won't change with your patch,
as compact_result won't be SKIPPED in that case. We then continue
through __alloc_pages_direct_reclaim() to shrink_zones(), which calls
compaction_ready(). That returns true and skips reclaim of the zone only
if there are high_watermark (!!!) + compact_gap() free pages. But as
long as one zone isn't compaction_ready(), we enter shrink_node(), which
will reclaim something and call should_continue_reclaim(), where we might
finally notice that compaction_suitable() returns CONTINUE, and abort
reclaim.

Thus I think the right solution might be to really avoid reclaim for
zones where compaction is not skipped, while your patch avoids reclaim
when compaction is skipped. The per-node reclaim vs per-zone compaction
might complicate those decisions a lot, though.
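
The mismatch described above can be sketched with a toy userspace model.
Everything here is a simplification with hypothetical names (struct
toy_zone and the toy_* helpers are not kernel code). It assumes that for
costly orders compaction can run once free pages reach low watermark +
compact_gap(), while reclaim of the zone is skipped only at high watermark
+ compact_gap(). It ignores lowmem reserves, the fragmentation index and
the case where the allocation would already succeed.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct zone: page counts only. */
struct toy_zone {
	unsigned long free_pages;
	unsigned long low_wmark;
	unsigned long high_wmark;
};

/* Assumed to mirror compact_gap(): twice the allocation size, in pages. */
static unsigned long toy_compact_gap(unsigned int order)
{
	return 2UL << order;
}

/* Models "compaction is not skipped" for a costly order:
 * free pages reach low watermark + gap (an assumption, see above). */
static bool toy_compaction_can_run(const struct toy_zone *z, unsigned int order)
{
	return z->free_pages >= z->low_wmark + toy_compact_gap(order);
}

/* Models the compaction_ready() threshold: reclaim of the zone is
 * skipped only when free pages reach high watermark + gap. */
static bool toy_reclaim_skipped(const struct toy_zone *z, unsigned int order)
{
	return z->free_pages >= z->high_wmark + toy_compact_gap(order);
}

/* True when the zone would be reclaimed although compaction could
 * already run, i.e. free pages sit between the two thresholds. */
static bool toy_needless_reclaim(const struct toy_zone *z, unsigned int order)
{
	return toy_compaction_can_run(z, order) && !toy_reclaim_skipped(z, order);
}
```

In this model any zone whose free pages fall between low + gap and
high + gap is reclaimed even though compaction could proceed, which is the
window the patch does not touch.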

>> +		}
>> +
>>  		/*
>>  		 * Checks for costly allocations with __GFP_NORETRY, which
>>  		 * includes THP page fault allocations
>
