Re: [PATCH] mm, page_alloc: fix build_zonerefs_node()

From: Michal Hocko
Date: Thu Apr 07 2022 - 09:24:07 EST


On Thu 07-04-22 14:12:38, David Hildenbrand wrote:
> On 07.04.22 14:04, Michal Hocko wrote:
> > On Thu 07-04-22 13:58:44, David Hildenbrand wrote:
> > [...]
> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>> index 3589febc6d31..130a2feceddc 100644
> >>> --- a/mm/page_alloc.c
> >>> +++ b/mm/page_alloc.c
> >>> @@ -6112,10 +6112,8 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
> >>> do {
> >>> zone_type--;
> >>> zone = pgdat->node_zones + zone_type;
> >>> - if (managed_zone(zone)) {
> >>> - zoneref_set_zone(zone, &zonerefs[nr_zones++]);
> >>> - check_highest_zone(zone_type);
> >>> - }
> >>> + zoneref_set_zone(zone, &zonerefs[nr_zones++]);
> >>> + check_highest_zone(zone_type);
> >>> } while (zone_type);
> >>>
> >>> return nr_zones;
> >>
> >> I don't think having !populated zones in the zonelist is a particularly
> >> good idea. Populated vs !populated changes only during page
> >> onlininge/offlining.
> >>
> >> If I'm not wrong, with your patch we'd even include ZONE_DEVICE here ...
> >
> > What kind of problem that would cause? The allocator wouldn't see any
> > pages at all so it would fallback to the next one. Maybe kswapd would
> > need some tweak to have a bail out condition but as mentioned in the
> > thread already. !populated or !managed for that matter are not all that
> > much different from completely depleted zones. The fact that we are
> > making that distinction has led to some bugs and I suspect it makes the
> > code more complex without a very good reason.
>
> I assume performance problems. Assume you have an ordinary system with
> multiple NUMA nodes and no MOVABLE memory. Most nodes will only have
> ZONE_NORMAL. Yet, you'd include ZONE_DMA* and ZONE_MOVABLE that will
> always remain empty to be traversed on each and every allocation
> fallback. Of course, we could measure, but IMHO at least *that* part of
> memory onlining/offlining is not the complicated part :D

You've got a good point here. I guess there will be usecases that really
benefit from every single CPU cycle spared in the allocator hot path
which really depends on the zonelists traversing.

I have very briefly had a look at our remaining usage of managed_zone()
and there are not that many left. We have 2 in page_alloc.c via
has_managed_dma(). I guess we could drop that one and use __GFP_NOWARN
in dma_atomic_pool_init but this is nothing really earth shattering.
The remaining occurances are in vmscan.c:
- should_continue_reclaim, pgdat_balanced - required because
they iterate all zones withing the zoneindex and need to
decide whether they are balanced or not. We can make empty
zones special case and make them always balanced
- kswapd_shrink_node - required because we would be increasing
reclaim target for empty zones
- update_reclaim_active - required because we do not want to
alter zone state if it is not a subject of the reclaim which
empty zones are not by definition.
- balance_pgdat - first check is likely a microoptimization,
reclaim_idx is needed to have a populated zone there
- wakeup_kswapd - I dunno
- shrink_node, allow_direct_reclaim, lruvec_lru_size - microptimizations
- pgdat_watermark_boosted - microptimizations I suspect as empty
zone shouldn't ever get watermark_boost
- pgdat_balanced - functionally dd

So we can get rid of quite some but we will still need some of them.
--
Michal Hocko
SUSE Labs