Re: [PATCH] mm: page_alloc: Reset fair zone allocation policy only when batch counts are expired

From: Johannes Weiner
Date: Thu May 29 2014 - 10:38:51 EST


Hi Mel!

On Thu, May 29, 2014 at 10:04:32AM +0100, Mel Gorman wrote:
> The fair zone allocation policy round-robins allocations between zones on
> a node to avoid age inversion problems during reclaim using a counter to
> manage the round-robin. If the first allocation fails, the batch counts get
> reset and the allocation is attempted again before going into the slow path.
> There are at least two problems with this
>
> 1. If the eligible zones are below the low watermark we reset the counts
> even though the batches might be fine.

The idea behind setting the batches to high-low was that they should
be roughly exhausted by the time the low watermark is hit. And that
misconception must be the crux of this patch, because if they *were*
to exhaust together this patch wouldn't make a difference.

But once they diverge, we reset the batches prematurely, which means
not everybody is getting their fair share, and that reverts us back to
an imbalance in zone utilization.

So I think the changelog should include why this assumption was wrong.

> 2. We potentially do batch resets even when the right choice is to fallback
> to other nodes.

We only fall back to other nodes when the fairness cycle is over and
all local zones have been considered fair and square. Why *not* reset
the batches and start a new fairness cycle at this point? Remember
that remote nodes are not - can not - be part of the fairness cycle.

So I think this one is a red herring.

> When resetting batch counts, it was expected that the count would be <=
> 0 but the bizarre side-effect is that we are resetting counters that were
> initially postive so (high - low - batch) potentially sets a high positive
> batch count to close to 0. This leads to a premature reset in the near
> future, more overhead and more ... screwing around.

We're just adding the missing delta between the "should" and "is"
value to the existing batch, so a high batch value means small delta,
and we *add* a value close to 0, we don't *set* the batch close to 0.

I think this one is a red herring as well.

> The user-visible effect depends on zone sizes and a host of other effects
> the obvious one is that single-node machines with multiple zones will see
> degraded performance for streaming readers at least. The effect is also
> visible on NUMA machines but it may be harder to identify in the midst of
> other noise.
>
> Comparison is tiobench with data size 2*RAM on ext3 on a small single-node
> machine and on an ext3 filesystem. Baseline kernel is mmotm with the
> shrinker and proportional reclaim patches on top.
>
> 3.15.0-rc5 3.15.0-rc5
> mmotm-20140528 fairzone-v1r1
> Mean SeqRead-MB/sec-1 120.95 ( 0.00%) 133.59 ( 10.45%)
> Mean SeqRead-MB/sec-2 100.81 ( 0.00%) 113.61 ( 12.70%)
> Mean SeqRead-MB/sec-4 93.75 ( 0.00%) 104.75 ( 11.74%)
> Mean SeqRead-MB/sec-8 85.35 ( 0.00%) 91.21 ( 6.86%)
> Mean SeqRead-MB/sec-16 68.91 ( 0.00%) 74.77 ( 8.49%)
> Mean RandRead-MB/sec-1 1.08 ( 0.00%) 1.07 ( -0.93%)
> Mean RandRead-MB/sec-2 1.28 ( 0.00%) 1.25 ( -2.34%)
> Mean RandRead-MB/sec-4 1.54 ( 0.00%) 1.51 ( -1.73%)
> Mean RandRead-MB/sec-8 1.67 ( 0.00%) 1.70 ( 2.20%)
> Mean RandRead-MB/sec-16 1.74 ( 0.00%) 1.73 ( -0.19%)
> Mean SeqWrite-MB/sec-1 113.73 ( 0.00%) 113.88 ( 0.13%)
> Mean SeqWrite-MB/sec-2 103.76 ( 0.00%) 104.13 ( 0.36%)
> Mean SeqWrite-MB/sec-4 98.45 ( 0.00%) 98.44 ( -0.01%)
> Mean SeqWrite-MB/sec-8 93.11 ( 0.00%) 92.79 ( -0.34%)
> Mean SeqWrite-MB/sec-16 87.64 ( 0.00%) 87.85 ( 0.24%)
> Mean RandWrite-MB/sec-1 1.38 ( 0.00%) 1.36 ( -1.21%)
> Mean RandWrite-MB/sec-2 1.35 ( 0.00%) 1.35 ( 0.25%)
> Mean RandWrite-MB/sec-4 1.33 ( 0.00%) 1.35 ( 1.00%)
> Mean RandWrite-MB/sec-8 1.31 ( 0.00%) 1.29 ( -1.53%)
> Mean RandWrite-MB/sec-16 1.27 ( 0.00%) 1.28 ( 0.79%)
>
> Streaming readers see a huge boost. Random random readers, sequential
> writers and random writers are all in the noise.

Impressive, but I would really like to understand what's going on
there.

Did you record the per-zone allocation numbers by any chance as well,
so we can see the difference in zone utilization?

> Signed-off-by: Mel Gorman <mgorman@xxxxxxx>
> ---
> mm/page_alloc.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++------------------------------------------
> 1 file changed, 47 insertions(+), 42 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2c7d394..70d4264 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1919,6 +1919,28 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
>
> #endif /* CONFIG_NUMA */
>
> +static void reset_alloc_batches(struct zonelist *zonelist,
> + enum zone_type high_zoneidx,
> + struct zone *preferred_zone)
> +{
> + struct zoneref *z;
> + struct zone *zone;
> +
> + for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> + /*
> + * Only reset the batches of zones that were actually
> + * considered in the fairness pass, we don't want to
> + * trash fairness information for zones that are not
> + * actually part of this zonelist's round-robin cycle.
> + */
> + if (!zone_local(preferred_zone, zone))
> + continue;
> + mod_zone_page_state(zone, NR_ALLOC_BATCH,
> + high_wmark_pages(zone) - low_wmark_pages(zone) -
> + atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
> + }
> +}
> +
> /*
> * get_page_from_freelist goes through the zonelist trying to allocate
> * a page.
> @@ -1936,6 +1958,7 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
> int did_zlc_setup = 0; /* just call zlc_setup() one time */
> bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
> (gfp_mask & __GFP_WRITE);
> + bool batch_depleted = (alloc_flags & ALLOC_FAIR);
>
> zonelist_scan:
> /*
> @@ -1960,11 +1982,13 @@ zonelist_scan:
> * time the page has in memory before being reclaimed.
> */
> if (alloc_flags & ALLOC_FAIR) {
> - if (!zone_local(preferred_zone, zone))
> - continue;
> if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
> continue;
> + batch_depleted = false;
> + if (!zone_local(preferred_zone, zone))
> + continue;

This only resets the local batches once the first non-local zone's
batch is exhausted as well. Which means that once we start spilling,
the fairness pass will never consider local zones again until the
first spill-over target is exhausted too. But no remote allocs are
allowed during the fairness cycle, so you're creating a pass over the
zonelist where only known-exhausted local zones are considered.

What's going on there?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/