Re: [PATCH -mm 2/2] mm: kswapd carefully invoke compaction

From: Mel Gorman
Date: Thu Jan 12 2012 - 11:12:34 EST


On Mon, Jan 09, 2012 at 09:33:57PM -0500, Rik van Riel wrote:
> With CONFIG_COMPACTION enabled, kswapd does not try to free
> contiguous free pages, even when it is woken for a higher order
> request.
>
> This could be bad for e.g. jumbo frame network allocations, which
> are done from interrupt context and cannot compact memory themselves.
> Higher-than-before allocation failure rates have been observed in
> the network receive path on kernels with compaction enabled.
>
> Teach kswapd to defragment the memory zones in a node, but only
> if required and compaction is not deferred in a zone.
>
> Signed-off-by: Rik van Riel <riel@xxxxxxxxxx>

heh, this is not the first time we've had kswapd doing compaction. It was
a bit of a disaster the last time around, but a lot has changed since then.

Care is needed here though. There are patches in flight that implement
sync-light compaction. Part of that series includes a fix that makes
compact_node use sync compaction. That is fine for a process writing to
/proc, but not so fine for kswapd. Watch out for that.
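
For reference, compact_node() in that fix ends up with a control
initialised roughly like this (from memory, the exact form may differ):

	struct compact_control cc = {
		.order = -1,
		.sync = true,	/* fine for a /proc writer */
	};

If this patch is layered on top unadjusted, kswapd would inherit that
sync behaviour through the shared path, and kswapd should never do sync
compaction.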

> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index bb2bbdb..df31dab 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -23,6 +23,7 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
> extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> 			int order, gfp_t gfp_mask, nodemask_t *mask,
> 			bool sync);
> +extern int compact_pgdat(pg_data_t *pgdat, int order, bool force);
> extern unsigned long compaction_suitable(struct zone *zone, int order);
>
> /* Do not skip compaction more than 64 times */
> @@ -62,6 +63,11 @@ static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> 	return COMPACT_CONTINUE;
> }
>
> +static inline int compact_pgdat(pg_data_t *pgdat, int order, bool force)
> +{
> +	return COMPACT_CONTINUE;
> +}
> +
> static inline unsigned long compaction_suitable(struct zone *zone, int order)
> {
> 	return COMPACT_SKIPPED;
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 1253d7a..1962a0e 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -651,16 +651,11 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
>
>
> /* Compact all zones within a node */
> -static int compact_node(int nid)
> +int compact_pgdat(pg_data_t *pgdat, int order, bool force)

force is a bad parameter to have in a VM-internal API exposed to kswapd.
By and large, I'm trying to move away from this sort of bool towards a
compact_mode of sync, sync-light or async.

I would much prefer it if what was exposed to kswapd were a compact_pgdat()
that created a suitable compact_control (async compaction, I would
expect) and passed it to a static __compact_pgdat() shared by both
kswapd and compact_nodes.

That would stop spreading too much knowledge of compaction to kswapd.
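
Roughly the shape I have in mind, as an untested sketch (the
__compact_pgdat() split and the two-argument compact_pgdat() are
hypothetical; the compact_control fields are the existing ones):

/* Shared worker: walks the node's zones with a caller-supplied control */
static int __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
{
	int zoneid;
	struct zone *zone;

	/* Flush pending updates to the LRU lists */
	lru_add_drain_all();

	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
		zone = &pgdat->node_zones[zoneid];
		if (!populated_zone(zone))
			continue;

		cc->nr_freepages = 0;
		cc->nr_migratepages = 0;
		cc->zone = zone;
		INIT_LIST_HEAD(&cc->freepages);
		INIT_LIST_HEAD(&cc->migratepages);

		/* order == -1 is the "compact everything" case */
		if (cc->order == -1 || !compaction_deferred(zone))
			compact_zone(zone, cc);

		VM_BUG_ON(!list_empty(&cc->freepages));
		VM_BUG_ON(!list_empty(&cc->migratepages));
	}

	return 0;
}

/* kswapd entry point: async compaction only */
int compact_pgdat(pg_data_t *pgdat, int order)
{
	struct compact_control cc = {
		.order = order,
		.sync = false,
	};

	return __compact_pgdat(pgdat, &cc);
}

/* /proc entry point: sync compaction of the whole node */
static int compact_node(int nid)
{
	struct compact_control cc = {
		.order = -1,
		.sync = true,
	};

	return __compact_pgdat(NODE_DATA(nid), &cc);
}

Treating order == -1 as "compact everything" also kills the force
parameter outright.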

> {
> 	int zoneid;
> -	pg_data_t *pgdat;
> 	struct zone *zone;
>
> -	if (nid < 0 || nid >= nr_node_ids || !node_online(nid))
> -		return -EINVAL;
> -	pgdat = NODE_DATA(nid);
> -
> 	/* Flush pending updates to the LRU lists */
> 	lru_add_drain_all();
>
> @@ -668,7 +663,7 @@ static int compact_node(int nid)
> 		struct compact_control cc = {
> 			.nr_freepages = 0,
> 			.nr_migratepages = 0,
> -			.order = -1,
> +			.order = order,
> 		};
>
> 		zone = &pgdat->node_zones[zoneid];

Just to be clear, there is a fix that applies here that initialises
cc.sync, and it would have caused problems for this patch.

> @@ -679,7 +674,8 @@ static int compact_node(int nid)
> 		INIT_LIST_HEAD(&cc.freepages);
> 		INIT_LIST_HEAD(&cc.migratepages);
>
> -		compact_zone(zone, &cc);
> +		if (force || !compaction_deferred(zone))
> +			compact_zone(zone, &cc);
>
> 		VM_BUG_ON(!list_empty(&cc.freepages));
> 		VM_BUG_ON(!list_empty(&cc.migratepages));
> @@ -688,6 +684,16 @@ static int compact_node(int nid)
> 	return 0;
> }
>
> +static int compact_node(int nid)
> +{
> +	pg_data_t *pgdat;
> +	if (nid < 0 || nid >= nr_node_ids || !node_online(nid))
> +		return -EINVAL;
> +	pgdat = NODE_DATA(nid);
> +
> +	return compact_pgdat(pgdat, -1, true);
> +}
> +
> /* Compact all nodes in the system */
> static int compact_nodes(void)
> {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0fb469a..65bf21db 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2522,6 +2522,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> 	int priority;
> 	int i;
> 	int end_zone = 0;	/* Inclusive. 0 = ZONE_DMA */
> +	int zones_need_compaction = 1;
> 	unsigned long total_scanned;
> 	struct reclaim_state *reclaim_state = current->reclaim_state;
> 	unsigned long nr_soft_reclaimed;
> @@ -2788,9 +2789,17 @@ out:
> 			goto loop_again;
> 		}
>
> +		/* Check if the memory needs to be defragmented. */
> +		if (zone_watermark_ok(zone, order,
> +				low_wmark_pages(zone), *classzone_idx, 0))
> +			zones_need_compaction = 0;
> +
> 		/* If balanced, clear the congested flag */
> 		zone_clear_flag(zone, ZONE_CONGESTED);
> 	}
>
> +	if (zones_need_compaction)
> +		compact_pgdat(pgdat, order, false);

hmm, I'm not fully convinced this is the right way of deciding whether
compaction should be used. If patch 1 used compaction_suitable() to
decide when to stop reclaiming, instead of a watermark check, we could
then call compact_pgdat() in the exit path of balance_pgdat(). That
would minimise the risk that kswapd gets stuck in compaction when it
should be reclaiming pages.
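
To illustrate, something like this at the end of balance_pgdat() (an
untested sketch, assuming the reworked two-argument compact_pgdat()
from above; compaction_suitable() returning COMPACT_CONTINUE means the
zone has enough free pages for compaction to make progress but the
high-order watermark is still not met, which is exactly when compacting
beats more reclaim):

	int zones_need_compaction = 0;

	for (i = 0; i <= end_zone; i++) {
		struct zone *zone = pgdat->node_zones + i;

		if (!populated_zone(zone))
			continue;

		/*
		 * Let compaction_suitable() make the call instead of
		 * rechecking the low watermark directly.
		 */
		if (COMPACTION_BUILD && order &&
		    compaction_suitable(zone, order) == COMPACT_CONTINUE)
			zones_need_compaction = 1;
	}

	if (zones_need_compaction)
		compact_pgdat(pgdat, order);

What do you think?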

--
Mel Gorman
SUSE Labs