Re: [RFC 1/4] mm, compaction: introduce kcompactd

From: Mel Gorman
Date: Thu Jul 30 2015 - 06:58:13 EST


On Thu, Jul 02, 2015 at 10:46:32AM +0200, Vlastimil Babka wrote:
> Memory compaction can be currently performed in several contexts:
>
> - kswapd balancing a zone after a high-order allocation failure

Which was potentially a problem in itself, and one that is hard to detect.

> - direct compaction to satisfy a high-order allocation, including THP page
> fault attempts
> - khugepaged trying to collapse a hugepage
> - manually from /proc
>
> The purpose of compaction is two-fold. The obvious purpose is to satisfy a
> (pending or future) high-order allocation, and is easy to evaluate. The other
> purpose is to keep overall memory fragmentation low and help the
> anti-fragmentation mechanism. The success wrt the latter purpose is more
> difficult to evaluate.
>

The latter would be very difficult to measure. It would have to be shown
that compaction took movable pages out of a pageblock assigned to
unmovable or reclaimable pages, and that doing so prevented the pageblock
being stolen. You'd have to track all allocations/frees and compaction
events and run them through a simulator. Even then, it would only prove
or disprove the benefit for a single case.

The "obvious purpose" is sufficient justification IMO.

> The current situation wrt the purposes has a few drawbacks:
>
> - compaction is invoked only when a high-order page or hugepage is not
> available (or manually). This might be too late for the purposes of keeping
> memory fragmentation low.

Yep. The other side of the coin is that time can be spent compacting for
a non-existent user so there may be demand for tuning.

> - direct compaction increases latency of allocations. Again, it would be
> better if compaction were performed asynchronously to keep fragmentation low,
> before the allocation itself arrives.

Definitely. Ideally direct compaction stalls would never occur unless the
caller absolutely requires it.

> - (a special case of the previous) the cost of compaction during THP page
> faults can easily offset the benefits of THP.
>
> To improve the situation, we need an equivalent of kswapd, but for compaction.
> E.g. a background thread which responds to fragmentation and the need for
> high-order allocations (including hugepages) somewhat proactively.
>
> One possibility is to extend the responsibilities of kswapd, which could
> however complicate its design too much. It should be better to let kswapd
> handle reclaim, as order-0 allocations are often more critical than high-order
> ones.
>

Agreed. Kswapd compacting can cause a direct reclaim stall for order-0. One
motivation for kswapd doing the compaction was for high-order atomic
allocation failures. At the risk of distracting from this series, the
requirement for high-order atomic allocations is better served by "Remove
zonelist cache and high-order watermark checking" than by kswapd running
compaction. kcompactd would have a lot of value both for THP and for
allowing high-atomic reserves to grow quickly if necessary.

> Another possibility is to extend khugepaged, but this kthread is a single
> instance and tied to THP configs.
>
> This patch goes with the option of a new set of per-node kthreads called
> kcompactd, and lays the foundations. The lifecycle mimics that of the kswapd
> kthreads.
>
> The work loop of kcompactd currently mimics a pageblock-order direct
> compaction attempt every 15 seconds. This might not be enough to keep
> fragmentation low, and needs evaluation.
>

You could choose to adapt the rate based on the number of high-order
requests that entered the slow path. Initially I would not try that,
though; keep it simple first.
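
To illustrate what such adaptation might look like, a purely hypothetical
sketch, where nr_slowpath_highorder would be a new per-node atomic counter
bumped by the allocator slow path (none of these names exist in the patch
or the tree):

static unsigned int kcompactd_next_sleep(pg_data_t *pgdat)
{
	unsigned long pressure;

	/* Read and reset the high-order demand seen since last wakeup */
	pressure = atomic_long_xchg(&pgdat->nr_slowpath_highorder, 0);

	if (pressure > 100)
		return kcompactd_sleep_millisecs / 8;
	if (pressure > 0)
		return kcompactd_sleep_millisecs / 2;
	return kcompactd_sleep_millisecs;
}

The thresholds and divisors are arbitrary; the point is only that the
sleep interval could track observed demand rather than be tuned by hand.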

> When there's not enough free memory for compaction, kswapd is woken up for
> reclaim only (not compaction/reclaim).
>
> Further patches will add the ability to wake up kcompactd on demand in special
> situations such as when hugepages are not available, or when a fragmentation
> event occurred.
>
> Not-yet-signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
> ---
> include/linux/compaction.h | 11 +++
> include/linux/mmzone.h | 4 ++
> mm/compaction.c | 173 +++++++++++++++++++++++++++++++++++++++++++++
> mm/memory_hotplug.c | 15 ++--
> mm/page_alloc.c | 3 +
> 5 files changed, 201 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index aa8f61c..a2525d8 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -51,6 +51,9 @@ extern void compaction_defer_reset(struct zone *zone, int order,
> bool alloc_success);
> extern bool compaction_restarting(struct zone *zone, int order);
>
> +extern int kcompactd_run(int nid);
> +extern void kcompactd_stop(int nid);
> +
> #else
> static inline unsigned long try_to_compact_pages(gfp_t gfp_mask,
> unsigned int order, int alloc_flags,
> @@ -83,6 +86,14 @@ static inline bool compaction_deferred(struct zone *zone, int order)
> return true;
> }
>
> +static inline int kcompactd_run(int nid)
> +{
> + return 0;
> +}
> +static inline void kcompactd_stop(int nid)
> +{
> +}
> +
> #endif /* CONFIG_COMPACTION */
>
> #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 54d74f6..bc96a23 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -762,6 +762,10 @@ typedef struct pglist_data {
> /* Number of pages migrated during the rate limiting time interval */
> unsigned long numabalancing_migrate_nr_pages;
> #endif
> +#ifdef CONFIG_COMPACTION
> + struct task_struct *kcompactd;
> + wait_queue_head_t kcompactd_wait;
> +#endif
> } pg_data_t;
>
> #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 018f08d..fcbc093 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -17,6 +17,8 @@
> #include <linux/balloon_compaction.h>
> #include <linux/page-isolation.h>
> #include <linux/kasan.h>
> +#include <linux/kthread.h>
> +#include <linux/freezer.h>
> #include "internal.h"
>
> #ifdef CONFIG_COMPACTION
> @@ -29,6 +31,10 @@ static inline void count_compact_events(enum vm_event_item item, long delta)
> {
> count_vm_events(item, delta);
> }
> +
> +/* TODO: add tuning knob */
> +static unsigned int kcompactd_sleep_millisecs __read_mostly = 15000;
> +
> #else
> #define count_compact_event(item) do { } while (0)
> #define count_compact_events(item, delta) do { } while (0)

Leave the tuning knob out unless it is absolutely required. kcompactd
may eventually decide that a time-based heuristic for wakeups is not
enough. Minimally, add a mechanism that wakes kcompactd up in response to
allocation failures, similar to how wakeup_kswapd gets called. An
alternative would be to continue waking kswapd as normal, have kswapd
reclaim only order-0 pages as part of the reclaim/compaction phase and
wake kcompactd when the reclaim phase is complete. If kswapd decides that
no reclaim is necessary then kcompactd gets woken immediately.
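
As a rough sketch of that handover, with wakeup_kcompactd() and the
kcompactd_max_order field both invented for illustration (only
kcompactd_wait and the kcompactd task pointer exist in the posted patch):

void wakeup_kcompactd(pg_data_t *pgdat, int order)
{
	if (!pgdat->kcompactd)
		return;

	/* Record the highest order that callers recently failed on */
	if (order > pgdat->kcompactd_max_order)
		pgdat->kcompactd_max_order = order;

	if (waitqueue_active(&pgdat->kcompactd_wait))
		wake_up_interruptible(&pgdat->kcompactd_wait);
}

kswapd would call this at the end of its balancing pass, passing
kswapd_max_order, and call it immediately when it decides no reclaim is
needed.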

khugepaged would then kick kswapd through the normal mechanism and
potentially avoid direct compaction. On NUMA machines, it could keep
scanning to see if there is another node whose pages can be collapsed. On
UMA, it could just pause immediately and wait for kswapd and kcompactd to
do something useful.

There will be different opinions on periodic compaction but, to be honest,
periodic compaction could also be implemented from userspace using the
compact_node sysfs files. The risk with periodic compaction is that it
can stall applications that fault pages being migrated, including
applications that have no interest in high-order pages at all. This may
happen even though there are zero requirements for high-order pages from
anybody.
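
For reference, periodic compaction from userspace needs nothing more than
a loop over the existing sysfs file (node0 assumed here; writing any value
triggers a full compaction of that node, and the program must run as
root):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	for (;;) {
		FILE *f = fopen("/sys/devices/system/node/node0/compact", "w");

		if (f) {
			fputs("1", f);	/* any write triggers compact_node() */
			fclose(f);
		}
		sleep(15);	/* same period as the patch's work loop */
	}
	return 0;
}

So the kernel thread only earns its keep if it is smarter than this,
i.e. driven by actual demand rather than by the clock.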

> @@ -1714,4 +1720,171 @@ void compaction_unregister_node(struct node *node)
> }
> #endif /* CONFIG_SYSFS && CONFIG_NUMA */
>
> +/*
> + * Has any special work been requested of kcompactd?
> + */
> +static bool kcompactd_work_requested(pg_data_t *pgdat)
> +{
> + return false;
> +}
> +
> +static void kcompactd_do_work(pg_data_t *pgdat)
> +{
> + /*
> + * TODO: smarter decisions on how much to compact. Using pageblock
> + * order might result in no compaction, until fragmentation builds up
> + * too much. Using order -1 could be too aggressive on large zones.
> + *

You could consider using pgdat->kswapd_max_order, which is meant to help
kswapd decide what order is required by callers at the moment. Again,
kswapd does the order-0 reclaim and then wakes kcompactd with
kswapd_max_order as a parameter.
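
A hypothetical companion to the wakeup sketch earlier, consuming the
recorded order in kcompactd_do_work() instead of hardcoding
pageblock_order (kcompactd_max_order is still an invented field):

static int kcompactd_order(pg_data_t *pgdat)
{
	int order = pgdat->kcompactd_max_order;

	/* Consume the recorded demand; periodic runs use the default */
	pgdat->kcompactd_max_order = 0;

	return order > 0 ? order : pageblock_order;
}

cc.order below would then be kcompactd_order(pgdat) rather than a
constant.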

> + * With no special task, compact all zones so that a pageblock-order
> + * page is allocatable. Wake up kswapd if there's not enough free
> + * memory for compaction.
> + */
> + int zoneid;
> + struct zone *zone;
> + struct compact_control cc = {
> + .order = pageblock_order,
> + .mode = MIGRATE_SYNC,
> + .ignore_skip_hint = true,
> + };
> +
> + for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
> +
> + int suitable;
> +
> + zone = &pgdat->node_zones[zoneid];
> + if (!populated_zone(zone))
> + continue;
> +
> + suitable = compaction_suitable(zone, cc.order, 0, 0);
> +
> + if (suitable == COMPACT_SKIPPED) {
> + /*
> + * We pass order==0 to kswapd so it doesn't compact by
> + * itself. We just need enough free pages to proceed
> + * with compaction here on next kcompactd wakeup.
> + */
> + wakeup_kswapd(zone, 0, 0);
> + continue;
> + }

I think it makes more sense that kswapd kicks kcompactd than the other
way around.

Overall, I like the idea.

--
Mel Gorman
SUSE Labs