Re: [patch for-5.3 0/4] revert immediate fallback to remote hugepages

From: David Rientjes
Date: Thu Sep 26 2019 - 15:03:52 EST


On Wed, 25 Sep 2019, Michal Hocko wrote:

> I am especially interested about this part. The more I think about this
> the more I am convinced that the underlying problem really is in the pre
> mature fallback in the fast path.

I appreciate you taking the time to continue to look at this but I'm
confused about the underlying problem you're referring to: we had no
underlying problem until 5.3 was released so we need to carry patches that
will revert this behavior (we simply can't tolerate double digit memory
access latency regressions lol).

If you're referring to post-5.3 behavior, this appears to override
alloc_hugepage_direct_gfpmask() but directly in the page allocator.

static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
{
...
/*
* __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
* specified, to express a general desire to stay on the current

Your patch is setting __GFP_THISNODE for __GFP_DIRECT_RECLAIM: this
allocation will fail in the fastpath for both my case (fragmented local
node) and Andrea's case (out of memory local node). The first
get_page_from_freelist() will then succeed in the slowpath for both cases;
compaction is not tried for either.

In my case, that results in a perpetual remote access latency that we
can't tolerate. If Andrea's remote nodes are fragmented or low on memory,
his case encounters swap storms over both the local node and remote nodes.

So I'm not really sure what is solved by your patch?

We're not on 5.3, but I can try it and collect data on exactly how poorly
it performs on fragmented *hosts* (not single nodes, but really the whole
system) because swap storms were never fixed here, it was only papered
over.

> Does the almost-patch below helps your
> workload? It effectively reduces the fast path for higher order
> allocations to the local/requested node. The justification is that
> watermark check might be too strict for those requests as it is primary
> order-0 oriented. Low watermark target simply has no meaning for the
> higher order requests AFAIU. The min-low gap is giving kswapd a chance
> to balance and be more local node friendly while we do not have anything
> like that in compaction.
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ff5484fdbdf9..09036cf55fca 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4685,7 +4685,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> {
> struct page *page;
> unsigned int alloc_flags = ALLOC_WMARK_LOW;
> - gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
> + gfp_t fastpath_mask, alloc_mask; /* The gfp_t that was actually used for allocation */
> struct alloc_context ac = { };
>
> /*
> @@ -4698,7 +4698,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> }
>
> gfp_mask &= gfp_allowed_mask;
> - alloc_mask = gfp_mask;
> + fastpath_mask = alloc_mask = gfp_mask;
> if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
> return NULL;
>
> @@ -4710,8 +4710,17 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> */
> alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp_mask);
>
> - /* First allocation attempt */
> - page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
> + /*
> + * First allocation attempt. If we have a high order allocation then do not fall
> + * back to a remote node just based on the watermark check on the requested node
> + * because compaction might easily free up a requested order and then it would be
> + * better to simply go to the slow path.
> + * TODO: kcompactd should help here but nobody has woken it up unless we hit the
> + * slow path so we might need some tuning there as well.
> + */
> + if (order && (gfp_mask & __GFP_DIRECT_RECLAIM))
> + fastpath_mask |= __GFP_THISNODE;
> + page = get_page_from_freelist(fastpath_mask, order, alloc_flags, &ac);
> if (likely(page))
> goto out;