Re: [RFC PATCH 3/4] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL

From: Tetsuo Handa
Date: Wed Mar 08 2017 - 06:24:00 EST


On 2017/03/08 0:48, Michal Hocko wrote:
> From: Michal Hocko <mhocko@xxxxxxxx>
>
> KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently,
> so it relied on the default page allocator behavior for the given set
> of flags. This means that small allocations actually never failed.
>
> Now that we have the __GFP_RETRY_MAYFAIL flag, which works independently
> of the allocation request size, we can map KM_MAYFAIL to it. The allocator
> will try as hard as it can to fulfill the request but will eventually fail
> if no progress can be made.
>
> Cc: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
> ---
> fs/xfs/kmem.h | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index ae08cfd9552a..ac80a4855c83 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -54,6 +54,16 @@ kmem_flags_convert(xfs_km_flags_t flags)
>  			lflags &= ~__GFP_FS;
>  	}
>
> +	/*
> +	 * Default page/slab allocator behavior is to retry for ever
> +	 * for small allocations. We can override this behavior by using
> +	 * __GFP_RETRY_MAYFAIL which will tell the allocator to retry as long
> +	 * as it is feasible but rather fail than retry for ever for all
> +	 * request sizes.
> +	 */
> +	if (flags & KM_MAYFAIL)
> +		lflags |= __GFP_RETRY_MAYFAIL;

I don't see the advantage of supporting both __GFP_NORETRY and __GFP_RETRY_MAYFAIL here.
kmem_flags_convert() could always set __GFP_NORETRY, because unless KM_MAYFAIL is set
the callers already use an open-coded __GFP_NOFAIL-style retry loop (with a possible
allocation lockup warning).
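
For reference, the open-coded loop I mean looks roughly like this (paraphrased
and abridged from fs/xfs/kmem.c of this era; not a verbatim quote):

	void *
	kmem_alloc(size_t size, xfs_km_flags_t flags)
	{
		int	retries = 0;
		gfp_t	lflags = kmem_flags_convert(flags);
		void	*ptr;

		do {
			ptr = kmalloc(size, lflags);
			/* KM_MAYFAIL/KM_NOSLEEP callers are allowed to see NULL... */
			if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
				return ptr;
			/* ...everybody else loops, with a periodic lockup warning. */
			if (!(++retries % 100))
				xfs_err(NULL,
			"possible memory allocation deadlock in %s (mode:0x%x)",
					__func__, lflags);
			congestion_wait(BLK_RW_ASYNC, HZ/50);
		} while (1);
	}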

> +
>  	if (flags & KM_ZERO)
>  		lflags |= __GFP_ZERO;
>
>

Well, commit 9a67f6488eca926f ("mm: consolidate GFP_NOFAIL checks in the
allocator slowpath") unexpectedly changed the allocator to always give up
without using memory reserves (unless __GFP_NOFAIL is set) if TIF_MEMDIE is
set on the current thread while it is inside __alloc_pages_may_oom()
(precisely speaking, if TIF_MEMDIE is set while the current thread is after
the

	if (gfp_pfmemalloc_allowed(gfp_mask))
		alloc_flags = ALLOC_NO_WATERMARKS;

line and before the

	/* Avoid allocations with no watermarks from looping endlessly */
	if (test_thread_flag(TIF_MEMDIE))
		goto nopage;

line, which is likely always the case); but this is off-topic for this thread.

The lines that are executed only when __GFP_RETRY_MAYFAIL is set rather than
__GFP_NORETRY are:

	/* Do not loop if specifically requested */
	if (gfp_mask & __GFP_NORETRY)
		goto nopage;

	/*
	 * Do not retry costly high order allocations unless they are
	 * __GFP_RETRY_MAYFAIL
	 */
	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_RETRY_MAYFAIL))
		goto nopage;

	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
				 did_some_progress > 0, &no_progress_loops))
		goto retry;

	/*
	 * It doesn't make any sense to retry for the compaction if the order-0
	 * reclaim is not able to make any progress because the current
	 * implementation of the compaction depends on the sufficient amount
	 * of free memory (see __compaction_suitable)
	 */
	if (did_some_progress > 0 &&
			should_compact_retry(ac, order, alloc_flags,
				compact_result, &compact_priority,
				&compaction_retries))
		goto retry;

	/*
	 * It's possible we raced with cpuset update so the OOM would be
	 * premature (see below the nopage: label for full explanation).
	 */
	if (read_mems_allowed_retry(cpuset_mems_cookie))
		goto retry_cpuset;

	/* Reclaim has failed us, start killing things */
	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
	if (page)
		goto got_pg;

__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
	const struct alloc_context *ac, unsigned long *did_some_progress)
{
	struct oom_control oc = {
		.zonelist = ac->zonelist,
		.nodemask = ac->nodemask,
		.memcg = NULL,
		.gfp_mask = gfp_mask,
		.order = order,
	};
	struct page *page;

	*did_some_progress = 0;

	/*
	 * Acquire the oom lock. If that fails, somebody else is
	 * making progress for us.
	 */
	if (!mutex_trylock(&oom_lock)) {
		*did_some_progress = 1;
		schedule_timeout_uninterruptible(1);
		return NULL;
	}

	/*
	 * Go through the zonelist yet one more time, keep very high watermark
	 * here, this is only to catch a parallel oom killing, we must fail if
	 * we're still under heavy pressure.
	 */
	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
					ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
	if (page)
		goto out;

	/* Coredumps can quickly deplete all memory reserves */
	if (current->flags & PF_DUMPCORE)
		goto out;
	/* The OOM killer will not help higher order allocs */
	if (order > PAGE_ALLOC_COSTLY_ORDER)
		goto out;
	/*
	 * We have already exhausted all our reclaim opportunities without any
	 * success so it is time to admit defeat. We will skip the OOM killer
	 * because it is very likely that the caller has a more reasonable
	 * fallback than shooting a random task.
	 */
	if (gfp_mask & __GFP_RETRY_MAYFAIL)
		goto out;
where both __GFP_NORETRY and __GFP_RETRY_MAYFAIL are checked after direct
reclaim and compaction have failed. __GFP_RETRY_MAYFAIL retries optimistically
whenever one of should_reclaim_retry(), should_compact_retry() or
read_mems_allowed_retry() returns true, or mutex_trylock(&oom_lock) in
__alloc_pages_may_oom() returns 0. If !__GFP_FS allocation requests keep
holding oom_lock against each other, __GFP_RETRY_MAYFAIL allocation requests
(which are likely !__GFP_FS allocation requests, since __GFP_FS allocation
requests are being blocked on direct reclaim) can be blocked for an
uncontrollable duration without making any progress. It seems to me that the
difference between __GFP_NORETRY and __GFP_RETRY_MAYFAIL is not useful.
Rather, the caller can set __GFP_NORETRY and implement whatever retry policy
it wants (e.g. set __GFP_HIGH upon the first timeout, give up upon the second
timeout), as in the sketch below.