Re: [RFC PATCH 2/4] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic

From: NeilBrown
Date: Wed May 24 2017 - 21:21:19 EST


On Tue, Mar 07 2017, Michal Hocko wrote:

> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 2bfcfd33e476..60af7937c6f2 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -25,7 +25,7 @@ struct vm_area_struct;
> #define ___GFP_FS 0x80u
> #define ___GFP_COLD 0x100u
> #define ___GFP_NOWARN 0x200u
> -#define ___GFP_REPEAT 0x400u
> +#define ___GFP_RETRY_MAYFAIL 0x400u
> #define ___GFP_NOFAIL 0x800u
> #define ___GFP_NORETRY 0x1000u
> #define ___GFP_MEMALLOC 0x2000u
> @@ -136,26 +136,38 @@ struct vm_area_struct;
> *
> * __GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
> *
> - * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
> - * _might_ fail. This depends upon the particular VM implementation.
> + * The default allocator behavior depends on the request size. We have a concept
> + * of so called costly allocations (with order > PAGE_ALLOC_COSTLY_ORDER).

Boundary conditions is one of my pet peeves....
The description here suggests that an allocation of
"1<<PAGE_ALLOC_COSTLY_ORDER" pages is not "costly", which is
inconsistent with how those words would normally be interpreted.

Looking at the code I see comparisons like:

order < PAGE_ALLOC_COSTLY_ORDER
or
order >= PAGE_ALLOC_COSTLY_ORDER

which supports the documented (but incoherent) meaning.

But I also see:

order = max_t(int, PAGE_ALLOC_COSTLY_ORDER - 1, 0);

which looks like it is trying to perform the largest non-costly
allocation, but is making a smaller allocation than necessary.

I would *really* like it if the constant actually meant what its name
implied.

PAGE_ALLOC_MAX_NON_COSTLY
??

> + * !costly allocations are too essential to fail so they are implicitly
> + * non-failing (with some exceptions like OOM victims might fail) by default while
> + * costly requests try to be not disruptive and back off even without invoking
> + * the OOM killer. The following three modifiers might be used to override some of
> + * these implicit rules
> + *
> + * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
> + * return NULL when direct reclaim and memory compaction have failed to allow
> + * the allocation to succeed. The OOM killer is not called with the current
> + * implementation. This is a default mode for costly allocations.

The name here is "NORETRY", but the text says "not retry indefinitely".
So does it retry or not?
I would assuming it "tried" once, and only once.
However it could be that a "try" is not a simple well defined task.
Maybe some escalation happens on the 2nd or 3rd "try", so they are really
trying different things?

The word "indefinitely" implies there is a definite limit. It might
help to say what that is, or at least say that it is small.

Also, this documentation is phrased to tell the VM implementor what is,
or is not, allowed. Most readers will be more interested is the
responsibilities of the caller.

__GFP_NORETRY: The VM implementation will not retry after all
reasonable avenues for finding free memory have been pursued. The
implementation may sleep (i.e. call 'schedule()'), but only while
waiting for another task to perform some specific action.
The caller must handle failure. This flag is suitable when failure can
easily be handled at small cost, such as reduced throughput.


> + *
> + * __GFP_RETRY_MAYFAIL: Try hard to allocate the memory, but the allocation attempt
> + * _might_ fail. All viable forms of memory reclaim are tried before the fail.
> + * The OOM killer is excluded because this would be too disruptive. This can be
> + * used to override non-failing default behavior for !costly requests as well as
> + * fortify costly requests.

What does "Try hard" mean?
In part, it means "retry everything a few more times", I guess in the
hope that something happened in the mean time.
It also seems to mean waiting for compaction to happen, which I
guess is only relevant for >PAGE_SIZE allocations?
Maybe it also means waiting for page-out to complete.
So the summary would be that it waits for a little while, hoping for a
miracle.

__GFP_RETRY_MAYFAIL: The VM implementation will retry memory reclaim
procedures that have previously failed if there is some indication
that progress has been made else where. It can wait for other
tasks to attempt high level approaches to freeing memory such as
compaction (which removed fragmentation) and page-out.
There is still a definite limit to the number of retries, but it is
a larger limit than with __GFP_NORERY.
Allocations with this flag may fail, but only when there is
genuinely little unused memory. While these allocations do not
directly trigger the OOM killer, their failure indicates that the
system is likely to need to use the OOM killer soon.
The caller must handle failure, but can reasonably do so by failing
a higher-level request, or completing it only in a much less
efficient manner.
If the allocation does fail, and the caller is in a position to
free some non-essential memory, doing so could benefit the system
as a whole.



> *
> * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> * cannot handle allocation failures. New users should be evaluated carefully
> * (and the flag should be used only when there is no reasonable failure
> * policy) but it is definitely preferable to use the flag rather than
> - * opencode endless loop around allocator.
> - *
> - * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
> - * return NULL when direct reclaim and memory compaction have failed to allow
> - * the allocation to succeed. The OOM killer is not called with the current
> - * implementation.
> + * opencode endless loop around allocator. Using this flag for costly allocations
> + * is _highly_ discouraged.

Should this explicitly say that the OOM killer might be invoked in an attempt
to satisfy this allocation? Is the OOM killer *only* invoked from
allocations with __GFP_NOFAIL ?
Maybe be extra explicit "The allocation could block indefinitely but
will never return with failure. Testing for failure is pointless.".


I've probably got several specifics wrong. I've tried to answer the
questions that I would like to see answered by the documentation. If
you can fix it up so that those questions are answered correctly, that
would be great.

Thanks,
NeilBrown

Attachment: signature.asc
Description: PGP signature