Re: upcoming kerneloops.org item: get_page_from_freelist

From: David Rientjes
Date: Tue Jun 30 2009 - 15:36:34 EST


On Tue, 30 Jun 2009, Mel Gorman wrote:

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 175a67a..5f4656e 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -230,14 +230,21 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  		/*
>  		 * This task already has access to memory reserves and is
>  		 * being killed. Don't allow any other task access to the
> -		 * memory reserve.
> +		 * memory reserve unless the current process is the one
> +		 * selected for OOM-killing. If the current process has
> +		 * been OOM-killed and we are OOM again, another process
> +		 * needs to be considered for OOM-kill
>  		 *
>  		 * Note: this may have a chance of deadlock if it gets
>  		 * blocked waiting for another task which itself is waiting
>  		 * for memory. Is there a better alternative?
>  		 */
> -		if (test_tsk_thread_flag(p, TIF_MEMDIE))
> -			return ERR_PTR(-1UL);
> +		if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
> +			if (p == current)
> +				continue;
> +			else
> +				return ERR_PTR(-1UL);
> +		}
>
>  		/*
>  		 * This is in the process of releasing memory so wait for it

This will panic the machine if current is the only user thread running or
eligible for oom kill (all others could have mm's with OOM_DISABLE set):
skipping current leaves select_bad_process() with no candidate at all, so
the oom killer panics because there is nothing left to kill.  With the
current behaviour (returning ERR_PTR(-1UL)), such tasks can exit or
kthreads can free memory, so the oom is recoverable without a panic.
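
To make that failure mode concrete, here's a minimal userspace model of the
selection loop with the proposed p == current check.  It is only a sketch:
the task struct, flag fields and abort_scan sentinel below are illustrative
stand-ins for the tasklist, TIF_MEMDIE/OOM_DISABLE and ERR_PTR(-1UL), not
kernel code.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct task {
	const char *comm;
	bool tif_memdie;	/* stand-in for TIF_MEMDIE */
	bool oom_disable;	/* stand-in for OOM_DISABLE on the task's mm */
};

static struct task abort_scan;	/* models ERR_PTR(-1UL): a victim is already exiting */

/* Rough model of select_bad_process() with the proposed "skip current" change. */
static struct task *select_bad_process(struct task *tasks, int n, struct task *cur)
{
	struct task *chosen = NULL;

	for (int i = 0; i < n; i++) {
		struct task *p = &tasks[i];

		if (p->oom_disable)
			continue;
		if (p->tif_memdie) {
			if (p == cur)
				continue;	/* proposed: look for another victim */
			return &abort_scan;	/* give the dying task time to exit */
		}
		chosen = p;			/* (badness scoring omitted) */
	}
	return chosen;
}

int main(void)
{
	/* current is the only oom-killable task and already has TIF_MEMDIE set. */
	struct task tasks[] = {
		{ "current",  true,  false },
		{ "kthreadd", false, true  },
	};
	struct task *victim = select_bad_process(tasks, 2, &tasks[0]);

	if (victim == &abort_scan)
		return 0;		/* wait for the dying task instead of killing */
	if (!victim) {
		/* This is the case the kernel cannot recover from: it panics. */
		fprintf(stderr, "panic: no killable processes left\n");
		exit(1);
	}
	printf("oom killing %s\n", victim->comm);
	return 0;
}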

The problem with this approach is that it increases the likelihood that
memory reserves will be totally depleted when several threads are
competing for them.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5d714f8..5896469 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1539,6 +1539,14 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
>  	if (gfp_mask & __GFP_NORETRY)
>  		return 0;
>
> +	/* Do not loop if OOM-killed unless __GFP_NOFAIL is specified */
> +	if (test_thread_flag(TIF_MEMDIE)) {
> +		if (gfp_mask & __GFP_NOFAIL)
> +			WARN(1, "Potential infinite loop with __GFP_NOFAIL");
> +		else
> +			return 0;
> +	}
> +

There's a second bug in the refactored page allocator: when the oom killer
is invoked and it happens to kill current, alloc_flags is never updated
because it loops back to `restart', which is past gfp_to_alloc_flags().

When that is fixed with the patch below, the gfp_mask & __GFP_NOFAIL check
will never trigger here in your scenario, where the oom killer kills
current: __alloc_pages_high_priority() will loop infinitely on such an
allocation, so should_alloc_retry() is never reached.
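
Here's a minimal userspace model of that ordering problem; the flag values
and the gfp_to_alloc_flags() stand-in below are illustrative, not the
kernel's, but the restart placement mirrors the two layouts being compared:

#include <stdbool.h>
#include <stdio.h>

#define ALLOC_WMARK_MIN		0x01
#define ALLOC_NO_WATERMARKS	0x04

static bool tif_memdie;	/* stand-in for test_thread_flag(TIF_MEMDIE) */

/* Stand-in for gfp_to_alloc_flags(): a TIF_MEMDIE task gets reserve access. */
static int gfp_to_alloc_flags(void)
{
	int flags = ALLOC_WMARK_MIN;

	if (tif_memdie)
		flags |= ALLOC_NO_WATERMARKS;
	return flags;
}

static void slowpath(bool restart_before_flags)
{
	int alloc_flags = 0;
	int attempt = 0;

	tif_memdie = false;
	if (!restart_before_flags)
		alloc_flags = gfp_to_alloc_flags();	/* 2.6.31-rc2: computed once, above restart */
restart:
	if (restart_before_flags)
		alloc_flags = gfp_to_alloc_flags();	/* patched: recomputed on every restart */

	if (++attempt == 1) {
		/* First attempt fails; the oom killer picks current and sets TIF_MEMDIE. */
		tif_memdie = true;
		goto restart;
	}
	printf("retry sees ALLOC_NO_WATERMARKS: %s\n",
	       alloc_flags & ALLOC_NO_WATERMARKS ? "yes" : "no");
}

int main(void)
{
	slowpath(false);	/* restart label after gfp_to_alloc_flags(): prints "no"  */
	slowpath(true);		/* restart label before gfp_to_alloc_flags(): prints "yes" */
	return 0;
}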

>  	/*
>  	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
>  	 * means __GFP_NOFAIL, but that may not be true in other
>

This is needed for 2.6.31-rc2.


mm: update alloc_flags after oom killer has been called

It is possible for the oom killer to select current as the task to kill.
When this happens, alloc_flags needs to be recalculated so that
ALLOC_NO_WATERMARKS is set: the thread now has TIF_MEMDIE, so the
subsequent allocation attempt may use memory reserves, provided the
allocation is not __GFP_NOMEMALLOC.

Signed-off-by: David Rientjes <rientjes@xxxxxxxxxx>
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1756,6 +1756,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
 	wake_all_kswapd(order, zonelist, high_zoneidx);
 
+restart:
 	/*
 	 * OK, we're below the kswapd watermark and have kicked background
 	 * reclaim. Now things get more complex, so set up alloc_flags according
@@ -1763,7 +1764,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 */
 	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
-restart:
 	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
--