Re: [PATCH 06/10] mm: page_alloc: Distinguish between being unable to sleep, unwilling to unwilling and avoiding waking kswapd

From: Mel Gorman
Date: Fri Aug 21 2015 - 16:40:28 EST


On Fri, Aug 21, 2015 at 03:42:21PM +0200, Vlastimil Babka wrote:
> On 08/12/2015 12:45 PM, Mel Gorman wrote:
> >__GFP_WAIT has been used to identify atomic context in callers that hold
> >spinlocks or are in interrupts. They are expected to be high priority and
> >have access one of two watermarks lower than "min". __GFP_HIGH users get
> >access to the first lower watermark and can be called the "high priority
> >reserve". Atomic users and interrupts access yet another lower watermark
> >that can be called the "atomic reserve".
> >
> >Over time, callers had a requirement to not block when fallback options
> >were available. Some have abused __GFP_WAIT leading to a situation where
> >an optimisitic allocation with a fallback option can access atomic reserves.
> >
> >This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
> >cannot sleep and have no alternative. High priority users continue to use
> >__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are
> >willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers
> >that want to wake kswapd for background reclaim. __GFP_WAIT is redefined
> >as a caller that is willing to enter direct reclaim and wake kswapd for
> >background reclaim.
> >
> >This patch then converts a number of sites
> >
> >o __GFP_ATOMIC is used by callers that are high priority and have memory
> > pools for those requests. GFP_ATOMIC uses this flag. Callers with
> > interrupts disabled still automatically use the atomic reserves.
>
> Hm I can't see where the latter happens? In gfp_to_alloc_flags(),
> ALLOC_HARDER is set for __GFP_ATOMIC, or rt-tasks *not* in interrupt? What
> am I missing?
>

It was a mistake from an earlier version of the patch that was itself
buggy. I forgot to fix the changelog properly.

> >o Callers that have a limited mempool to guarantee forward progress use
> > __GFP_DIRECT_RECLAIM. bio allocations fall into this category where
> > kswapd will still be woken but atomic reserves are not used as there
> > is a one-entry mempool to guarantee progress.
> >
> >o Callers that are checking if they are non-blocking should use the
> > helper gfpflags_allows_blocking() where possible. This is because
>
> A bit subjective but gfpflags_allow_blocking() sounds better to me.
> Or shorter gfp_allows_blocking()?
>

I'll use gfpflags_allow_blocking.

> > checking for __GFP_WAIT as was done historically now can trigger false
> > positives. Some exceptions like dm-crypt.c exist where the code intent
> > is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
> > flag manipulations.
> >
> >The key hazard to watch out for is callers that removed __GFP_WAIT and
> >was depending on access to atomic reserves for inconspicuous reasons.
> >In some cases it may be appropriate for them to use __GFP_HIGH.
>
> Hm we might also have a (non-fatal) hazard of callers that directly combined
> __GFP_* flags that didn't include __GFP_WAIT, but did wake up kswapd, and
> now might be missing __GFP_KSWAPD_RECLAIM. Did you try checking for those? I
> imagine it's not a simple task...
>

I hadn't searched but there are a small number of callers that
potentially care. I fixed them.

> >Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
>
> >
> >diff --git a/Documentation/vm/balance b/Documentation/vm/balance
> >index c46e68cf9344..6f1f6fae30f5 100644
> >--- a/Documentation/vm/balance
> >+++ b/Documentation/vm/balance
> >@@ -1,12 +1,14 @@
> > Started Jan 2000 by Kanoj Sarcar <kanoj@xxxxxxx>
> >
> >-Memory balancing is needed for non __GFP_WAIT as well as for non
> >-__GFP_IO allocations.
> >+Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as
> >+well as for non __GFP_IO allocations.
> >
> >-There are two reasons to be requesting non __GFP_WAIT allocations:
> >-the caller can not sleep (typically intr context), or does not want
> >-to incur cost overheads of page stealing and possible swap io for
> >-whatever reasons.
> >+The first reason why a caller may avoid reclaim is that the caller can not
> >+sleep due to holding a spinlock or is in interrupt context. The second may
> >+be that the caller is willing to fail the allocation without incurring the
> >+overhead of page stealing. This may happen for opportunistic high-order
>
> I think "page stealing" has nowadays a different meaning in the
> anti-fragmentation context? Should it just say "reclaim"?
>

Good point, corrected.

> >+allocation requests that have order-0 fallback options. In such cases,
> >+the caller may also wish to avoid waking kswapd.
> >
> > __GFP_IO allocation requests are made to prevent file system deadlocks.
> >
> >diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> >index cba12f34ff77..100d3fbaebae 100644
> >--- a/arch/arm/mm/dma-mapping.c
> >+++ b/arch/arm/mm/dma-mapping.c
> >@@ -650,7 +650,7 @@ static void *__dma_alloc(struct device *dev, size_t size, dma_addr_t *handle,
> >
> > if (is_coherent || nommu())
> > addr = __alloc_simple_buffer(dev, size, gfp, &page);
> >- else if (!(gfp & __GFP_WAIT))
> >+ else if (gfp & __GFP_ATOMIC)
> > addr = __alloc_from_pool(size, &page);
> > else if (!dev_get_cma_area(dev))
> > addr = __alloc_remap_buffer(dev, size, gfp, prot, &page, caller, want_vaddr);
> >@@ -1369,7 +1369,7 @@ static void *arm_iommu_alloc_attrs(struct device *dev, size_t size,
> > *handle = DMA_ERROR_CODE;
> > size = PAGE_ALIGN(size);
> >
> >- if (!(gfp & __GFP_WAIT))
> >+ if (gfp & __GFP_ATOMIC)
> > return __iommu_alloc_atomic(dev, size, handle);
> >
> > /*
> >diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> >index d16a1cead23f..713d963fb96b 100644
> >--- a/arch/arm64/mm/dma-mapping.c
> >+++ b/arch/arm64/mm/dma-mapping.c
> >@@ -100,7 +100,7 @@ static void *__dma_alloc_coherent(struct device *dev, size_t size,
> > if (IS_ENABLED(CONFIG_ZONE_DMA) &&
> > dev->coherent_dma_mask <= DMA_BIT_MASK(32))
> > flags |= GFP_DMA;
> >- if (IS_ENABLED(CONFIG_DMA_CMA) && (flags & __GFP_WAIT)) {
> >+ if (IS_ENABLED(CONFIG_DMA_CMA) && (flags & __GFP_DIRECT_RECLAIM)) {
> > struct page *page;
> > void *addr;
> >
> >@@ -147,7 +147,7 @@ static void *__dma_alloc(struct device *dev, size_t size,
> >
> > size = PAGE_ALIGN(size);
> >
> >- if (!coherent && !(flags & __GFP_WAIT)) {
> >+ if (!coherent && (flags & __GFP_ATOMIC)) {
> > struct page *page = NULL;
> > void *addr = __alloc_from_pool(size, &page, flags);
> >
>
> Hmm these change the lack of __GFP_WAIT to expect __GFP_ATOMIC, so it's
> potentially one of those "key hazards" mentioned in the changelog, right?
> But here it's not just about using atomic reserves, but using a completely
> different allocation function.
> E.g. in case of arch/arm/mm/dma-mapping.c:__dma_alloc() I see it can go to
> __alloc_remap_buffer -> __dma_alloc_remap -> dma_common_contiguous_remap
> which does kmalloc(..., GFP_KERNEL) and has comment "Cannot be used in
> non-sleeping contexts".
>

I completely missed that. It needs to be a check for
__GFP_DIRECT_RECLAIM and similar in the other arm file.

> So I think callers that cannot sleep and did clear __GFP_WAIT before, are
> now dangerous unless they set __GFP_ATOMIC?
>
> >diff --git a/drivers/firewire/core-cdev.c b/drivers/firewire/core-cdev.c
> >index 2a3973a7c441..dc611c8cad10 100644
> >--- a/drivers/firewire/core-cdev.c
> >+++ b/drivers/firewire/core-cdev.c
> >@@ -486,7 +486,7 @@ static int ioctl_get_info(struct client *client, union ioctl_arg *arg)
> > static int add_client_resource(struct client *client,
> > struct client_resource *resource, gfp_t gfp_mask)
> > {
> >- bool preload = !!(gfp_mask & __GFP_WAIT);
> >+ bool preload = !!(gfp_mask & __GFP_DIRECT_RECLAIM);
>
> Use the helper here to avoid !! as a bonus?
>

Done.

> >--- a/drivers/infiniband/core/sa_query.c
> >+++ b/drivers/infiniband/core/sa_query.c
> >@@ -619,7 +619,7 @@ static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent)
> >
> > static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
> > {
> >- bool preload = !!(gfp_mask & __GFP_WAIT);
> >+ bool preload = !!(gfp_mask & __GFP_DIRECT_RECLAIM);
> > unsigned long flags;
> > int ret, id;
> >
>
> Same here.
>

Done.

> >diff --git a/drivers/usb/host/u132-hcd.c b/drivers/usb/host/u132-hcd.c
> >index d51687780b61..06badad3ab75 100644
> >--- a/drivers/usb/host/u132-hcd.c
> >+++ b/drivers/usb/host/u132-hcd.c
> >@@ -2247,7 +2247,7 @@ static int u132_urb_enqueue(struct usb_hcd *hcd, struct urb *urb,
> > {
> > struct u132 *u132 = hcd_to_u132(hcd);
> > if (irqs_disabled()) {
> >- if (__GFP_WAIT & mem_flags) {
> >+ if (__GFP_DIRECT_RECLAIM & mem_flags) {
> > printk(KERN_ERR "invalid context for function that migh"
> > "t sleep\n");
> > return -EINVAL;
>
> And here - no other flag manipulations and it would match the printk.

Fixed

> >--- a/fs/btrfs/extent_io.c
> >+++ b/fs/btrfs/extent_io.c
> >@@ -594,7 +594,7 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> > if (bits & (EXTENT_IOBITS | EXTENT_BOUNDARY))
> > clear = 1;
> > again:
> >- if (!prealloc && (mask & __GFP_WAIT)) {
> >+ if (!prealloc && (mask & __GFP_DIRECT_RECLAIM)) {
> > /*
> > * Don't care for allocation failure here because we might end
> > * up not needing the pre-allocated extent state at all, which
> >@@ -850,7 +850,7 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> >
> > bits |= EXTENT_FIRST_DELALLOC;
> > again:
> >- if (!prealloc && (mask & __GFP_WAIT)) {
> >+ if (!prealloc && (mask & __GFP_DIRECT_RECLAIM)) {
> > prealloc = alloc_extent_state(mask);
> > BUG_ON(!prealloc);
> > }
> >@@ -1076,7 +1076,7 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> > btrfs_debug_check_extent_io_range(tree, start, end);
> >
> > again:
> >- if (!prealloc && (mask & __GFP_WAIT)) {
> >+ if (!prealloc && (mask & __GFP_DIRECT_RECLAIM)) {
> > /*
> > * Best effort, don't worry if extent state allocation fails
> > * here for the first iteration. We might have a cached state
> >@@ -4265,7 +4265,7 @@ int try_release_extent_mapping(struct extent_map_tree *map,
> > u64 start = page_offset(page);
> > u64 end = start + PAGE_CACHE_SIZE - 1;
> >
> >- if ((mask & __GFP_WAIT) &&
> >+ if ((mask & __GFP_DIRECT_RECLAIM) &&
> > page->mapping->host->i_size > 16 * 1024 * 1024) {
> > u64 len;
> > while (start <= end) {
>
> Why not here as well.
>

Done

> >--- a/kernel/smp.c
> >+++ b/kernel/smp.c
> >@@ -669,7 +669,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
> > cpumask_var_t cpus;
> > int cpu, ret;
> >
> >- might_sleep_if(gfp_flags & __GFP_WAIT);
> >+ might_sleep_if(gfp_flags & __GFP_DIRECT_RECLAIM);
> >
> > if (likely(zalloc_cpumask_var(&cpus, (gfp_flags|__GFP_NOWARN)))) {
> > preempt_disable();
> >diff --git a/lib/idr.c b/lib/idr.c
> >index 5335c43adf46..e5118fc82961 100644
> >--- a/lib/idr.c
> >+++ b/lib/idr.c
> >@@ -399,7 +399,7 @@ void idr_preload(gfp_t gfp_mask)
> > * allocation guarantee. Disallow usage from those contexts.
> > */
> > WARN_ON_ONCE(in_interrupt());
> >- might_sleep_if(gfp_mask & __GFP_WAIT);
> >+ might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
> >
> > preempt_disable();
> >
> >@@ -453,7 +453,7 @@ int idr_alloc(struct idr *idr, void *ptr, int start, int end, gfp_t gfp_mask)
> > struct idr_layer *pa[MAX_IDR_LEVEL + 1];
> > int id;
> >
> >- might_sleep_if(gfp_mask & __GFP_WAIT);
> >+ might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
> >
> > /* sanity checks */
> > if (WARN_ON_ONCE(start < 0))
> >diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> >index f9ebe1c82060..cc5fdc3fb734 100644
> >--- a/lib/radix-tree.c
> >+++ b/lib/radix-tree.c
> >@@ -188,7 +188,7 @@ radix_tree_node_alloc(struct radix_tree_root *root)
> > * preloading in the interrupt anyway as all the allocations have to
> > * be atomic. So just do normal allocation when in interrupt.
> > */
> >- if (!(gfp_mask & __GFP_WAIT) && !in_interrupt()) {
> >+ if (!(gfp_mask & __GFP_DIRECT_RECLAIM) && !in_interrupt()) {
> > struct radix_tree_preload *rtp;
> >
> > /*
>
> These too?
>

Yep

> >diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> >index dac5bf59309d..2056d16807de 100644
> >--- a/mm/backing-dev.c
> >+++ b/mm/backing-dev.c
> >@@ -632,7 +632,7 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
> > {
> > struct bdi_writeback *wb;
> >
> >- might_sleep_if(gfp & __GFP_WAIT);
> >+ might_sleep_if(gfp & __GFP_DIRECT_RECLAIM);
> >
> > if (!memcg_css->parent)
> > return &bdi->wb;
>
> ditto
>

Indeed.

> >--- a/mm/page_alloc.c
> >+++ b/mm/page_alloc.c
> >@@ -2143,7 +2143,7 @@ static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
> > return false;
> > if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
> > return false;
> >- if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_WAIT))
> >+ if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
>
> Should __GFP_ATOMIC really be here?
>

I felt it was safer because it felt in line with the intent of
alloc.ignore_gfp_wait.

> >diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> >index 197c3f59ecbf..c5fcdd6f85b7 100644
> >--- a/net/sctp/associola.c
> >+++ b/net/sctp/associola.c
> >@@ -1588,7 +1588,7 @@ int sctp_assoc_lookup_laddr(struct sctp_association *asoc,
> > /* Set an association id for a given association */
> > int sctp_assoc_set_id(struct sctp_association *asoc, gfp_t gfp)
> > {
> >- bool preload = !!(gfp & __GFP_WAIT);
> >+ bool preload = !!(gfp & __GFP_DIRECT_RECLAIM);
> > int ret;
> >
> > /* If the id is already assigned, keep it. */
>
> helper?
>

Yes. Thanks very much

--
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/