Re: [RFC][PATCH] mm: cut down __GFP_NORETRY page allocation failures

From: Dave Young
Date: Tue May 03 2011 - 22:32:10 EST


On Wed, May 4, 2011 at 9:56 AM, Dave Young <hidave.darkstar@xxxxxxxxx> wrote:
> On Thu, Apr 28, 2011 at 9:36 PM, Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
>> Concurrent page allocations are suffering from high failure rates.
>>
>> On a 8p, 3GB ram test box, when reading 1000 sparse files of size 1GB,
>> the page allocation failures are
>>
>> nr_alloc_fail 733       # interleaved reads by 1 single task
>> nr_alloc_fail 11799     # concurrent reads by 1000 tasks
>>
>> The concurrent read test script is:
>>
>>        for i in `seq 1000`
>>        do
>>                truncate -s 1G /fs/sparse-$i
>>                dd if=/fs/sparse-$i of=/dev/null &
>>        done
>>
>
> With a Core2 Duo, 3 GB RAM and no swap partition, I cannot reproduce the alloc failures.

Unsetting CONFIG_SCHED_AUTOGROUP and CONFIG_CGROUP_SCHED seems to
affect the test results; now I see several nr_alloc_fail events (dd has
not finished yet):

dave@darkstar-32:$ grep fail /proc/vmstat
nr_alloc_fail 4
compact_pagemigrate_failed 0
compact_fail 3
htlb_buddy_alloc_fail 0
thp_collapse_alloc_fail 4

So the result is related to the CPU scheduler.

>
>> In order for get_page_from_freelist() to get a free page,
>>
>> (1) try_to_free_pages() should use a much higher .nr_to_reclaim than the
>>     current SWAP_CLUSTER_MAX=32, in order to draw the zone out of the
>>     possible low watermark state as well as fill the pcp with enough free
>>     pages to overflow its high watermark.
>>
>> (2) the get_page_from_freelist() _after_ direct reclaim should use a lower
>>     watermark than its normal invocations, so that it can reasonably
>>     "reserve" some free pages for itself and prevent other concurrent
>>     page allocators from stealing all its reclaimed pages.
>>
>> Some notes:
>>
>> - commit 9ee493ce ("mm: page allocator: drain per-cpu lists after direct
>>   reclaim allocation fails") has the same target, however it is obviously
>>   costly and less effective. It seems cleaner to just remove the retry
>>   and drain code than to retain it.
>>
>> - it's a bit hacky to reclaim more than the requested pages inside
>>   do_try_to_free_pages(), and it won't help cgroup for now
>>
>> - it only aims to reduce failures when there are plenty of reclaimable
>>   pages, so it stops the opportunistic reclaim once it has scanned twice
>>   the requested pages
>> Test results:
>>
>> - the failure rate is quite sensitive to the page reclaim size, going
>>   from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
>>
>> - the IPIs are reduced by over 100 times
>>
>> base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
>> -------------------------------------------------------------------------------
>> nr_alloc_fail 10496
>> allocstall 1576602
>>
>> slabs_scanned 21632
>> kswapd_steal 4393382
>> kswapd_inodesteal 124
>> kswapd_low_wmark_hit_quickly 885
>> kswapd_high_wmark_hit_quickly 2321
>> kswapd_skip_congestion_wait 0
>> pageoutrun 29426
>>
>> CAL:     220449     220246     220372     220558     220251     219740     220043     219968   Function call interrupts
>>
>> LOC:     536274     532529     531734     536801     536510     533676     534853     532038   Local timer interrupts
>> RES:       3032       2128       1792       1765       2184       1703       1754       1865   Rescheduling interrupts
>> TLB:        189         15         13         17         64        294         97         63   TLB shootdowns
>
> Could you tell me how to get the above info?
>
>>
>> patched (WMARK_MIN)
>> -------------------
>> nr_alloc_fail 704
>> allocstall 105551
>>
>> slabs_scanned 33280
>> kswapd_steal 4525537
>> kswapd_inodesteal 187
>> kswapd_low_wmark_hit_quickly 4980
>> kswapd_high_wmark_hit_quickly 2573
>> kswapd_skip_congestion_wait 0
>> pageoutrun 35429
>>
>> CAL:         93        286        396        754        272        297        275        281   Function call interrupts
>>
>> LOC:     520550     517751     517043     522016     520302     518479     519329     517179   Local timer interrupts
>> RES:       2131       1371       1376       1269       1390       1181       1409       1280   Rescheduling interrupts
>> TLB:        280         26         27         30         65        305        134         75   TLB shootdowns
>>
>> patched (WMARK_HIGH)
>> --------------------
>> nr_alloc_fail 282
>> allocstall 53860
>>
>> slabs_scanned 23936
>> kswapd_steal 4561178
>> kswapd_inodesteal 0
>> kswapd_low_wmark_hit_quickly 2760
>> kswapd_high_wmark_hit_quickly 1748
>> kswapd_skip_congestion_wait 0
>> pageoutrun 32639
>>
>> CAL:         93        463        410        540        298        282        272        306   Function call interrupts
>>
>> LOC:     513956     510749     509890     514897     514300     512392     512825     510574   Local timer interrupts
>> RES:       1174       2081       1411       1320       1742       2683       1380       1230   Rescheduling interrupts
>> TLB:        274         21         19         22         57        317        131         61   TLB shootdowns
>>
>> this patch (WMARK_HIGH, limited scan)
>> -------------------------------------
>> nr_alloc_fail 276
>> allocstall 54034
>>
>> slabs_scanned 24320
>> kswapd_steal 4507482
>> kswapd_inodesteal 262
>> kswapd_low_wmark_hit_quickly 2638
>> kswapd_high_wmark_hit_quickly 1710
>> kswapd_skip_congestion_wait 0
>> pageoutrun 32182
>>
>> CAL:         69        443        421        567        273        279        269        334   Function call interrupts
>>
>> LOC:     514736     511698     510993     514069     514185     512986     513838     511229   Local timer interrupts
>> RES:       2153       1556       1126       1351       3047       1554       1131       1560   Rescheduling interrupts
>> TLB:        209         26         20         15         71        315        117         71   TLB shootdowns
>>
>> CC: Mel Gorman <mel@xxxxxxxxxxxxxxxxxx>
>> Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
>> ---
>>  mm/page_alloc.c |   17 +++--------------
>>  mm/vmscan.c     |    6 ++++++
>>  2 files changed, 9 insertions(+), 14 deletions(-)
>> --- linux-next.orig/mm/vmscan.c	2011-04-28 21:16:16.000000000 +0800
>> +++ linux-next/mm/vmscan.c	2011-04-28 21:28:57.000000000 +0800
>> @@ -1978,6 +1978,8 @@ static void shrink_zones(int priority, s
>>  				continue;
>>  			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
>>  				continue;	/* Let kswapd poll it */
>> +			sc->nr_to_reclaim = max(sc->nr_to_reclaim,
>> +						zone->watermark[WMARK_HIGH]);
>>  		}
>>
>>  		shrink_zone(priority, zone, sc);
>> @@ -2034,6 +2036,7 @@ static unsigned long do_try_to_free_page
>>  	struct zoneref *z;
>>  	struct zone *zone;
>>  	unsigned long writeback_threshold;
>> +	unsigned long min_reclaim = sc->nr_to_reclaim;
>>
>>  	get_mems_allowed();
>>  	delayacct_freepages_start();
>> @@ -2067,6 +2070,9 @@ static unsigned long do_try_to_free_page
>>  			}
>>  		}
>>  		total_scanned += sc->nr_scanned;
>> +		if (sc->nr_reclaimed >= min_reclaim &&
>> +		    total_scanned > 2 * sc->nr_to_reclaim)
>> +			goto out;
>>  		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
>>  			goto out;
>>
>> --- linux-next.orig/mm/page_alloc.c	2011-04-28 21:16:16.000000000 +0800
>> +++ linux-next/mm/page_alloc.c	2011-04-28 21:16:18.000000000 +0800
>> @@ -1888,9 +1888,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
>>  	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
>>  	int migratetype, unsigned long *did_some_progress)
>>  {
>> -	struct page *page = NULL;
>> +	struct page *page;
>>  	struct reclaim_state reclaim_state;
>> -	bool drained = false;
>>
>>  	cond_resched();
>>
>> @@ -1912,22 +1911,12 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
>>  	if (unlikely(!(*did_some_progress)))
>>  		return NULL;
>>
>> -retry:
>> +	alloc_flags |= ALLOC_HARDER;
>> +
>>  	page = get_page_from_freelist(gfp_mask, nodemask, order,
>>  					zonelist, high_zoneidx,
>>  					alloc_flags, preferred_zone,
>>  					migratetype);
>> -
>> -	/*
>> -	 * If an allocation failed after direct reclaim, it could be because
>> -	 * pages are pinned on the per-cpu lists. Drain them and try again
>> -	 */
>> -	if (!page && !drained) {
>> -		drain_all_pages();
>> -		drained = true;
>> -		goto retry;
>> -	}
>> -
>>  	return page;
>>  }
>>
>>
>
>
>
> --
> Regards
> dave
>



--
Regards
dave
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/