Re: [RFC][PATCH] mm: cut down __GFP_NORETRY page allocationfailures

From: Wu Fengguang
Date: Thu Apr 28 2011 - 22:29:13 EST


> Test results:
>
> - the failure rate is pretty sensible to the page reclaim size,
> from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
>
> - the IPIs are reduced by over 100 times

It's reduced by 500 times indeed.

CAL: 220449 220246 220372 220558 220251 219740 220043 219968 Function call interrupts
CAL: 93 463 410 540 298 282 272 306 Function call interrupts

> base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
> -------------------------------------------------------------------------------
> nr_alloc_fail 10496
> allocstall 1576602

> patched (WMARK_MIN)
> -------------------
> nr_alloc_fail 704
> allocstall 105551

> patched (WMARK_HIGH)
> --------------------
> nr_alloc_fail 282
> allocstall 53860

> this patch (WMARK_HIGH, limited scan)
> -------------------------------------
> nr_alloc_fail 276
> allocstall 54034

There is a bad side effect though: the much reduced "allocstall" means
each direct reclaim will take much more time to complete. A simple solution
is to terminate direct reclaim after 10ms. I noticed that an 100ms
time threshold can reduce the reclaim latency from 621ms to 358ms.
Further lowering the time threshold to 20ms does not help reducing the
real latencies though.

However the very subjective perception is, in such heavy 1000-dd
workload, the reduced reclaim latency hardly improves the overall
responsiveness.

base kernel
-----------

start time: 243
total time: 529

wfg@fat ~% getdelays -dip 3971
print delayacct stats ON
printing IO accounting
PID 3971


CPU count real total virtual total delay total
961 3176517096 3158468847 313952766099
IO count delay total delay average
2 181251847 60ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
1205 38120615476 31ms
dd: read=16384, write=0, cancelled_write=0
wfg@fat ~% getdelays -dip 3383
print delayacct stats ON
printing IO accounting
PID 3383


CPU count real total virtual total delay total
1270 4206360536 4181445838 358641985177
IO count delay total delay average
0 0 0ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
1606 39897314399 24ms
dd: read=0, write=0, cancelled_write=0

no time limit
-------------
wfg@fat ~% getdelays -dip `pidof dd`
print delayacct stats ON
printing IO accounting
PID 9609


CPU count real total virtual total delay total
865 2792575464 2779071029 235345541230
IO count delay total delay average
4 300247552 60ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
32 20504634169 621ms
dd: read=106496, write=0, cancelled_write=0

100ms limit
-----------

start time: 288
total time: 514
nr_alloc_fail 1269
allocstall 128915

wfg@fat ~% getdelays -dip `pidof dd`
print delayacct stats ON
printing IO accounting
PID 5077


CPU count real total virtual total delay total
937 2949551600 2935087806 207877301298
IO count delay total delay average
1 151891691 151ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
71 25475514278 358ms
dd: read=507904, write=0, cancelled_write=0

PID 5101


CPU count real total virtual total delay total
1201 3827418144 3805399187 221075772599
IO count delay total delay average
4 300331997 60ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
94 31996779648 336ms
dd: read=618496, write=0, cancelled_write=0

nr_alloc_fail 937
allocstall 128684

slabs_scanned 63616
kswapd_steal 4616011
kswapd_inodesteal 5
kswapd_low_wmark_hit_quickly 5394
kswapd_high_wmark_hit_quickly 2826
kswapd_skip_congestion_wait 0
pageoutrun 36679

20ms limit
----------

start time: 294
total time: 516
nr_alloc_fail 1662
allocstall 132101

CPU count real total virtual total delay total
839 2750581848 2734464704 198489159459
IO count delay total delay average
1 43566814 43ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
95 35234061367 370ms
dd: read=20480, write=0, cancelled_write=0

test script
-----------
tic=$(date +'%s')

for i in `seq 1000`
do
truncate -s 1G /fs/sparse-$i
dd if=/fs/sparse-$i of=/dev/null &>/dev/null &
done

tac=$(date +'%s')
echo start time: $((tac-tic))

wait

tac=$(date +'%s')
echo total time: $((tac-tic))

egrep '(nr_alloc_fail|allocstall)' /proc/vmstat

Thanks,
Fengguang
---
Subject: mm: limit direct reclaim delays
Date: Fri Apr 29 09:04:11 CST 2011

Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
---
mm/vmscan.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)

--- linux-next.orig/mm/vmscan.c 2011-04-29 09:02:42.000000000 +0800
+++ linux-next/mm/vmscan.c 2011-04-29 09:04:10.000000000 +0800
@@ -2037,6 +2037,7 @@ static unsigned long do_try_to_free_page
struct zone *zone;
unsigned long writeback_threshold;
unsigned long min_reclaim = sc->nr_to_reclaim;
+ unsigned long start_time = jiffies;

get_mems_allowed();
delayacct_freepages_start();
@@ -2070,11 +2071,14 @@ static unsigned long do_try_to_free_page
}
}
total_scanned += sc->nr_scanned;
- if (sc->nr_reclaimed >= min_reclaim &&
- total_scanned > 2 * sc->nr_to_reclaim)
- goto out;
- if (sc->nr_reclaimed >= sc->nr_to_reclaim)
- goto out;
+ if (sc->nr_reclaimed >= min_reclaim) {
+ if (sc->nr_reclaimed >= sc->nr_to_reclaim)
+ goto out;
+ if (total_scanned > 2 * sc->nr_to_reclaim)
+ goto out;
+ if (jiffies - start_time > HZ / 100)
+ goto out;
+ }

/*
* Try to write back as many pages as we just scanned. This
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/