Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

From: Jim Schutt
Date: Thu Aug 09 2012 - 14:16:39 EST


On 08/09/2012 07:49 AM, Mel Gorman wrote:
> Changelog since V2
> o Capture !MIGRATE_MOVABLE pages where possible
> o Document the treatment of MIGRATE_MOVABLE pages while capturing
> o Expand changelogs
>
> Changelog since V1
> o Dropped kswapd related patch, basically a no-op and regresses if fixed (minchan)
> o Expanded changelogs a little
>
> Allocation success rates have been far lower since 3.4 due to commit
> [fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
> commit was introduced for good reasons and it was known in advance that
> the success rates would suffer but it was justified on the grounds that
> the high allocation success rates were achieved by aggressive reclaim.
> Success rates are expected to suffer even more in 3.6 due to commit
> [7db8889a: mm: have order > 0 compaction start off where it left] which
> testing has shown to severely reduce allocation success rates under load -
> to 0% in one case. There is a proposed change to that patch in this series
> and it would be ideal if Jim Schutt could retest the workload that led to
> commit [7db8889a: mm: have order > 0 compaction start off where it left].
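
(Aside, for anyone following along: my understanding of [7db8889a] is that
it caches where compaction's free-page scanner stopped, so that the next
compaction pass resumes scanning there instead of starting over from the
top of the zone. A rough sketch of that idea follows; the struct, field,
and function names below are illustrative only, not the actual kernel ones:

/*
 * Illustrative sketch of the cached-scanner idea in [7db8889a].
 * Names are made up; the real code lives in mm/compaction.c.
 */
struct zone_scan_state {
	unsigned long cached_free_pfn;	/* where the free scanner left off */
	unsigned long start_pfn;	/* first pfn of the zone */
	unsigned long end_pfn;		/* one past the last pfn */
};

static unsigned long free_scan_start(const struct zone_scan_state *zs)
{
	/* Resume from the cached position if it is still inside the zone... */
	if (zs->cached_free_pfn > zs->start_pfn &&
	    zs->cached_free_pfn < zs->end_pfn)
		return zs->cached_free_pfn;

	/* ...otherwise restart from the top of the zone. */
	return zs->end_pfn - 1;
}

The savings come from not rescanning pageblocks that were already searched
for free pages on the previous pass.)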

On my first test of this patch series on top of 3.5, I ran into an
instance of what I think is the sort of thing that patch 4/5 was
fixing. Here's what vmstat had to say during that period:

----------

2012-08-09 11:58:04.107-06:00
vmstat -w 4 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st
20 14 0 235884 576 38916072 0 0 12 17047 171 133 3 8 85 4 0
18 17 0 220272 576 38955912 0 0 86 2131838 200142 162956 12 38 31 19 0
17 9 0 244284 576 38955328 0 0 19 2179562 213775 167901 13 43 26 18 0
27 15 0 223036 576 38952640 0 0 24 2202816 217996 158390 14 47 25 15 0
17 16 0 233124 576 38959908 0 0 5 2268815 224647 165728 14 50 21 15 0
16 13 0 225840 576 38995740 0 0 52 2253829 216797 160551 14 47 23 16 0
22 13 0 260584 576 38982908 0 0 92 2196737 211694 140924 14 53 19 15 0
16 10 0 235784 576 38917128 0 0 22 2157466 210022 137630 14 54 19 14 0
12 13 0 214300 576 38923848 0 0 31 2187735 213862 142711 14 52 20 14 0
25 12 0 219528 576 38919540 0 0 11 2066523 205256 142080 13 49 23 15 0
26 14 0 229460 576 38913704 0 0 49 2108654 200692 135447 13 51 21 15 0
11 11 0 220376 576 38862456 0 0 45 2136419 207493 146813 13 49 22 16 0
36 12 0 229860 576 38869784 0 0 7 2163463 212223 151812 14 47 25 14 0
16 13 0 238356 576 38891496 0 0 67 2251650 221728 154429 14 52 20 14 0
65 15 0 211536 576 38922108 0 0 59 2237925 224237 156587 14 53 19 14 0
24 13 0 585024 576 38634024 0 0 37 2240929 229040 148192 15 61 14 10 0

2012-08-09 11:59:04.714-06:00
vmstat -w 4 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st
43 8 0 794392 576 38382316 0 0 11 20491 576 420 3 10 82 4 0
127 6 0 579328 576 38422156 0 0 21 2006775 205582 119660 12 70 11 7 0
44 5 0 492860 576 38512360 0 0 46 1536525 173377 85320 10 78 7 4 0
218 9 0 585668 576 38271320 0 0 39 1257266 152869 64023 8 83 7 3 0
101 6 0 600168 576 38128104 0 0 10 1438705 160769 68374 9 84 5 3 0
62 5 0 597004 576 38098972 0 0 93 1376841 154012 63912 8 82 7 4 0
61 11 0 850396 576 37808772 0 0 46 1186816 145731 70453 7 78 9 6 0
124 7 0 437388 576 38126320 0 0 15 1208434 149736 57142 7 86 4 3 0
204 11 0 1105816 576 37309532 0 0 20 1327833 145979 52718 7 87 4 2 0
29 8 0 751020 576 37360332 0 0 8 1405474 169916 61982 9 85 4 2 0
38 7 0 626448 576 37333244 0 0 14 1328415 174665 74214 8 84 5 3 0
23 5 0 650040 576 37134280 0 0 28 1351209 179220 71631 8 85 5 2 0
40 10 0 610988 576 37054292 0 0 104 1272527 167530 73527 7 85 5 3 0
79 22 0 2076836 576 35487340 0 0 750 1249934 175420 70124 7 88 3 2 0
58 6 0 431068 576 36934140 0 0 1000 1366234 169675 72524 8 84 5 3 0
134 9 0 574692 576 36784980 0 0 1049 1305543 152507 62639 8 84 4 4 0

2012-08-09 12:00:09.137-06:00
vmstat -w 4 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st
163 8 0 464308 576 36791368 0 0 11 22210 866 536 3 13 79 4 0
207 14 0 917752 576 36181928 0 0 712 1345376 134598 47367 7 90 1 2 0
123 12 0 685516 576 36296148 0 0 429 1386615 158494 60077 8 84 5 3 0
123 12 0 598572 576 36333728 0 0 1107 1233281 147542 62351 7 84 5 4 0
622 7 0 660768 576 36118264 0 0 557 1345548 151394 59353 7 85 4 3 0
223 11 0 283960 576 36463868 0 0 46 1107160 121846 33006 6 93 1 1 0
104 14 0 3140508 576 33522616 0 0 299 1414709 160879 51422 9 89 1 1 0
100 11 0 1323036 576 35337740 0 0 429 1637733 175817 94471 9 73 10 8 0
91 11 0 673320 576 35918084 0 0 562 1477100 157069 67951 8 83 5 4 0
35 15 0 3486592 576 32983244 0 0 384 1574186 189023 82135 9 81 5 5 0
51 16 0 1428108 576 34962112 0 0 394 1573231 160575 76632 9 76 9 7 0
55 6 0 719548 576 35621284 0 0 425 1483962 160335 79991 8 74 10 7 0
96 7 0 1226852 576 35062608 0 0 803 1531041 164923 70820 9 78 7 6 0
97 8 0 862500 576 35332496 0 0 536 1177949 155969 80769 7 74 13 7 0
23 5 0 6096372 576 30115776 0 0 367 919949 124993 81755 6 62 24 8 0
13 5 0 7427860 576 28368292 0 0 399 915331 153895 102186 6 53 32 9 0

----------

And here's a perf report, captured/displayed with
perf record -g -a sleep 10
perf report --sort symbol --call-graph fractal,5
sometime during that period just after 12:00:09, when
the run queue was > 100.

----------

Processed 0 events and LOST 1175296!

Check IO/CPU overload!

# Events: 208K cycles
#
# Overhead  Symbol
# ........  ......
#
34.63% [k] _raw_spin_lock_irqsave
|
|--97.30%-- isolate_freepages
| compaction_alloc
| unmap_and_move
| migrate_pages
| compact_zone
| compact_zone_order
| try_to_compact_pages
| __alloc_pages_direct_compact
| __alloc_pages_slowpath
| __alloc_pages_nodemask
| alloc_pages_vma
| do_huge_pmd_anonymous_page
| handle_mm_fault
| do_page_fault
| page_fault
| |
| |--87.39%-- skb_copy_datagram_iovec
| | tcp_recvmsg
| | inet_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call
| | __recv
| | |
| | --100.00%-- (nil)
| |
| --12.61%-- memcpy
--2.70%-- [...]

14.31% [k] _raw_spin_lock_irq
|
|--98.08%-- isolate_migratepages_range
| compact_zone
| compact_zone_order
| try_to_compact_pages
| __alloc_pages_direct_compact
| __alloc_pages_slowpath
| __alloc_pages_nodemask
| alloc_pages_vma
| do_huge_pmd_anonymous_page
| handle_mm_fault
| do_page_fault
| page_fault
| |
| |--83.93%-- skb_copy_datagram_iovec
| | tcp_recvmsg
| | inet_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call
| | __recv
| | |
| | --100.00%-- (nil)
| |
| --16.07%-- memcpy
--1.92%-- [...]

5.48% [k] isolate_freepages_block
|
|--99.96%-- isolate_freepages
| compaction_alloc
| unmap_and_move
| migrate_pages
| compact_zone
| compact_zone_order
| try_to_compact_pages
| __alloc_pages_direct_compact
| __alloc_pages_slowpath
| __alloc_pages_nodemask
| alloc_pages_vma
| do_huge_pmd_anonymous_page
| handle_mm_fault
| do_page_fault
| page_fault
| |
| |--86.01%-- skb_copy_datagram_iovec
| | tcp_recvmsg
| | inet_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call
| | __recv
| | |
| | --100.00%-- (nil)
| |
| --13.99%-- memcpy
--0.04%-- [...]

5.34% [.] ceph_crc32c_le
|
|--99.95%-- 0xb8057558d0065990
--0.05%-- [...]

----------

If I understand what this is telling me, the page faults taken during
skb_copy_datagram_iovec's copy to user space are what trigger the calls
to isolate_freepages_block, isolate_migratepages_range, and
isolate_freepages?

FWIW, I'm using a Chelsio T4 NIC in these hosts, with jumbo frames
and the Linux TCP stack (i.e., no stateful TCP offload).

-- Jim
