Re: [patch for-5.3 0/4] revert immediate fallback to remote hugepages

From: Mel Gorman
Date: Wed Nov 13 2019 - 06:20:51 EST


On Wed, Nov 06, 2019 at 01:32:37PM -0800, David Rientjes wrote:
> On Wed, 6 Nov 2019, Michal Hocko wrote:
>
> > > I don't see any
> > > indication that this allocation would behave any differently than the
> > > code that Andrea experienced swap storms with, but now worse if remote
> > > memory is in the same state as local memory when he's using
> > > __GFP_THISNODE.
> >
> > The primary reason for the extensive swapping was exactly the __GFP_THISNODE
> > in conjunction with an unbounded direct reclaim AFAIR.
> >
> > The whole point of Vlastimil's patch is to have an optimistic local
> > node allocation first and the full gfp context one in the fallback path.
> > If our full gfp context doesn't really work well then we can revisit
> > that of course but that should happen at alloc_hugepage_direct_gfpmask
> > level.
>
> Since the patch reverts the precaution put into the page allocator to not
> attempt reclaim if the allocation order is significantly large and the
> return value from compaction specifies it is unlikely to succeed on its
> own, I believe Vlastimil's patch will cause the same regression that
> Andrea saw if the whole host is low on memory and/or significantly
> fragmented. So the suggestion was that he test this change to make sure
> we aren't introducing a regression for his workload.

TLDR: I do not have evidence that Vlastimil's patch causes more swapping,
but more information is needed from Andrea on exactly how he's
testing this. It's not clear to me what was originally tested
and whether memory just had to be full or whether it had to be
fragmented. If fragmented, then we have to agree on what an
appropriate mechanism is for fragmenting memory. Hypothetical
kernel modules that don't exist do not count.
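
For anyone following along, the optimistic-local-then-fallback scheme
under discussion has roughly the following shape. This is a simplified
sketch, not the actual mempolicy code; the function name is made up and
the real logic differs in detail.

/*
 * Sketch only: optimistic local-node THP allocation with a
 * full-gfp-context fallback.
 */
static struct page *thp_alloc_sketch(gfp_t gfp, unsigned int order,
				     int local_node)
{
	struct page *page;

	/*
	 * Optimistic first pass: local node only, with bounded effort so
	 * a fragmented or full local node cannot trigger the unbounded
	 * reclaim that caused swap storms with plain __GFP_THISNODE.
	 */
	page = __alloc_pages_node(local_node,
				  gfp | __GFP_THISNODE | __GFP_NORETRY,
				  order);
	if (page)
		return page;

	/*
	 * Fallback: the full gfp context without __GFP_THISNODE so remote
	 * nodes may be used, where the MADV_HUGEPAGE/defrag policy decides
	 * how hard to try.
	 */
	return __alloc_pages_node(local_node, gfp, order);
}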

I put together a testcase whereby a virtual machine is deployed and
started, and then I time how long it takes to run memhog on 80% of the
guest's physical memory. I varied the size of the virtual machine and ran
the test on a 2-socket machine so that the smaller configurations fit in
a single node and the larger ones span both nodes. Before each startup, a
large file is read to fill the memory with pagecache.
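
What memhog does inside the guest amounts to little more than the
following single-threaded touch loop (an illustrative sketch, not the
actual memhog source):

/*
 * Dirty a given number of bytes of anonymous memory so the guest is
 * forced to back that much of its RAM with real pages. Sketch only.
 */
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	size_t size, off;
	char *buf;

	if (argc < 2)
		return 1;
	size = strtoull(argv[1], NULL, 0);	/* bytes to dirty */

	buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* Touch one byte per page so every page is faulted in. */
	for (off = 0; off < size; off += pagesize)
		buf[off] = 1;

	return 0;
}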

kvmstart
5.2.0 5.3.0 5.4.0-rc6 5.4.0-rc6
vanilla vanilla vanilla thptweak-v1r1
Amean 5 3.43 ( 0.00%) 3.44 ( -0.29%) 3.44 ( -0.19%) 3.39 ( 1.07%)
Amean 14 11.18 ( 0.00%) 14.59 ( -30.53%) 14.85 ( -32.80%) 10.30 ( 7.87%)
Amean 23 20.89 ( 0.00%) 18.45 ( 11.70%) 24.92 ( -19.29%) 24.88 ( -19.12%)
Amean 32 51.69 ( 0.00%) 30.07 ( 41.82%) 30.93 ( 40.16%) 47.97 ( 7.20%)
Amean 41 51.44 ( 0.00%) 29.99 * 41.71%* 60.75 ( -18.08%) 77.44 * -50.54%*
Amean 50 81.85 ( 0.00%) 60.37 ( 26.25%) 98.09 ( -19.84%) 125.59 * -53.43%*

Units are seconds to run memhog, so lower is better. "Amean 5" is for a
5G virtual machine. The target machine is 2-socket with 64G of RAM, or
roughly 32G per node, so from the 32G configuration onwards you'd expect
the virtual machine to be larger than a NUMA node. The test cuts off at a
virtual machine 80% the size of physical memory.

I used 5.2 as a baseline because 5.3 is where THP allocation behaviour
changed; that change was later reverted and then tweaked.

The results show that 5.4.0-rc6 is in general slower to fault all the
memory of the virtual machine, even before you'd expect the machine to be
larger than a NUMA node. The patch works better for smaller machines and
worse for larger machines.

5.2.0 5.3.0 5.4.0-rc6 5.4.0-rc6
vanilla vanilla vanilla thptweak-v1r1
Ops Swap Ins 1442665.00 1400209.00 1289549.00 1762479.00
Ops Swap Outs 2291255.00 1689554.00 2222256.00 2885280.00
Ops Kswapd efficiency % 98.74 91.06 75.73 49.75
Ops Kswapd velocity 7875.25 7973.52 8035.11 9129.89
Ops Direct efficiency % 99.54 98.22 94.47 87.80
Ops Direct velocity 5042.27 4982.09 7238.28 9251.50
Ops Percentage direct scans 39.03 38.46 47.39 50.33
Ops Page writes by reclaim 2291779.00 1689555.00 2222771.00 2885334.00
Ops Page writes file 524.00 1.00 515.00 54.00
Ops Page writes anon 2291255.00 1689554.00 2222256.00 2885280.00
Ops Page reclaim immediate 2548.00 76.00 367.00 710.00
Ops Sector Reads 320255172.00 309217632.00 264732588.00 269864268.00
Ops Sector Writes 63264744.00 60776604.00 62860244.00 65572608.00
Ops Page rescued immediate 0.00 0.00 0.00 0.00
Ops Slabs scanned 595876.00 246334.00 1018425.00 1390506.00
Ops Direct inode steals 49.00 5.00 0.00 8.00
Ops Kswapd inode steals 24118244.00 28126116.00 12866766.00 12943742.00
Ops Kswapd skipped wait 0.00 0.00 0.00 0.00
Ops THP fault alloc 164266.00 204790.00 188055.00 190899.00
Ops THP fault fallback 49345.00 8614.00 25454.00 22650.00
Ops THP collapse alloc 132.00 139.00 116.00 121.00
Ops THP collapse fail 4.00 0.00 1.00 2.00
Ops THP split 15789.00 5642.00 18119.00 49169.00
Ops THP split failed 0.00 0.00 0.00 0.00
Ops Compaction stalls 139794.00 52054.00 214361.00 226514.00
Ops Compaction success 19004.00 17786.00 39430.00 35922.00
Ops Compaction failures 120790.00 34268.00 174931.00 190592.00
Ops Compaction efficiency 13.59 34.17 18.39 15.86
Ops Page migrate success 11427696.00 12758554.00 22010443.00 21668849.00
Ops Page migrate failure 11559690.00 10403623.00 13514889.00 13212313.00
Ops Compaction pages isolated 23217363.00 20560760.00 46945743.00 47574686.00
Ops Compaction migrate scanned 19925995.00 16224351.00 49565044.00 58595534.00
Ops Compaction free scanned 68708150.00 47235800.00 134685089.00 153518780.00
Ops Compact scan efficiency 29.00 34.35 36.80 38.17
Ops Compaction cost 12604.40 13893.25 24274.97 23991.33
Ops Kcompactd wake 100.00 172.00 100.00 258.00
Ops Kcompactd migrate scanned 33135.00 55797.00 113344.00 948026.00
Ops Kcompactd free scanned 335005.00 310265.00 174353.00 686434.00
Ops NUMA alloc hit 98173280.00 77063265.00 82393545.00 70699591.00
Ops NUMA alloc miss 22834593.00 20306614.00 12296731.00 24085033.00
Ops NUMA interleave hit 0.00 0.00 0.00 0.00
Ops NUMA alloc local 98163641.00 77059289.00 82384101.00 70697814.00
Ops NUMA base-page range updates 53170070.00 41093095.00 62202747.00 71769257.00
Ops NUMA PTE updates 15041942.00 2726887.00 12972411.00 21418665.00
Ops NUMA PMD updates 74469.00 74934.00 96153.00 98341.00
Ops NUMA hint faults 11492208.00 2337770.00 9146157.00 16587622.00
Ops NUMA hint local faults % 6465386.00 1080902.00 7087393.00 12336942.00
Ops NUMA hint local percent 56.26 46.24 77.49 74.37
Ops NUMA pages migrated 560888.00 3004903.00 336032.00 195839.00
Ops AutoNUMA cost 57843.89 12033.59 46172.59 83444.22

There is a lot here. Both 5.4.0-rc6 and the tweak had more memory placed
locally (NUMA hint local percent). *All* kernels swapped pages in and
out, with 5.4.0-rc6 being roughly as bad as 5.2.0-vanilla and the patch
making this slightly, but not considerably, worse.

5.3.0 was generally the most successful at allocating huge pages (see THP
fault fallback), 5.4.0-rc6 was much worse at allocating huge pages, and
the tweak did not make much of a difference.

So the patch is a mix of good and bad in this case. However, the test
case has notable limitations and more information would be needed from
Andrea on exactly how he was evaluating KVM start times.

1. memhog is single threaded, which means that one node is likely to be
filled first and then spill over to the other node while the first one
reclaims. Swapping is therefore somewhat inevitable on NUMA machines
with this load. We could run multiple instances, but the results would
be very variable.

2. Reading a single large file forces reclaim but it definitely does not
fragment memory. However, we never agreed on how a machine should be
fragmented for this sort of test to be useful. Mentioning hypothetical
kernel modules that don't exist is not good enough. Creating massive
numbers of small files would fragment memory to some extent, but not
aggressively; a sketch of that approach follows this list. Doing it
while memhog is running would help, but the results would be very
variable.
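
For illustration, the small-file approach amounts to something like the
sketch below; the path and file count are arbitrary.

/*
 * Fill pagecache with many one-page files so that reclaim later frees
 * scattered order-0 pages instead of contiguous ranges. Sketch only.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char path[64], buf[4096] = { 1 };
	int i, fd;

	for (i = 0; i < 1000000; i++) {
		snprintf(path, sizeof(path), "/tmp/frag/f%d", i);
		fd = open(path, O_CREAT | O_WRONLY, 0600);
		if (fd < 0)
			break;
		/* One page per file keeps the cached pages scattered. */
		write(fd, buf, sizeof(buf));
		close(fd);
	}
	return 0;
}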

This particular test is hard to reproduce even though it's in mmtests as
configs/config-workload-kvmstart-memhog-frag-singlefile because the test
relies on KVM helper scripts to deploy, start and stop the virtual
machine. These scripts exist outside of mmtests and belong to a set of
tools I use to schedule, execute and report on tests across a range of
machines.

I also ran a THP faulting benchmark with fio running in the background
in a configuration that tends to fragment memory (mmtests config
workload-thpfioscale-madvhugepage) with ext4 as a backing filesystem for
fio.
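
The latency being measured is essentially the first touch of an
madvise(MADV_HUGEPAGE) region while fio churns the filesystem in the
background. The following is an illustrative sketch, not the mmtests
workload itself; the region and chunk sizes are arbitrary:

#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define REGION	(256UL << 20)	/* arbitrary for illustration */
#define CHUNK	(2UL << 20)	/* one PMD-sized page */

int main(void)
{
	struct timespec t0, t1;
	unsigned long off;
	char *buf;

	buf = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* Ask the fault path to back this range with huge pages. */
	madvise(buf, REGION, MADV_HUGEPAGE);

	/*
	 * The first touch of each 2M chunk either allocates a hugepage
	 * (fault-huge) or falls back to base pages (fault-base).
	 */
	for (off = 0; off < REGION; off += CHUNK) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		buf[off] = 1;
		clock_gettime(CLOCK_MONOTONIC, &t1);
		printf("%ld ns\n",
		       (t1.tv_sec - t0.tv_sec) * 1000000000L +
		       (t1.tv_nsec - t0.tv_nsec));
	}
	return 0;
}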


5.2.0 5.3.0 5.4.0-rc6 5.4.0-rc6
vanilla vanilla vanilla thptweak-v1r1
Min fault-base-5 958.00 ( 0.00%) 1980.00 (-106.68%) 1083.00 ( -13.05%) 1406.00 ( -46.76%)
Min fault-huge-5 297.00 ( 0.00%) 309.00 ( -4.04%) 431.00 ( -45.12%) 324.00 ( -9.09%)
Min fault-both-5 297.00 ( 0.00%) 309.00 ( -4.04%) 431.00 ( -45.12%) 324.00 ( -9.09%)
Amean fault-base-5 2517.81 ( 0.00%) 8886.23 *-252.94%* 4851.98 * -92.71%* 8223.64 *-226.62%*
Amean fault-huge-5 2781.67 ( 0.00%) 10397.46 *-273.78%* 3596.91 * -29.31%* 7139.44 *-156.66%*
Amean fault-both-5 2662.24 ( 0.00%) 9916.03 *-272.47%* 3990.46 * -49.89%* 7367.48 *-176.74%*
Stddev fault-base-5 2713.85 ( 0.00%) 24331.73 (-796.58%) 5530.37 (-103.78%) 27048.99 (-896.70%)
Stddev fault-huge-5 3740.46 ( 0.00%) 39529.80 (-956.82%) 5428.68 ( -45.13%) 23418.76 (-526.09%)
Stddev fault-both-5 3317.75 ( 0.00%) 35408.44 (-967.24%) 5491.27 ( -65.51%) 24229.15 (-630.29%)
CoeffVar fault-base-5 107.79 ( 0.00%) 273.81 (-154.03%) 113.98 ( -5.75%) 328.92 (-205.16%)
CoeffVar fault-huge-5 134.47 ( 0.00%) 380.19 (-182.73%) 150.93 ( -12.24%) 328.02 (-143.94%)
CoeffVar fault-both-5 124.62 ( 0.00%) 357.08 (-186.53%) 137.61 ( -10.42%) 328.87 (-163.89%)
Max fault-base-5 88873.00 ( 0.00%) 386539.00 (-334.93%) 115638.00 ( -30.12%) 486930.00 (-447.89%)
Max fault-huge-5 63735.00 ( 0.00%) 602544.00 (-845.39%) 139082.00 (-118.22%) 426777.00 (-569.61%)
Max fault-both-5 88873.00 ( 0.00%) 602544.00 (-577.98%) 139082.00 ( -56.50%) 486930.00 (-447.89%)
BAmean-50 fault-base-5 1192.71 ( 0.00%) 3101.71 (-160.06%) 2170.91 ( -82.02%) 2692.97 (-125.79%)
BAmean-50 fault-huge-5 756.99 ( 0.00%) 972.96 ( -28.53%) 1112.99 ( -47.03%) 1168.70 ( -54.39%)
BAmean-50 fault-both-5 953.65 ( 0.00%) 1455.39 ( -52.61%) 1319.56 ( -38.37%) 1375.04 ( -44.19%)
BAmean-95 fault-base-5 2109.87 ( 0.00%) 4941.87 (-134.23%) 4056.10 ( -92.24%) 4672.90 (-121.48%)
BAmean-95 fault-huge-5 2158.64 ( 0.00%) 3867.03 ( -79.14%) 2766.41 ( -28.16%) 3395.88 ( -57.32%)
BAmean-95 fault-both-5 2127.00 ( 0.00%) 4183.64 ( -96.69%) 3169.07 ( -48.99%) 3666.57 ( -72.38%)
BAmean-99 fault-base-5 2349.85 ( 0.00%) 6811.21 (-189.86%) 4512.17 ( -92.02%) 5738.18 (-144.19%)
BAmean-99 fault-huge-5 2538.90 ( 0.00%) 7109.96 (-180.04%) 3225.96 ( -27.06%) 5194.97 (-104.61%)
BAmean-99 fault-both-5 2443.68 ( 0.00%) 6952.77 (-184.52%) 3626.66 ( -48.41%) 5300.01 (-116.89%)

thpfioscale Percentage Faults Huge
5.2.0 5.3.0 5.4.0-rc6 5.4.0-rc6
vanilla vanilla vanilla thptweak-v1r1
Percentage huge-5 54.74 ( 0.00%) 68.14 ( 24.49%) 68.64 ( 25.40%) 78.97 ( 44.27%)

In this case, 5.4.0-rc6 has lower latency when allocating pages, whether
a THP is allocated or it falls back to base pages. The patch has latency
somewhere between 5.3 and 5.4-rc6, indicating that more effort is being
made to allocate huge pages. However, as expected, the allocation success
rate of the patch is higher than all the others. This indicates the
higher latency is due to the additional work done to allocate the pages.

Finally, I ran a benchmark that faults memory in a way that is expected
to generate high compaction activity.

usemem
5.2.0 5.3.0 5.4.0-rc6 5.4.0-rc6
vanilla vanilla vanilla thptweak-v1r1
Amean elsp-1 42.09 ( 0.00%) 26.59 * 36.84%* 36.75 ( 12.70%) 40.23 ( 4.42%)
Amean elsp-3 32.96 ( 0.00%) 7.48 * 77.29%* 8.73 * 73.50%* 11.49 * 65.15%*
Amean elsp-4 5.69 ( 0.00%) 5.85 * -2.81%* 5.68 ( 0.18%) 5.68 ( 0.18%)

5.2.0 5.3.0 5.4.0-rc6 5.4.0-rc6
vanilla vanilla vanilla thptweak-v1r1
Ops Swap Ins 2937652.00 0.00 695423.00 224792.00
Ops Swap Outs 3647757.00 0.00 797035.00 262654.00

This shows that 5.3.0 does not swap at all and is the fastest; 5.4.0-rc6
reintroduced swapping and the patch reduces it somewhat.

So, the patch behaves more or less as expected and I'm not seeing clear
evidence that it makes swapping more likely, as feared by David. However,
the kvm testcase is very limited and more information is needed on
exactly how Andrea was testing to induce swap storms. I can easily make
guesses, but chances are I'll guess wrong.

--
Mel Gorman
SUSE Labs