Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()

From: Chris Mason
Date: Wed Sep 16 2015 - 23:49:47 EST


On Thu, Sep 17, 2015 at 10:37:38AM +1000, Dave Chinner wrote:
> [cc Tejun]
>
> On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote:
> > On Wed, Sep 16, 2015 at 04:00:12PM -0400, Chris Mason wrote:
> > > On Wed, Sep 16, 2015 at 09:58:06PM +0200, Jan Kara wrote:
> > > > On Wed 16-09-15 11:16:21, Chris Mason wrote:
> > > > > Short version, Linus' patch still gives bigger IOs and similar perf to
> > > > > Dave's original. I should have done the blktrace runs for 60 seconds
> > > > > instead of 30, I suspect that would even out the average sizes between
> > > > > the three patches.
> > > >
> > > > Thanks for the data Chris. So I guess we are fine with what's currently in,
> > > > right?
> > >
> > > Looks like it works well to me.
> >
> > Graph looks good, though I'll confirm it on my test rig once I get
> > out from under the pile of email and other stuff that is queued up
> > after being away for a week...
>
> I ran some tests in the background while reading other email.....
>
> TL;DR: Results look really bad - not only is the plugging
> problematic, baseline writeback performance has regressed
> significantly. We need to revert the plugging changes until the
> underlying writeback performance regressions are sorted out.
>
> In more detail, these tests were run on my usual 16p/16GB RAM
> performance test VM with storage set up as described here:
>
> http://permalink.gmane.org/gmane.linux.kernel/1768786
>
> The test:
>
> $ ~/tests/fsmark-10-4-test-xfs.sh
> meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=0
> data     =                       bsize=4096   blocks=134217727500, imaxpct=1
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal log           bsize=4096   blocks=131072, version=2
>          =                       sectsz=512   sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> # ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7
> # Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015
> # Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> # Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
> # File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
> # Files info: size 4096 bytes, written with an IO size of 16384 bytes per write
> # App overhead is time in microseconds spent in the test not doing file writing related system calls.
>
> FSUse%        Count         Size    Files/sec     App Overhead
>      0        80000         4096     106938.0           543310
>      0       160000         4096     102922.7           476362
>      0       240000         4096     107182.9           538206
>      0       320000         4096     107871.7           619821
>      0       400000         4096      99255.6           622021
>      0       480000         4096     103217.8           609943
>      0       560000         4096      96544.2           640988
>      0       640000         4096     100347.3           676237
>      0       720000         4096      87534.8           483495
>      0       800000         4096      72577.5          2556920
>      0       880000         4096      97569.0           646996
>
> <RAM fills here, sustained performance is now dependent on writeback>

I think too many variables have changed here.

My numbers:

FSUse%        Count         Size    Files/sec     App Overhead
     0       160000         4096     356407.1          1458461
     0       320000         4096     368755.1          1030047
     0       480000         4096     358736.8           992123
     0       640000         4096     361912.5          1009566
     0       800000         4096     342851.4          1004152
     0       960000         4096     358357.2           996014
     0      1120000         4096     338025.8          1004412
     0      1280000         4096     354440.3           997380
     0      1440000         4096     335225.9          1000222
     0      1600000         4096     278786.1          1164962
     0      1760000         4096     268161.4          1205255
     0      1920000         4096     259158.0          1298054
     0      2080000         4096     276939.1          1219411
     0      2240000         4096     252385.1          1245496
     0      2400000         4096     280674.1          1189161
     0      2560000         4096     290155.4          1141941
     0      2720000         4096     280842.2          1179964
     0      2880000         4096     272446.4          1155527
     0      3040000         4096     268827.4          1235095
     0      3200000         4096     251767.1          1250006
     0      3360000         4096     248339.8          1235471
     0      3520000         4096     267129.9          1200834
     0      3680000         4096     257320.7          1244854
     0      3840000         4096     233540.8          1267764
     0      4000000         4096     269237.0          1216324
     0      4160000         4096     249787.6          1291767
     0      4320000         4096     256185.7          1253776
     0      4480000         4096     257849.7          1212953
     0      4640000         4096     253933.9          1181216
     0      4800000         4096     263567.2          1233937
     0      4960000         4096     255666.4          1231802
     0      5120000         4096     257083.2          1282893
     0      5280000         4096     254285.0          1229031
     0      5440000         4096     265561.6          1219472
     0      5600000         4096     266374.1          1229886
     0      5760000         4096     241003.7          1257064
     0      5920000         4096     245047.4          1298330
     0      6080000         4096     254771.7          1257241
     0      6240000         4096     254355.2          1261006
     0      6400000         4096     254800.4          1201074
     0      6560000         4096     262794.5          1234816
     0      6720000         4096     248103.0          1287921
     0      6880000         4096     231397.3          1291224
     0      7040000         4096     227898.0          1285359
     0      7200000         4096     227279.6          1296340
     0      7360000         4096     232561.5          1748248
     0      7520000         4096     231055.3          1169373
     0      7680000         4096     245738.5          1121856
     0      7840000         4096     234961.7          1147035
     0      8000000         4096     243973.0          1152202
     0      8160000         4096     246292.6          1169527
     0      8320000         4096     249433.2          1197921
     0      8480000         4096     222576.0          1253650
     0      8640000         4096     239407.5          1263257
     0      8800000         4096     246037.1          1218109
     0      8960000         4096     242306.5          1293567
     0      9120000         4096     238525.9          3745133
     0      9280000         4096     269869.5          1159541
     0      9440000         4096     266447.1          4794719
     0      9600000         4096     265748.9          1161584
     0      9760000         4096     269067.8          1149918
     0      9920000         4096     248896.2          1164112
     0     10080000         4096     261342.9          1174536
     0     10240000         4096     254778.3          1225425
     0     10400000         4096     257702.2          1211634
     0     10560000         4096     233972.5          1203665
     0     10720000         4096     232647.1          1197486
     0     10880000         4096     242320.6          1203984
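
For anyone who wants to compare the early and late parts of a run like
the one above, here is a minimal sketch that pulls the Count and
Files/sec columns out of fs_mark result rows and averages the rate over
a chosen Count range. The helper names parse_fs_mark and mean_rate are
made up for illustration and are not part of fs_mark; the three sample
rows are copied from the start of the table above.

```python
from statistics import mean


def parse_fs_mark(text):
    """Return (count, files_per_sec) tuples from fs_mark result rows."""
    rows = []
    for line in text.splitlines():
        # Tolerate mail quoting ("> ") in front of a row.
        fields = line.lstrip("> ").split()
        # Data rows have five fields: FSUse%, Count, Size, Files/sec, App Overhead.
        if len(fields) != 5:
            continue
        try:
            count = int(fields[1])
            rate = float(fields[3])
        except ValueError:
            continue  # not a data row
        rows.append((count, rate))
    return rows


def mean_rate(rows, min_count=0, max_count=float("inf")):
    """Average Files/sec over rows with min_count <= Count <= max_count."""
    selected = [rate for count, rate in rows if min_count <= count <= max_count]
    return mean(selected) if selected else None


# Hypothetical usage with the first three rows of the table above.
sample = """\
FSUse% Count Size Files/sec App Overhead
0 160000 4096 356407.1 1458461
0 320000 4096 368755.1 1030047
0 480000 4096 358736.8 992123
"""
rows = parse_fs_mark(sample)
print(len(rows), "rows, mean Files/sec:", mean_rate(rows))
```

Passing min_count and max_count makes it possible to average only the
rows before or after a chosen point in the run, which is one way to
check how much the rate drops over time.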

I can push the dirty threshold lower to try to make sure we end up in
the hard dirty limits, but none of this is going to be related to the
plugging patch. I do see lower numbers if I let the test run even
longer, but there are a lot of things in the way that can slow it down
as the filesystem gets that big.

I'll try again with lower ratios.
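
As a rough illustration of the knobs involved, assuming "ratios" here
means the vm.dirty_background_ratio and vm.dirty_ratio sysctls, the
sketch below reads their current values and estimates the size of a
ratio-based limit. The function names are hypothetical, and the estimate
is only approximate: the kernel applies the ratios to dirtyable memory
(free plus reclaimable pages), not to total RAM, so dirtyable_bytes has
to be supplied by the caller.

```python
from pathlib import Path

# Standard location of the VM sysctls on Linux.
VM_SYSCTL_DIR = Path("/proc/sys/vm")


def read_dirty_ratios(sysctl_dir=VM_SYSCTL_DIR):
    """Return the background and hard dirty ratios as integer percentages."""
    return {
        name: int((Path(sysctl_dir) / name).read_text().strip())
        for name in ("dirty_background_ratio", "dirty_ratio")
    }


def approx_dirty_limit_bytes(ratio_percent, dirtyable_bytes):
    """Rough size of a ratio-based dirty limit for the given dirtyable memory."""
    return dirtyable_bytes * ratio_percent // 100
```

On a Linux host, read_dirty_ratios() reports the values currently in
effect. Note that both files read back as 0 when the byte-based
counterparts (dirty_background_bytes, dirty_bytes) are in use instead.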

[ ... ]

> The baseline of no plugging is a full 3 minutes faster than the
> plugging behaviour of Linus' patch. The IO behaviour demonstrates
> that, sustaining between 25-30,000 IOPS and throughput of
> 130-150MB/s. Hence, while Linus' patch does change the IO patterns,
> it does not result in a performance improvement like the original
> plugging patch did.
>

How consistent is this across runs?

> So I went back and had a look at my original patch, which I've been
> using locally for a couple of years and was similar to the original
> commit. It has this description from when I last updated the perf
> numbers from testing done on 3.17:
>
> | Test VM: 16p, 16GB RAM, 2xSSD in RAID0, 500TB sparse XFS filesystem,
> | metadata CRCs enabled.
> |
> | Test:
> |
> | $ ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d
> | /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d
> | /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d
> | /mnt/scratch/6 -d /mnt/scratch/7
> |
> | Result:
> |              wall    sys     create rate      Physical write IO
> |              time    CPU     (avg files/s)    IOPS    Bandwidth
> |              -----   -----   -------------    ------  ---------
> | unpatched    5m54s   15m32s  32,500+/-2200    28,000  150MB/s
> | patched      3m19s   13m28s  52,900+/-1800     1,500  280MB/s
> | improvement  -43.8%  -13.3%     +62.7%        -94.6%   +86.6%
>
> IOWs, what we are seeing here is that the baseline writeback
> performance has regressed quite significantly since I took these
> numbers back on 3.17. I'm running on exactly the same test setup;
> the only difference is the kernel and so the current kernel baseline
> is ~20% slower than the baseline numbers I have in my patch.
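
The "improvement" row in the quoted 3.17 table can be rechecked with a
few lines of arithmetic. This is only a sketch: the helpers to_seconds
and pct_change are made up for illustration, and the inputs are the
unpatched and patched figures copied from that table.

```python
def to_seconds(mmss):
    """Convert a duration such as '5m54s' to seconds."""
    minutes, seconds = mmss.rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)


def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# (unpatched, patched) pairs from the quoted table.
results = {
    "wall time": (to_seconds("5m54s"), to_seconds("3m19s")),
    "sys CPU": (to_seconds("15m32s"), to_seconds("13m28s")),
    "create rate": (32500, 52900),
    "IOPS": (28000, 1500),
    "bandwidth": (150, 280),
}

for name, (before, after) in results.items():
    print(f"{name:12s} {pct_change(before, after):+.1f}%")
```

Rounded to one decimal, the create rate and bandwidth changes come out
at +62.8% and +86.7%, a tenth above the table's +62.7% and +86.6%, which
suggests the table truncates rather than rounds. The other three columns
match.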

All of this is in a VM; I'd much rather see this reproduced on bare metal.
I've had really consistent results with VMs in the past, but there is a
huge amount of code between 3.17 and now.

-chris