Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()

From: Dave Chinner
Date: Mon Apr 28 2014 - 19:48:09 EST


ping?

On Wed, Apr 09, 2014 at 07:20:09PM +1000, Dave Chinner wrote:
> On Wed, Mar 26, 2014 at 08:11:13PM +0100, Andres Freund wrote:
> > Hi,
> >
> > At LSF/MM there was a slot about postgres' problems with the kernel. Our
> > #1 concern is frequent slow read()s that happen while another process
> > calls fsync(), even though we'd be perfectly fine if that fsync() took
> > ages.
> > The "conclusion" of that part was that it'd be very useful to have a
> > demonstration of the problem without needing a full-blown postgres
> > setup. I've quickly hacked something together that seems to show the
> > problem nicely.
> >
> > For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
> > and the "IO Scheduling" bit in
> > http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de
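> >
> > To give an idea of the structure: the core of the test is a set of
> > reader processes timing random read()s of a large file while another
> > process dirties a second file and periodically fsync()s it. A
> > trimmed-down sketch follows (this is not the actual ioperf.c; the
> > file names, sizes and intervals are made up for illustration):
> >
> > #define _GNU_SOURCE
> > #include <fcntl.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <time.h>
> > #include <unistd.h>
> >
> > static double now_msec(void)
> > {
> >         struct timespec ts;
> >
> >         clock_gettime(CLOCK_MONOTONIC, &ts);
> >         return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
> > }
> >
> > int main(void)
> > {
> >         char buf[8192];
> >
> >         if (fork() == 0) {
> >                 /* reader: random 8k reads, report max latency */
> >                 int fd = open("datafile", O_RDONLY);
> >                 double max = 0;
> >
> >                 for (long i = 0; ; i++) {
> >                         off_t off = (random() % (1024 * 1024)) * 8192L;
> >                         double start = now_msec();
> >                         double d;
> >
> >                         if (pread(fd, buf, sizeof(buf), off) < 0)
> >                                 break;
> >                         d = now_msec() - start;
> >                         if (d > max)
> >                                 max = d;
> >                         if (i % 10000 == 0) {
> >                                 printf("read[%d]: max: %.1f msec\n",
> >                                        (int)getpid(), max);
> >                                 max = 0;
> >                         }
> >                 }
> >                 return 0;
> >         }
> >
> >         /* writer: dirty ~800MB, then flush it all in one fsync() */
> >         int fd = open("dirtyfile", O_WRONLY | O_CREAT, 0644);
> >
> >         memset(buf, 'x', sizeof(buf));
> >         for (;;) {
> >                 for (long i = 0; i < 100000; i++)
> >                         pwrite(fd, buf, sizeof(buf), i * 8192L);
> >                 printf("starting fsync() of files\n");
> >                 fsync(fd);
> >                 printf("finished fsync() of files\n");
> >                 sleep(10);
> >         }
> > }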
> >
> > The tool's output looks like this:
> > gcc -std=c99 -Wall -ggdb ~/tmp/ioperf.c -o ioperf && ./ioperf
> > ...
> > wal[12155]: avg: 0.0 msec; max: 0.0 msec
> > commit[12155]: avg: 0.2 msec; max: 15.4 msec
> > wal[12155]: avg: 0.0 msec; max: 0.0 msec
> > read[12157]: avg: 0.2 msec; max: 9.4 msec
> > ...
> > read[12165]: avg: 0.2 msec; max: 9.4 msec
> > wal[12155]: avg: 0.0 msec; max: 0.0 msec
> > starting fsync() of files
> > finished fsync() of files
> > read[12162]: avg: 0.6 msec; max: 2765.5 msec
> >
> > So, the average read time is less than one ms (SSD, and about 50% cached
> > workload). But once another backend does the fsync(), read latency
> > skyrockets.
> >
> > A concurrent iostat shows the problem pretty clearly:
> > Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> > sda 1.00 0.00 6322.00 337.00 51.73 4.38 17.26 2.09 0.32 0.19 2.59 0.14 90.00
> > sda 0.00 0.00 6016.00 303.00 47.18 3.95 16.57 2.30 0.36 0.23 3.12 0.15 94.40
> > sda 0.00 0.00 6236.00 1059.00 49.52 12.88 17.52 5.91 0.64 0.20 3.23 0.12 88.40
> > sda 0.00 0.00 105.00 26173.00 0.89 311.39 24.34 142.37 5.42 27.73 5.33 0.04 100.00
> > sda 0.00 0.00 78.00 27199.00 0.87 324.06 24.40 142.30 5.25 11.08 5.23 0.04 100.00
> > sda 0.00 0.00 10.00 33488.00 0.11 399.05 24.40 136.41 4.07 100.40 4.04 0.03 100.00
> > sda 0.00 0.00 3819.00 10096.00 31.14 120.47 22.31 42.80 3.10 0.32 4.15 0.07 96.00
> > sda 0.00 0.00 6482.00 346.00 52.98 4.53 17.25 1.93 0.28 0.20 1.80 0.14 93.20
> >
> > While the fsync() is going on (or the kernel decides to start writing
> > out aggressively for some other reason), the amount of writes to the
> > disk increases by two orders of magnitude, with unsurprisingly
> > disastrous consequences for read() performance. We really want a way
> > to pace the writes issued to the disk more evenly.
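> >
> > To illustrate what "pacing" could look like from userspace (just a
> > sketch using Linux's sync_file_range(); not necessarily how postgres
> > would actually do it, and paced_write()/the chunk size are invented
> > for the example): start writeback of each chunk as soon as it has
> > been written, and wait for the previous chunk before writing more, so
> > dirty data never accumulates into one huge burst at fsync() time:
> >
> > #define _GNU_SOURCE
> > #include <fcntl.h>
> > #include <string.h>
> > #include <sys/types.h>
> > #include <unistd.h>
> >
> > static void paced_write(int fd, off_t total)
> > {
> >         const off_t chunk = 8 * 1024 * 1024;    /* 8MB per chunk */
> >         char buf[64 * 1024];
> >
> >         memset(buf, 'x', sizeof(buf));
> >         for (off_t off = 0; off < total; off += chunk) {
> >                 for (off_t done = 0; done < chunk; done += sizeof(buf))
> >                         pwrite(fd, buf, sizeof(buf), off + done);
> >
> >                 /* kick off async writeback of this chunk */
> >                 sync_file_range(fd, off, chunk, SYNC_FILE_RANGE_WRITE);
> >
> >                 /* wait for the previous chunk to reach the disk */
> >                 if (off >= chunk)
> >                         sync_file_range(fd, off - chunk, chunk,
> >                                         SYNC_FILE_RANGE_WAIT_BEFORE |
> >                                         SYNC_FILE_RANGE_WRITE |
> >                                         SYNC_FILE_RANGE_WAIT_AFTER);
> >         }
> >         /* data writeback is paced; metadata still needs the fsync() */
> >         fsync(fd);
> > }
> >
> > int main(void)
> > {
> >         int fd = open("dirtyfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
> >
> >         paced_write(fd, (off_t)1024 * 1024 * 1024);     /* 1GB */
> >         close(fd);
> >         return 0;
> > }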
>
> Hi Andres,
>
> I've finally dug myself out from under the backlog from LSFMM far
> enough to start testing this on my local IO performance test rig.
>
> tl;dr: I can't reproduce this peaky behaviour on my test rig.
>
> I'm running in a 16p VM with 16GB RAM (in 4 nodes via fake-numa) and
> an unmodified benchmark on a current 3.15-linus tree. All storage
> (guest and host) is XFS based, guest VMs use virtio and direct IO to
> the backing storage. The host is using noop IO scheduling.
>
> The first IO setup I ran was a 100TB XFS filesystem in the guest.
> The backing file is a sparse file on an XFS filesystem on a pair of
> 240GB SSDs (Samsung 840 EVO) in RAID 0 via DM. The SSDs are
> exported as JBOD from a RAID controller which has 1GB of FBWC. The
> guest is capable of sustaining around 65,000 random read IOPS and
> 40,000 write IOPS through this filesystem depending on the tests
> being run.
>
> The iostat output looks like this:
>
> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> vdc 0.00 0.00 1817.00 315.40 18.80 6.93 24.71 0.80 0.38 0.38 0.37 0.31 66.24
> vdc 0.00 0.00 2094.20 323.20 21.82 7.10 24.50 0.81 0.33 0.34 0.27 0.28 68.48
> vdc 0.00 0.00 1989.00 4500.20 20.50 56.64 24.34 24.82 3.82 0.43 5.32 0.12 80.16
> vdc 0.00 0.00 2019.80 320.80 20.83 7.05 24.39 0.83 0.35 0.36 0.32 0.29 69.04
> vdc 0.00 0.00 2206.60 323.20 22.57 7.10 24.02 0.87 0.34 0.34 0.33 0.28 71.92
> vdc 0.00 0.00 2437.20 329.60 25.79 7.24 24.45 0.83 0.30 0.30 0.27 0.26 71.76
> vdc 0.00 0.00 1224.40 11263.80 12.88 136.38 24.48 64.90 5.20 0.69 5.69 0.07 84.96
> vdc 0.00 0.00 2074.60 319.40 21.03 7.01 23.99 0.84 0.35 0.36 0.30 0.29 68.96
> vdc 0.00 0.00 1999.00 881.60 20.81 13.61 24.47 4.23 1.43 0.35 3.88 0.23 67.36
> vdc 0.00 0.00 1193.40 2273.00 12.42 29.68 24.87 11.70 3.40 0.53 4.91 0.18 60.96
> vdc 0.00 0.00 1724.00 314.40 17.80 6.91 24.83 0.74 0.36 0.36 0.35 0.30 61.52
> vdc 0.00 0.00 2605.20 325.60 24.67 7.15 22.24 0.90 0.31 0.31 0.25 0.25 72.72
> vdc 0.00 0.00 2340.60 324.40 23.85 7.12 23.80 0.83 0.31 0.31 0.29 0.26 68.96
> vdc 0.00 0.00 2749.60 329.40 28.51 7.24 23.78 0.90 0.29 0.30 0.25 0.24 75.04
> vdc 0.00 0.00 2619.60 324.60 27.72 7.13 24.24 0.88 0.30 0.30 0.28 0.24 71.76
> vdc 0.00 0.00 1608.60 4532.40 17.17 56.40 24.54 25.78 4.20 0.50 5.51 0.12 75.36
> vdc 0.00 0.00 2361.00 326.40 23.62 7.17 23.47 0.87 0.33 0.33 0.30 0.26 69.92
> vdc 0.00 0.00 2460.00 326.00 25.89 7.16 24.30 0.88 0.32 0.32 0.26 0.26 72.72
> vdc 0.00 0.00 2519.00 325.40 25.96 7.14 23.83 0.90 0.32 0.32 0.30 0.26 74.32
> vdc 0.00 0.00 2709.80 326.20 28.91 7.17 24.34 0.94 0.31 0.30 0.36 0.25 75.52
> vdc 0.00 0.00 2676.65 329.74 28.82 7.24 24.56 0.86 0.28 0.29 0.25 0.24 71.22
> vdc 0.00 0.00 1788.40 4506.80 18.77 56.66 24.54 23.22 3.69 0.45 4.97 0.12 74.88
> vdc 0.00 0.00 1850.40 319.60 19.76 7.02 25.28 0.80 0.37 0.37 0.34 0.30 64.80
>
> It's obvious where the fsyncs are hitting, but they are having
> almost no impact on the read performance. The benchmark is
> simply not generating enough dirty data to overload the IO
> subsystem, and hence there are no latency spikes to speak of.
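>
> One quick way to confirm that is to watch the Dirty and Writeback
> counters in /proc/meminfo while the benchmark runs; a trivial
> sketch:
>
> /* print Dirty/Writeback from /proc/meminfo once a second */
> #include <stdio.h>
> #include <string.h>
> #include <unistd.h>
>
> int main(void)
> {
>         char line[128];
>
>         for (;;) {
>                 FILE *f = fopen("/proc/meminfo", "r");
>
>                 if (!f)
>                         return 1;
>                 while (fgets(line, sizeof(line), f)) {
>                         if (!strncmp(line, "Dirty:", 6) ||
>                             !strncmp(line, "Writeback:", 10))
>                                 fputs(line, stdout);
>                 }
>                 fclose(f);
>                 sleep(1);
>         }
> }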
>
> Benchmark output across fsyncs also demonstrates that:
>
> ....
> read[12494]: avg: 0.4 msec; max: 9.8 msec
> read[12499]: avg: 0.4 msec; max: 9.8 msec
> read[12495]: avg: 0.4 msec; max: 9.8 msec
> commit[12491]: avg: 0.0 msec; max: 6.3 msec
> wal[12491]: avg: 0.0 msec; max: 0.0 msec
> wal[12491]: avg: 0.0 msec; max: 0.0 msec
> starting fsync() of files
> finished fsync() of files
> wal[12491]: avg: 0.0 msec; max: 0.7 msec
> commit[12491]: avg: 0.1 msec; max: 15.8 msec
> wal[12491]: avg: 0.0 msec; max: 0.1 msec
> wal[12491]: avg: 0.0 msec; max: 0.1 msec
> read[12492]: avg: 0.6 msec; max: 10.0 msec
> read[12496]: avg: 0.6 msec; max: 10.0 msec
> read[12507]: avg: 0.6 msec; max: 10.0 msec
> read[12505]: avg: 0.6 msec; max: 10.0 msec
> ....
>
> So, I thought switching to spinning disks might show the problem.
> Same VM, this time using a 17TB linearly preallocated image file on
> an 18TB XFS filesystem on the host (virtio in the guest again) on a
> 12-disk RAID-0 (DM again) using 12x 2TB SAS drives exported as JBOD
> from a RAID controller with 512MB of BBWC. This is capable of about
> 2,000 random read IOPS, and write IOPS is dependent on the BBWC
> flushing behaviour (peaks at about 15,000, sustains about 1,500).
>
> Again, I don't see any bad behaviour:
>
> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> vdd 0.00 0.00 128.00 154.20 1.46 6.02 54.24 1.02 3.62 7.22 0.62 3.17 89.44
> vdd 0.00 0.00 133.80 154.40 1.43 6.03 53.04 1.08 3.76 7.42 0.59 3.05 87.84
> vdd 0.00 0.00 132.40 751.60 1.36 12.75 32.70 7.26 8.21 7.90 8.27 1.06 93.76
> vdd 0.00 0.00 138.60 288.60 1.47 7.62 43.57 2.69 6.30 7.24 5.85 2.15 91.68
> vdd 0.00 0.00 141.20 157.00 1.51 6.12 52.36 1.12 3.75 7.26 0.59 3.11 92.64
> vdd 0.00 0.00 136.80 154.00 1.52 6.02 53.09 1.04 3.57 6.92 0.60 3.10 90.08
> vdd 0.00 0.00 135.00 154.20 1.46 6.02 52.99 1.04 3.60 7.00 0.62 3.03 87.68
> vdd 0.00 0.00 147.20 155.80 1.50 6.09 51.26 1.19 3.92 7.45 0.59 3.06 92.72
> vdd 0.00 0.00 139.60 154.00 1.50 6.02 52.43 1.03 3.53 6.79 0.58 2.97 87.28
> vdd 0.00 0.00 126.80 913.80 1.27 14.81 31.63 7.87 7.56 9.02 7.36 0.89 92.16
> vdd 0.00 0.00 148.20 156.80 1.51 6.11 51.15 1.17 3.83 7.25 0.59 3.03 92.32
> vdd 0.00 0.00 142.60 155.80 1.49 6.09 52.03 1.12 3.76 7.23 0.60 3.05 90.88
> vdd 0.00 0.00 146.00 156.60 1.49 6.12 51.48 1.16 3.84 7.32 0.60 3.03 91.76
> vdd 0.00 0.00 148.80 157.20 1.56 6.14 51.54 1.10 3.58 6.78 0.54 2.92 89.36
> vdd 0.00 0.00 149.80 156.80 1.52 6.12 51.05 1.14 3.73 7.10 0.52 2.96 90.88
> vdd 0.00 0.00 127.20 910.60 1.29 14.82 31.80 8.17 7.87 8.62 7.77 0.88 91.44
> vdd 0.00 0.00 150.80 157.80 1.58 6.15 51.27 1.09 3.53 6.67 0.52 2.91 89.92
> vdd 0.00 0.00 153.00 156.60 1.53 6.12 50.56 1.17 3.80 7.07 0.60 2.95 91.20
> vdd 0.00 0.00 161.20 157.00 1.54 6.13 49.36 1.27 4.00 7.36 0.55 2.97 94.48
> vdd 0.00 0.00 152.10 157.09 1.55 6.14 50.91 1.15 3.73 7.06 0.50 2.98 92.22
> vdd 0.00 0.00 154.80 157.80 1.50 6.16 50.18 1.19 3.82 7.14 0.57 2.88 90.00
> vdd 0.00 0.00 122.20 922.80 1.24 14.93 31.70 8.30 7.94 8.81 7.83 0.90 93.68
> vdd 0.00 0.00 139.80 153.60 1.46 5.98 51.98 1.09 3.74 7.14 0.65 3.02 88.48
> vdd 0.00 0.00 134.60 153.60 1.44 6.00 52.87 1.09 3.78 7.38 0.63 3.15 90.88
> vdd 0.00 0.00 141.00 153.40 1.55 5.99 52.49 1.05 3.57 6.79 0.60 3.04 89.44
> vdd 0.00 0.00 130.20 154.20 1.44 6.02 53.76 1.04 3.65 7.28 0.59 3.14 89.44
> vdd 0.00 0.00 136.60 154.20 1.30 6.02 51.52 1.19 4.07 8.02 0.58 3.03 88.24
> vdd 0.00 0.00 119.00 872.20 1.29 14.29 32.20 7.98 8.05 8.59 7.98 0.93 92.64
> vdd 0.00 0.00 139.00 154.00 1.58 6.00 53.01 1.06 3.62 6.95 0.61 3.12 91.52
> vdd 0.00 0.00 146.00 155.40 1.52 6.07 51.60 1.12 3.71 7.05 0.57 3.01 90.72
> vdd 0.00 0.00 145.20 156.60 1.55 6.12 52.01 1.13 3.73 7.10 0.60 3.07 92.80
> vdd 0.00 0.00 147.80 156.80 1.51 6.12 51.36 1.13 3.72 7.07 0.56 2.97 90.56
> vdd 0.00 0.00 142.80 157.00 1.48 6.12 51.98 1.11 3.70 7.20 0.52 3.02 90.56
> vdd 0.00 0.00 125.20 922.00 1.36 14.92 31.85 8.34 7.96 8.47 7.90 0.88 92.56
>
> Again, you can see exactly where the fsyncs are hitting, and again
> they do not cause massive spikes of write IO. And the read latencies
> show:
>
> read[12711]: avg: 6.1 msec; max: 182.2 msec
> ...
> read[12711]: avg: 6.0 msec; max: 198.8 msec
> ....
> read[12711]: avg: 5.9 msec; max: 158.8 msec
> ....
> read[12711]: avg: 6.1 msec; max: 127.5 msec
> ....
> read[12711]: avg: 5.9 msec; max: 355.6 msec
> ....
> read[12711]: avg: 6.0 msec; max: 262.8 msec
> ....
>
> No significant outliers. The same for commits and wal:
>
> wal[12701]: avg: 0.0 msec; max: 0.2 msec
> wal[12701]: avg: 0.0 msec; max: 0.1 msec
> commit[12701]: avg: 0.0 msec; max: 1.0 msec
> wal[12701]: avg: 0.0 msec; max: 0.1 msec
> wal[12701]: avg: 0.0 msec; max: 0.1 msec
> starting fsync() of files
> finished fsync() of files
> wal[12701]: avg: 0.0 msec; max: 0.1 msec
> commit[12701]: avg: 0.1 msec; max: 129.2 msec
> wal[12701]: avg: 0.0 msec; max: 0.1 msec
> wal[12701]: avg: 0.0 msec; max: 0.1 msec
> commit[12701]: avg: 0.0 msec; max: 0.9 msec
> wal[12701]: avg: 0.0 msec; max: 0.2 msec
> wal[12701]: avg: 0.0 msec; max: 0.1 msec
> wal[12701]: avg: 0.0 msec; max: 0.1 msec
> commit[12701]: avg: 0.0 msec; max: 23.1 msec
>
> I'm not sure how you were generating the behaviour you reported, but
> the test program as it stands does not appear to be causing any
> problems at all on the sort of storage I'd expect large databases to
> be hosted on....
>
> I've tried a few tweaks to the test program, but I haven't been able
> to make it misbehave. What do I need to tweak in the test program or
> my test VM to make the kernel misbehave like you reported?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
> _______________________________________________
> Lsf mailing list
> Lsf@xxxxxxxxxxxxxxxxxxxxxxxxxx
> https://lists.linuxfoundation.org/mailman/listinfo/lsf
>

--
Dave Chinner
david@xxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/