Re: Performance testing of various barrier reduction patches [was: Re: [RFC v4] ext4: Coordinate fsync requests]

From: Ric Wheeler
Date: Fri Oct 08 2010 - 17:55:42 EST


On 10/08/2010 05:26 PM, Darrick J. Wong wrote:
> On Mon, Sep 27, 2010 at 04:01:11PM -0700, Darrick J. Wong wrote:
>> Other than those regressions, the jbd2 fsync coordination is about as fast as
>> sending the flush directly from ext4. Unfortunately, where there _are_
>> regressions they seem rather large, which makes this approach (as implemented,
>> anyway) less attractive. Perhaps there is a better way to do it?
> Hmm, not much chatter for two weeks. Either I've confused everyone with the
> humongous spreadsheet, or ... something?
>
> I've performed some more extensive performance and safety testing with the
> fsync coordination patch. The results have been merged into the spreadsheet
> that I linked to in the last email, though in general the results have not
> really changed much at all.
>
> I see two trends happening here with regards to comparing the use of jbd2 to
> coordinate the flushes vs. measuring and coordinating flushes directly in ext4.
> The first is that for loads that most benefit from having any kind of fsync
> coordination (i.e. storage with slow flushes), the jbd2 approach provides the
> same or slightly better performance than the direct approach. However, for
> storage with fast flushes, the jbd2 approach seems to cause major slowdowns
> even compared to not changing any code at all. To me this would suggest that
> ext4 needs to coordinate the fsyncs directly, even at a higher code maintenance
> cost, because a huge performance regression isn't good.
>
> Other people in my group have been running their own performance comparisons
> between no-coordination, jbd2-coordination, and direct-coordination, and what
> I'm hearing is that the direct-coordination mode is slightly faster than jbd2
> coordination, though either is better than no coordination at all. Happily, I
> haven't seen an increase in fsck complaints in my poweroff testing.
>
> Given the nearness of the merge window, perhaps we ought to discuss this on
> Monday's ext4 call? In the meantime I'll clean up the fsync coordination patch
> so that it doesn't have so many debugging knobs and whistles.
>
> Thanks,
>
> --D

Hi Darrick,

We have been busily testing various combinations at Red Hat (we being not me :)), but here is one test that we used a while back to validate the impact of fsync batching.

You need a slow, poky S-ATA drive - the slower it spins, the better.

A single fs_mark run against that drive with 1 thread should produce a modest files/sec rate:


[root@tunkums /]# fs_mark -s 20480 -n 500 -L 5 -d /test/foo
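
(That is five loops (-L 5) of 500 files per loop (-n 500), each 20480 bytes (-s 20480), created under /test/foo; fs_mark prints one summary line per loop.)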

On my disk, I see:

FSUse%  Count   Size   Files/sec  App Overhead
     5    500  20480        31.8          6213

Now run with 4 threads to give the code a chance to coalesce.
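
(I assume the threaded runs just add fs_mark's thread-count option, i.e. something like:

[root@tunkums /]# fs_mark -s 20480 -n 500 -L 5 -t 4 -d /test/foo

and -t 8 for the run after that.)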

On my box, I see it jump up:

FSUse%  Count   Size   Files/sec  App Overhead
     5   2000  20480       113.0         25092

And at 8 threads it jumps again:

FSUse%  Count   Size   Files/sec  App Overhead
     5   4000  20480       179.0         49480

This workload is very device specific. On a very low-latency device (arrays, high-performance SSDs), the coalescing "wait" time could cost more than just dispatching the command right away. The ext3/4 work Josef did a few years back used high-resolution timers to adjust that wait dynamically and avoid slowing such devices down.
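
To make the batching idea concrete, below is a minimal userspace sketch; it is NOT the kernel patch, and the fixed 2ms hold-off, the file name, and all identifiers are invented for illustration. The first thread that needs a flush waits briefly so other threads can join the batch, then a single fsync() covers all of them. The adaptive scheme would replace the fixed sleep with a value derived from the measured flush time.

/*
 * Minimal userspace sketch of fsync batching; illustrative only.
 * Build with: gcc -pthread
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 8
#define NWRITES  100

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
static unsigned long queued_seq;  /* writes issued so far */
static unsigned long flushed_seq; /* writes covered by the last fsync */
static int flushing;              /* a coordinator is mid-flush */
static int fd;

/* Block until write number 'seq' is known to have been flushed. */
static void batched_fsync(unsigned long seq)
{
	pthread_mutex_lock(&lock);
	while (flushed_seq < seq) {
		if (!flushing) {
			unsigned long batch;

			/* We got here first: coordinate this batch. */
			flushing = 1;
			pthread_mutex_unlock(&lock);

			/*
			 * Hold off so other threads can join the batch.
			 * Fixed here; the adaptive scheme would derive
			 * this from the measured flush time.
			 */
			usleep(2000);

			pthread_mutex_lock(&lock);
			batch = queued_seq;
			pthread_mutex_unlock(&lock);

			fsync(fd); /* one flush covers the whole batch */

			pthread_mutex_lock(&lock);
			flushed_seq = batch;
			flushing = 0;
			pthread_cond_broadcast(&done);
		} else {
			/* A flush is already in flight; wait for it. */
			pthread_cond_wait(&done, &lock);
		}
	}
	pthread_mutex_unlock(&lock);
}

static void *worker(void *arg)
{
	int i;

	(void)arg;
	for (i = 0; i < NWRITES; i++) {
		unsigned long seq;

		pthread_mutex_lock(&lock);
		if (write(fd, "x", 1) != 1)
			perror("write");
		seq = ++queued_seq;
		pthread_mutex_unlock(&lock);

		batched_fsync(seq);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	int i;

	fd = open("testfile", O_CREAT | O_WRONLY | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	close(fd);
	return 0;
}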

Have we tested the combined patchset with this fs_mark workload?

Thanks!

Ric