Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks

From: Ric Wheeler
Date: Thu Mar 17 2016 - 09:49:24 EST


On 03/16/2016 06:23 PM, Chris Mason wrote:
> On Tue, Mar 15, 2016 at 05:51:17PM -0700, Chris Mason wrote:
>> On Tue, Mar 15, 2016 at 07:30:14PM -0500, Eric Sandeen wrote:
>>> On 3/15/16 7:06 PM, Linus Torvalds wrote:
>>>> On Tue, Mar 15, 2016 at 4:52 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>>> It is pretty clear that the onus is on the patch submitter to
>>>>> provide justification for inclusion, not for the reviewer/maintainer
>>>>> to have to prove that the solution is unworkable.
>>>> I agree, but quite frankly, performance is a good justification.
>>>>
>>>> So if Ted can give performance numbers, that's justification enough.
>>>> We've certainly taken changes with less.
>>> I've been away from ext4 for a while, so I'm really not on top of the
>>> mechanics of the underlying problem at the moment.
>>>
>>> But I would say that in addition to numbers showing that ext4 has trouble
>>> with unwritten extent conversion, we should have an explanation of
>>> why it can't be solved in a way that doesn't open up these concerns.
>>>
>>> XFS certainly has different mechanisms, but is the demonstrated workload
>>> problematic on XFS (or btrfs) as well? If not, can ext4 adopt any of the
>>> solutions that make the workload perform better on other filesystems?
>> When I've benchmarked this in the past, doing small random buffered writes
>> into a preallocated extent was dramatically (3x or more) slower on XFS
>> than doing them into a fully written extent. That was two years ago,
>> but I can redo it.
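
To make the two cases concrete: the comparison is between a file whose blocks
were preallocated (and so are still marked as unwritten extents) and one whose
blocks were fully written first. A minimal C sketch of that setup follows; it
is not the actual test harness from the thread, and the file names and sizes
are illustrative:

/* Sketch: set up the two files being compared; names and sizes are
 * illustrative, and error handling is omitted for brevity. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define FILE_SIZE (1024L * 1024 * 1024)   /* 1GiB, arbitrary */
#define CHUNK     (1 << 20)

int main(void)
{
	/* "prealloc" case: blocks reserved but marked unwritten, so each
	 * first write forces an unwritten->written extent conversion. */
	int fd = open("prealloc.dat", O_CREAT | O_WRONLY, 0644);
	fallocate(fd, 0, 0, FILE_SIZE);
	close(fd);

	/* "overwrite" case: data fully written once up front, so later
	 * writes are plain overwrites with no extent state change. */
	fd = open("overwrite.dat", O_CREAT | O_WRONLY, 0644);
	char *buf = calloc(1, CHUNK);
	for (off_t off = 0; off < FILE_SIZE; off += CHUNK)
		pwrite(fd, buf, CHUNK, off);
	fsync(fd);
	close(fd);
	free(buf);
	return 0;
}
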
> So I re-ran some benchmarks, with 4K O_DIRECT random IOs on NVMe (4.5
> kernel). This is O_DIRECT without O_SYNC. I don't think XFS will do
> commits for each IO into the prealloc file? O_SYNC makes it much
> slower, so hopefully I've got this right.
>
> The test runs for 60 seconds, and I used an iodepth of 4:
>
> prealloc file: 32,000 iops
> overwrite: 121,000 iops
>
> If I bump the iodepth up to 512:
>
> prealloc file: 33,000 iops
> overwrite: 279,000 iops
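
For reference, the shape of the I/O being measured is 4K, block-aligned,
O_DIRECT random writes. The numbers above came from an async benchmark run
at iodepth 4 and 512; the synchronous C sketch below (illustrative file
name, span, and IO count) only shows the access pattern, not the queue
depth, so it would not reproduce those figures:

/* Sketch: 4K O_DIRECT random writes into one of the files above.
 * Synchronous (effective iodepth 1); parameters are illustrative. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK  4096
#define SPAN (1024L * 1024 * 1024)   /* region written over, arbitrary */
#define NIOS 100000L

int main(void)
{
	int fd = open("prealloc.dat", O_WRONLY | O_DIRECT);
	if (fd < 0) { perror("open"); return 1; }

	/* O_DIRECT requires block-aligned buffers and offsets */
	void *buf;
	if (posix_memalign(&buf, BLK, BLK))
		return 1;
	memset(buf, 0xab, BLK);

	srand(42);
	for (long i = 0; i < NIOS; i++) {
		off_t off = (rand() % (SPAN / BLK)) * BLK;
		if (pwrite(fd, buf, BLK, off) != BLK) {
			perror("pwrite");
			return 1;
		}
	}
	close(fd);
	free(buf);
	return 0;
}
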

> For streaming writes, XFS converts prealloc to written much better when
> the IO isn't random. You can start seeing the difference at 16K
> sequential O_DIRECT writes, but really it's not a huge impact. The worst
> case is 4K:
>
> prealloc file: 227MB/s
> overwrite: 340MB/s
>
> I can't think of sequential workloads where this will matter, since they
> will either end up with bigger IO or the performance impact won't get
> noticed.
>
> -chris

I think that these numbers are the interesting ones; a 4x slowdown is certainly significant.

If you run the same test after hacking the XFS preallocation as Dave suggested with xfs_db, do we get most of the performance back?

Ric