Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

From: Chris Mason
Date: Wed Jan 22 2014 - 13:37:45 EST

On Wed, 2014-01-22 at 10:13 -0800, James Bottomley wrote:
> On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote:

> > We're likely to have people mixing 4K drives and <fill in some other
> > size here> on the same box. We could just go with the biggest size and
> > use the existing bh code for the sub-pagesized blocks, but I really
> > hesitate to change VM fundamentals for this.
> If the page cache had a variable granularity per device, that would cope
> with this. It's the variable granularity that's the VM problem.

Agreed. But once we go variable granularity, we're basically talking
about the large-order allocation problem.

> > From a pure code point of view, it may be less work to change it once in
> > the VM. But from an overall system impact point of view, it's a big
> > change in how the system behaves just for filesystem metadata.
> Agreed, but only if we don't do RMW in the buffer cache ... which may be
> a good reason to keep it.
> > > The other question is if the drive does RMW between 4k and whatever its
> > > physical sector size, do we need to do anything to take advantage of
> > > it ... as in what would altering the granularity of the page cache buy
> > > us?
> >
> > The real benefit is when and how the reads get scheduled. We're able to
> > do a much better job pipelining the reads, controlling our caches and
> > reducing write latency by having the reads done up in the OS instead of
> > the drive.
> I agree with all of that, but my question is still can we do this by
> propagating alignment and chunk size information (i.e. the physical
> sector size) like we do today. If the FS knows the optimal I/O patterns
> and tries to follow them, the odd cockup won't impact performance
> dramatically. The real question is can the FS make use of this layout
> information *without* changing the page cache granularity? Only if you
> answer me "no" to this do I think we need to worry about changing page
> cache granularity.

Can it mostly work? I think the answer is yes. If not, we'd have a lot
of miserable people on top of raid5/6 right now. We can always make a
generic r/m/w engine in DM that supports larger sectors transparently.
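
To make the r/m/w idea concrete, here is a rough toy sketch in Python. It
is not DM code: the 16K physical sector size, the LargeSectorDevice class
and the rmw_write() helper are all made up for illustration. The only
point is that a write covering a whole physical sector skips the read,
while a partial one has to read the sector first.

```python
# Hypothetical sketch: names and sizes are illustrative, not kernel APIs.
PHYSICAL = 16384      # assumed physical sector size, larger than a 4K block


class LargeSectorDevice:
    """Toy device that can only read or write whole physical sectors."""

    def __init__(self, sectors):
        self.data = bytearray(sectors * PHYSICAL)
        self.reads = 0
        self.writes = 0

    def read_sector(self, idx):
        self.reads += 1
        start = idx * PHYSICAL
        return bytearray(self.data[start:start + PHYSICAL])

    def write_sector(self, idx, buf):
        assert len(buf) == PHYSICAL
        self.writes += 1
        start = idx * PHYSICAL
        self.data[start:start + PHYSICAL] = buf


def rmw_write(dev, offset, payload):
    """Write payload at a byte offset, reading first only for partial sectors."""
    end = offset + len(payload)
    pos = offset
    while pos < end:
        idx = pos // PHYSICAL
        sector_start = idx * PHYSICAL
        lo = pos - sector_start                  # first byte touched in sector
        hi = min(end - sector_start, PHYSICAL)   # one past last byte touched
        chunk = payload[pos - offset:pos - offset + (hi - lo)]
        if lo == 0 and hi == PHYSICAL:
            buf = bytearray(chunk)               # full sector: no read needed
        else:
            buf = dev.read_sector(idx)           # partial sector: read ...
            buf[lo:hi] = chunk                   # ... modify ...
        dev.write_sector(idx, buf)               # ... and write back
        pos = sector_start + hi
```

With this toy, a single 4K write at offset 4096 costs one read plus one
write, while a 16K write at offset 16384 costs one write and no read.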

> Realistically, if you look at what the I/O schedulers output on a
> standard (spinning rust) workload, it's mostly large transfers.
> Obviously these are misaligned at the ends, but we can fix some of that
> in the scheduler. Particularly if the FS helps us with layout. My
> instinct tells me that we can fix 99% of this with layout on the FS + io
> schedulers ... the remaining 1% goes to the drive as needing to do RMW
> in the device, but the net impact to our throughput shouldn't be that
> great.

There are a few workloads where the VM and the FS would team up to make
this fairly miserable:

Small files. Delayed allocation fixes a lot of this, but the VM doesn't
realize that fileA, fileB, fileC, and fileD all need to be written at
the same time to avoid RMW. Btrfs and MD have set up plugging callbacks
to accumulate full stripes as much as possible, but it still hurts.

Metadata. These writes are very latency sensitive and we'll gain a lot
if the FS is explicitly trying to build full sector IOs.

I do agree that it's very likely these drives are going to silently rmw
in the background for us.

Circling back to what we might talk about at the conference: Ric, do you
have any ideas on when these drives might hit the wild?

