Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

From: David Lang
Date: Wed Jan 22 2014 - 21:46:47 EST


On Wed, 22 Jan 2014, Chris Mason wrote:

On Wed, 2014-01-22 at 11:50 -0800, Andrew Morton wrote:
On Wed, 22 Jan 2014 11:30:19 -0800 James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:

But this, I think, is the fundamental point for debate. If we can pull
alignment and other tricks to solve 99% of the problem is there a need
for radical VM surgery? Is there anything coming down the pipe in the
future that may move the devices ahead of the tricks?

I expect it would be relatively simple to get large blocksizes working
on powerpc with 64k PAGE_SIZE. So before diving in and doing huge
amounts of work, perhaps someone can do a proof-of-concept on powerpc
(or ia64) with 64k blocksize.


Maybe 5 drives in raid5 on MD, with 4K coming from each drive. Well
aligned 16K IO will work, everything else will be about the same as a rmw
from a single drive.

I think this is the key point to think about here. How will these new hard drive large block sizes differ from RAID stripes and SSD eraseblocks?

In all of these cases there are very clear advantages to doing the writes in properly sized and aligned chunks that correspond with the underlying structure to avoid the RMW overhead.
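As a rough illustration of what "properly sized and aligned" means, here is a minimal Python sketch. It assumes the only thing that matters is whether a write covers whole units of the underlying structure; the function name needs_rmw and the 16K unit (4 data drives x 4K, borrowed from the raid5 example above) are illustrative choices, not anything md or a drive actually exposes.

```python
def needs_rmw(offset: int, length: int, unit: int) -> bool:
    """Return True if a write covers some underlying unit only partially.

    offset, length and unit are all in bytes. A write avoids
    read-modify-write only when it starts and ends on unit boundaries.
    """
    if length <= 0 or unit <= 0:
        raise ValueError("length and unit must be positive")
    return offset % unit != 0 or (offset + length) % unit != 0


# Hypothetical unit size: 4 data drives x 4K each = 16K of data per stripe.
STRIPE = 4 * 4096

print(needs_rmw(0, STRIPE, STRIPE))      # aligned, full stripe
print(needs_rmw(4096, STRIPE, STRIPE))   # right size, wrong offset
print(needs_rmw(0, 4096, STRIPE))        # aligned start, too short
```

The same check applies whether the unit is a RAID stripe, an SSD eraseblock, or a large drive sector; only the value of unit changes.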

It's extremely unlikely that drive manufacturers will produce drives that won't work with any existing OS, so they are going to support smaller writes in firmware. If they don't, they won't be able to sell their drives to anyone running existing software. Given the enterprise software upgrade cycle compared to the expanding storage needs, whatever they ship will have to work with OS and firmware releases from several years ago.

I think what is needed is some way to be able to get a report on how many RMW cycles have to happen. Then people can work on ways to reduce this number and measure the results.
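To make the idea of such a report concrete, here is a small userspace model in Python. It is only a sketch under stated assumptions: it treats every partially covered unit as exactly one RMW cycle and ignores any caching or coalescing that firmware or md might do. The names partial_units and rmw_report and the trace format of (offset, length) pairs are made up for the example; no existing kernel interface is implied.

```python
def partial_units(offset: int, length: int, unit: int) -> int:
    """Count the units that one write covers only partially.

    Assumption: each partially covered unit costs one RMW cycle.
    Only the first and last unit of a write can be partial.
    """
    if length <= 0:
        return 0
    end = offset + length
    first = offset // unit
    last = (end - 1) // unit
    head_partial = offset % unit != 0
    tail_partial = end % unit != 0
    if first == last:
        # The whole write lands inside a single unit.
        return 1 if (head_partial or tail_partial) else 0
    return int(head_partial) + int(tail_partial)


def rmw_report(writes, unit):
    """Summarize a trace of (offset, length) writes against one unit size."""
    per_write = [partial_units(off, length, unit) for off, length in writes]
    return {
        "writes": len(per_write),
        "rmw_writes": sum(1 for n in per_write if n > 0),
        "rmw_cycles": sum(per_write),
    }


# Hypothetical trace against a 16K unit (offset, length in bytes).
trace = [(0, 16384), (4096, 16384), (0, 4096), (16384, 32768)]
print(rmw_report(trace, 16384))
```

Running the same trace against different unit sizes would show how sensitive a workload is to the size of the underlying structure, which is the kind of number people could then try to drive down.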

I don't know if md and dm are currently smart enough to realize that the entire stripe is being overwritten and avoid the RMW cycle. If they can't, I would expect that once we start measuring it, they will gain such support.

David Lang