I agree that when you don't set the sector size to 16k you are not forcing the
filesystem to use 16k IOs; the metadata can still be 4k. But when you
use a 16k sector size, the 16k IOs should be respected by the
filesystem.

Do we break BIOs to below a min order if the sector size is also set to
16k? I haven't seen that and it's unclear when or how that could happen.
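
For reference, the "min order" in question follows from the block size and the page size. A minimal sketch, assuming a 4k base page size and power-of-two sizes; the function name is made up for illustration and is not a kernel API:

```python
def min_folio_order(block_size, page_size=4096):
    """Return log2(block_size / page_size), clamped at 0.

    Hypothetical helper: mirrors the idea that a filesystem block larger
    than a page needs page cache folios of at least this order.
    """
    for name, value in (("block_size", block_size), ("page_size", page_size)):
        # Both sizes must be positive powers of two.
        if value <= 0 or value & (value - 1):
            raise ValueError(f"{name} must be a positive power of two")
    if block_size <= page_size:
        return 0
    return (block_size // page_size).bit_length() - 1


order_16k = min_folio_order(16 * 1024)
order_4k = min_folio_order(4 * 1024)
order_64k = min_folio_order(64 * 1024)
print(order_16k, order_4k, order_64k)
```

With 4k pages, a 16k block size gives order 2, i.e. folios of four pages, which is the granularity a BIO would have to be split below for the concern above to apply.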

At least for NVMe we don't need to yell at a device to inform it we want
a 16k IO issued to it to be atomic; if we read that it has the
capability for it, it just does it. The IO verification can be done with
blkalgn [0].
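
To make concrete what such verification looks for, here is a small sketch that classifies IOs by alignment and size against a 16k unit. It is not how blkalgn is implemented; the (offset, length) tuples are made-up sample data standing in for traced block requests:

```python
from collections import Counter

UNIT = 16 * 1024  # assumed atomic / sector unit in bytes


def classify_ios(ios, unit=UNIT):
    """Count IOs that are unit-aligned and a whole multiple of the unit.

    ios: iterable of (offset_bytes, length_bytes) pairs.
    """
    counts = Counter()
    for offset, length in ios:
        aligned = offset % unit == 0
        whole = length > 0 and length % unit == 0
        if aligned and whole:
            counts["ok"] += 1          # starts on a boundary, full units
        elif not aligned:
            counts["misaligned"] += 1  # starts off a unit boundary
        else:
            counts["partial"] += 1     # aligned but not a whole unit
    return counts


# Hypothetical sample: two good IOs, one misaligned, one 4k partial.
sample = [(0, 16384), (16384, 32768), (4096, 16384), (32768, 4096)]
result = classify_ios(sample)
print(dict(result))
```

If the filesystem respects the 16k sector size, every traced IO should land in the "ok" bucket.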

Does SCSI *require* 16k atomic prep work, or can it be done implicitly?
Does it need WRITE_ATOMIC_16?

[0] https://github.com/dagmcr/bcc/tree/blkalgn

So just increasing the inode block size / FS block size does not
really change anything, in itself.

If we're breaking up IOs when a min order is set for an inode, that
would need to be looked into, but we're not seeing that.

I suspect IO verification with the above tool should prove to show the
same if you use a filesystem with a larger sector size set too, and you
just would not have to do any changes to userspace other than the
filesystem creation with say mkfs.xfs params of -b size=16k -s size=16k.

Do untorn writes actually exist in SCSI? I was under the impression
nobody had actually implemented them in SCSI hardware.

I know that some SCSI targets actually atomically write data in chunks >
LBS. Obviously atomic vs non-atomic performance is a moot point there, as
data is implicitly always atomically written.

We actually have a MySQL/InnoDB port of this API working on such a SCSI
target.

However I am not sure about atomic write support for other SCSI targets.

Good to know!
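
As a sketch of the filesystem creation step mentioned above, the helper below only builds and sanity-checks the mkfs.xfs argument list; it does not run anything. The device path and the helper names are hypothetical, and the check that the sector size must not exceed the block size reflects my understanding of mkfs.xfs rather than anything stated in this thread:

```python
def parse_size(text):
    """Convert a size such as '16k' or '512' to bytes."""
    text = text.strip().lower()
    if text.endswith("k"):
        return int(text[:-1]) * 1024
    return int(text)


def mkfs_xfs_args(device, block="16k", sector="16k"):
    """Build the argv for mkfs.xfs with matching block and sector sizes."""
    block_bytes = parse_size(block)
    sector_bytes = parse_size(sector)
    for name, value in (("block", block_bytes), ("sector", sector_bytes)):
        # Sizes must be positive powers of two.
        if value <= 0 or value & (value - 1):
            raise ValueError(f"{name} size must be a positive power of two")
    if sector_bytes > block_bytes:
        raise ValueError("sector size must not exceed block size")
    return ["mkfs.xfs", "-b", f"size={block}", "-s", f"size={sector}", device]


# "/dev/example" is a placeholder, not a real device.
args = mkfs_xfs_args("/dev/example")
print(" ".join(args))
```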

Would using the same min and max order for the inode work instead?

To me, O_ATOMIC would be required for buffered atomic writes IO, as we want
a fixed-sized IO, so that would mean no mixing of atomic and non-atomic IO.

We saw untorn writes as not being a property of the file or even the inode
itself, but rather an attribute of the specific IO being issued from the
userspace application.

The problem is that keeping track of that is expensive for buffered
writes. It's a model that only works for direct IO. Arguably we
could make it work for O_SYNC buffered IO, but that'll require some
surgery.