Re: [PATCH v4 6/6] io_uring: add support for zone-append

From: hch@xxxxxxxxxxxxx
Date: Fri Jul 31 2020 - 05:41:46 EST


On Fri, Jul 31, 2020 at 09:34:50AM +0000, Damien Le Moal wrote:
> Sync writes are done under the inode lock, so there cannot be other writers at
> the same time. And for the sync case, since the actual written offset is
> necessarily equal to the file size before the write, there is no need to report
> it (there is no system call that can report that anyway). For this sync case,
> the only change that the use of zone append introduces compared to regular
> writes is the potential for more short writes.
>
> Adding a flag for "report the actual offset for appending writes" is fine with
> me, but do you also mean to use this flag for driving zone append writes vs
> regular writes in zonefs?

Let's keep semantics and implementation separate. For the case
where we report the actual offset we need a size limitation and no
short writes.
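
To make that concrete, this is roughly the shape I'd expect on the
application side - the flag name and where the offset comes back in
the CQE are entirely made up here, just to illustrate the semantics:

	struct io_uring_cqe *cqe;
	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

	io_uring_prep_write(sqe, fd, buf, len, 0);
	/* made-up flag: append, no short writes, report actual offset */
	sqe->rw_flags = RWF_APPEND_REPORT_OFFSET;
	io_uring_submit(&ring);

	io_uring_wait_cqe(&ring, &cqe);
	/*
	 * cqe->res is either -errno or the full length, and the offset
	 * the data actually landed at comes back in the completion -
	 * where exactly is the implementation detail we should keep
	 * separate from the semantics.
	 */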

Anything with those semantics can be implemented trivially using Zone
Append in zonefs, and we don't even need the exclusive lock in that
case. But even without that flag, anything that holds an exclusive lock
can at least in theory be implemented using Zone Append; it just needs
support for submitting another request from the I/O completion handler
of the first. I just don't think it is worth it - with the exclusive
lock we have serialized access to the zone, so a normal write works
just fine, both for the sync and the async case.
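
In pseudocode the completion-handler resubmission would look something
like this - all of the helpers are made up, and I'm hand-waving the
bio plumbing:

	/* purely illustrative, none of this exists in zonefs today */
	static void zone_append_end_io(struct bio *bio)
	{
		struct append_ctx *ctx = bio->bi_private;

		ctx->done += ctx->cur_len;
		if (ctx->done < ctx->total) {
			/* short append: submit the rest from completion */
			submit_next_append(ctx);
			return;
		}
		complete_write(ctx);	/* also drops the exclusive lock */
	}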

> The fcntl or ioctl for getting the max atomic write size would be fine too.
> Given that zonefs is very close to the underlying zoned drive, I was assuming
> that the application can simply consult the device sysfs zone_append_max_bytes
> queue attribute.

For zonefs we can, yes. But in many ways that is a lot more cumbersome
than having an API that works on the fd you want to write to.
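
Something like the below, with the fcntl command entirely made up:

	#include <fcntl.h>

	/* hypothetical command number, nothing like it exists today */
	#define F_GET_APPEND_MAX_BYTES	1040

	long max_bytes = fcntl(fd, F_GET_APPEND_MAX_BYTES);
	if (max_bytes > 0 && (long)len > max_bytes)
		len = max_bytes;	/* clamp each append write */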

> For regular file systems, this value would be used internally
> only. I do not really see how it can be useful to applications. Furthermore, the
> file system may have a hard time giving that information to the application
> depending on its underlying storage configuration (e.g. erasure
> coding/declustered RAID).

File systems might have all kinds of limits of their own (e.g. extent
sizes). And a good API that just works everywhere and is properly
documented is much better than heaps of cargo-culted crap all over
applications.