Re: [PATCH v4 6/6] io_uring: add support for zone-append

From: Damien Le Moal
Date: Fri Jul 31 2020 - 05:34:57 EST


On 2020/07/31 18:14, hch@xxxxxxxxxxxxx wrote:
> On Fri, Jul 31, 2020 at 08:14:22AM +0000, Damien Le Moal wrote:
>>
>>> This was one of the reasons why we chose to isolate the operation by a
>>> different IOCB flag and not by IOCB_APPEND alone.
>>
>> For zonefs, the plan is:
>> * For the sync write case, zone append is always used.
>> * For the async write case, if we see IOCB_APPEND, then zone append BIOs are
>> used. If not, regular write BIOs are used.
>>
>> Simple enough I think. No need for a new flag.
>
> Simple, but wrong. Sync vs async really doesn't matter, even sync
> writes will have problems if there are other writers. We need a flag
> for "report the actual offset for appending writes", and based on that
> flag we need to not allow short writes (or split extents for real
> file systems). We also need a fcntl or ioctl to report this max atomic
> write size so that applications can rely on it.
>

Sync writes are done under the inode lock, so there cannot be other writers at
the same time. And for the sync case, since the actual written offset is
necessarily equal to the file size before the write, there is no need to report
it (there is no system call that can report that anyway). For this sync case,
the only change that the use of zone append introduces compared to regular
writes is the potential for more short writes.
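
To illustrate the point above from the application side, here is a minimal
userspace sketch, not zonefs code. It assumes a single writer and a file
descriptor opened with O_APPEND; the helper name append_and_locate() is
hypothetical. Under those assumptions the write lands at the file size seen
just before the write, and the caller still has to check for a short write.

```python
import os


def append_and_locate(fd, data):
    """Append data to fd and return (offset, bytes_written).

    Assumption: fd was opened with O_WRONLY | O_APPEND and no other writer
    touches the file between fstat() and write(). Only then is the file size
    observed before the write equal to the offset the data landed at.
    """
    # File size before the write: the expected landing offset.
    offset = os.fstat(fd).st_size
    # write() may return fewer bytes than requested (a short write),
    # so the count is returned for the caller to check.
    written = os.write(fd, data)
    return offset, written
```

If another process appends between the fstat() and the write(), the returned
offset is stale, which is exactly the multi-writer case raised in the quoted
text.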

Adding a flag for "report the actual offset for appending writes" is fine with
me, but do you also mean to use this flag for driving zone append writes vs
regular writes in zonefs?

The fcntl or ioctl for getting the max atomic write size would be fine too.
Given that zonefs is very close to the underlying zoned drive, I was assuming
that the application can simply consult the device's zone_append_max_bytes sysfs
queue attribute. For regular file systems, this value would be used internally
only. I do not really see how it can be useful to applications. Furthermore, the
file system may have a hard time giving that information to the application
depending on its underlying storage configuration (e.g. erasure
coding/declustered RAID).
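
As a rough sketch of what "consult the sysfs attribute" could look like for an
application, assuming the usual /sys/block/<dev>/queue/ layout: the function
name and the sysfs_root parameter below are hypothetical and only there to make
the example testable. For a partition, the queue directory belongs to the
parent disk, which this sketch does not resolve.

```python
from pathlib import Path


def zone_append_max_bytes(dev, sysfs_root="/sys/block"):
    """Return the zone append limit in bytes for block device `dev`.

    `dev` is a whole-disk name such as "sdb" or "nvme0n1" (assumption: not a
    partition). Returns None if the attribute is missing, e.g. on kernels
    that do not expose it, or if its content is not an integer.
    """
    attr = Path(sysfs_root) / dev / "queue" / "zone_append_max_bytes"
    try:
        return int(attr.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None
```

An application would compare its write size against the returned value before
issuing an append. A returned value of 0 would need separate handling by the
caller.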



--
Damien Le Moal
Western Digital Research