Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS

From: Damien Le Moal
Date: Wed Aug 19 2020 - 07:17:56 EST


On 2020/08/19 19:32, Kanchan Joshi wrote:
> On Wed, Aug 19, 2020 at 3:08 PM Damien Le Moal <Damien.LeMoal@xxxxxxx> wrote:
>>
>> On 2020/08/19 18:27, Kanchan Joshi wrote:
>>> On Tue, Aug 18, 2020 at 12:46 PM Christoph Hellwig <hch@xxxxxx> wrote:
>>>>
>>>> On Tue, Aug 18, 2020 at 10:59:35AM +0530, Kanchan Joshi wrote:
>>>>> Set elevator feature ELEVATOR_F_ZBD_SEQ_WRITE required for ZNS.
>>>>
>>>> No, it is not.
>>>
>>> Are you saying MQ-Deadline (write-lock) is not needed for writes on ZNS?
>>> I see that null-block zoned and SCSI-ZBC both set this requirement. I
>>> wonder how it became different for NVMe.
>>
>> It is not required for an NVMe ZNS drive that has zone append native support.
>> zonefs and upcoming btrfs do not use regular writes, removing the requirement
>> for zone write locking.
>
> I understand that if a particular user (zonefs, btrfs etc) is not
> sending regular-write and sending append instead, write-lock is not
> required.
> But if that particular user or some other user (say F2FS) sends
> regular write(s), write-lock is needed.

And that can be trivially enabled by setting the drive elevator to mq-deadline.
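
As a minimal sketch of what "setting the drive elevator" amounts to: the
scheduler is selected by writing its name to the device's queue/scheduler
attribute in sysfs. The helper name set_elevator() and the device name in the
comment are illustrative assumptions, not part of the patch series.

```c
#include <stdio.h>

/*
 * Hypothetical helper: select an I/O scheduler by writing its name to a
 * block device's sysfs "scheduler" attribute, for example
 * /sys/block/nvme0n1/queue/scheduler (the device name is an assumption).
 * Returns 0 on success, -1 on failure.
 */
int set_elevator(const char *attr_path, const char *name)
{
	FILE *f = fopen(attr_path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fputs(name, f) >= 0;
	/* fclose() flushes the buffer, so a rejected write is reported here. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

Called as set_elevator("/sys/block/nvme0n1/queue/scheduler", "mq-deadline")
with sufficient privileges, this would restore zone write locking for regular
writes on that device.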

> Above block-layer, both the opcodes REQ_OP_WRITE and
> REQ_OP_ZONE_APPEND are available to be used by users. And I thought
> write-lock is taken or not is a per-opcode thing and not per-user (FS,
> MD/DM, user-space etc.), is not that correct? And MQ-deadline can
> cater to both the opcodes, while other schedulers cannot serve
> REQ_OP_WRITE well for zoned-device.

mq-deadline ignores zone append commands: no zone lock is taken for them. In
scsi, the emulation takes the zone lock before transforming the zone append into
a regular write. That locking is consistent with the scheduler-level locking in
mq-deadline since the same lock bitmap is used. So if the user only issues zone
append writes, mq-deadline is not needed and there is no reason to force its use
by setting ELEVATOR_F_ZBD_SEQ_WRITE. E.g. the user may want to use kyber...
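
The per-opcode behaviour described above can be modelled in user space roughly
as follows. This is a simplified sketch, not kernel code: the enum, the struct
and all function names are made up for illustration, and the model is limited
to 64 zones. Only the idea of one shared per-zone lock bitmap, taken by the
scheduler for regular writes and by the driver for emulated zone appends, comes
from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative opcodes standing in for REQ_OP_WRITE / REQ_OP_ZONE_APPEND. */
enum op { OP_WRITE, OP_ZONE_APPEND };

/* One lock bit per zone, shared by scheduler and emulation (zone < 64). */
struct zoned_dev {
	uint64_t zone_wlock;   /* bit n set: zone n is write-locked */
	bool native_append;    /* device executes zone append itself */
};

static bool zone_trylock(struct zoned_dev *d, unsigned int zone)
{
	uint64_t bit = UINT64_C(1) << zone;

	if (d->zone_wlock & bit)
		return false;
	d->zone_wlock |= bit;
	return true;
}

/* Scheduler side: only regular writes are serialized per zone. */
bool sched_can_dispatch(struct zoned_dev *d, enum op op, unsigned int zone)
{
	if (op == OP_ZONE_APPEND)
		return true;   /* ignored by the scheduler, no lock taken */
	return zone_trylock(d, zone);
}

/* Driver side: an emulated append locks the zone before becoming a write. */
bool drv_can_issue(struct zoned_dev *d, enum op op, unsigned int zone)
{
	if (op == OP_ZONE_APPEND && !d->native_append)
		return zone_trylock(d, zone);
	return true;
}

/* Completion of a request that held the zone lock releases it. */
void complete_locked(struct zoned_dev *d, unsigned int zone)
{
	d->zone_wlock &= ~(UINT64_C(1) << zone);
}
```

In this model a workload that only issues zone appends never touches the
bitmap on the scheduler side, which is why any scheduler would do for it.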

>> In the context of your patch series, ELEVATOR_F_ZBD_SEQ_WRITE should be set only
>> and only if the drive does not have native zone append support.
>
> Sure I can keep it that way, once I get it right. If it is really not
> required for native-append drive, it should not be here at the place
> where I added.
>
>> And even in that
>> case, since for an emulated zone append the zone write lock is taken and
>> released by the emulation driver itself, ELEVATOR_F_ZBD_SEQ_WRITE is required
>> only if the user will also be issuing regular writes at high QD. And that is
>> trivially controllable by the user by simply setting the drive elevator to
>> mq-deadline. Conclusion: setting ELEVATOR_F_ZBD_SEQ_WRITE is not needed.
>
> Are we saying applications should switch schedulers based on the write
> QD (use any-scheduler for QD1 and mq-deadline for QD-N).
> Even if it does that, it does not know what other applications would
> be doing. That seems hard-to-get-right and possible only in a
> tightly-controlled environment.

Even for SMR, the user is free to set the elevator to none, which disables zone
write locking. Issuing writes correctly then becomes the responsibility of the
application. This can be useful for settings that for instance use NCQ I/O
priorities, which give better results when "none" is used.

As far as I know, zoned drives are always used in tightly controlled
environments. Problems like "does not know what other applications would be
doing" are non-existent. Setting up the drive correctly for the use case at hand
is a sysadmin/server setup problem, based on *the* application (singular)
requirements.


--
Damien Le Moal
Western Digital Research