Re: [PATCH 2/2] nvme: add emulation for zone-append

From: Javier Gonzalez
Date: Thu Aug 20 2020 - 04:15:34 EST


On 20.08.2020 13:07, Kanchan Joshi wrote:
On Thu, Aug 20, 2020 at 3:22 AM Keith Busch <kbusch@xxxxxxxxxx> wrote:

On Wed, Aug 19, 2020 at 01:11:58PM -0600, David Fugate wrote:
> Intel does not support making *optional* NVMe spec features *required*
> by the NVMe driver.

This is inaccurate. As another example, the spec optionally allows a
zone size to be a power of 2, but linux requires a power of 2 if you
want to use it with this driver.

> Provided there's no glaring technical issues

There are many. Some may be improved through a serious review process,
but the mess it makes out of the fast path is measurably harmful to
devices that don't subscribe to this. That issue is not so easily
remedied.

Since this patch is a copy of the scsi implementation, the reasonable
thing is take this fight to the generic block layer for a common
solution. We're not taking this in the nvme driver.

I sincerely want to minimize any adverse impact to the fast-path of
non-zoned devices.
My understanding of that aspect is (I feel it is good to confirm,
irrespective of the future of this patch):

1. Submission path:
There is no extra code for non-zoned device IO. For append, there is
this "ns->append_emulate = 1" check.
Snippet -
case REQ_OP_ZONE_APPEND:
- ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
+ if (!nvme_is_append_emulated(ns))
+ ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
+ else {
+ /* prepare append like write, and adjust lba
afterwards */

2. Completion:
Not as clean as submission for sure.
The extra check is "ns && ns->append_emulate == 1" for completions if
CONFIG_ZONED is enabled.
A slightly better completion-handling version (than the submitted patch) is -

- } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
- req_op(req) == REQ_OP_ZONE_APPEND) {
- req->__sector = nvme_lba_to_sect(req->q->queuedata,
- le64_to_cpu(nvme_req(req)->result.u64));
+ } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) {
+ struct nvme_ns *ns = req->q->queuedata;
+ /* append-emulation requires wp update for some cmds*/
+ if (ns && nvme_is_append_emulated(ns)) {
+ if (nvme_need_zone_wp_update(req))
+ nvme_zone_wp_update(ns, req, status);
+ }
+ else if (req_op(req) == REQ_OP_ZONE_APPEND)
+ req->__sector = nvme_lba_to_sect(ns,
+ le64_to_cpu(nvme_req(req)->result.u64));

Am I missing any other fast-path pain-points?

A quick 1 minute 4K randwrite run (QD 64, 4 jobs,libaio) shows :
before: IOPS=270k, BW=1056MiB/s (1107MB/s)(61.9GiB/60002msec)
after: IOPS=270k, BW=1055MiB/s (1106MB/s)(61.8GiB/60005msec)

It is good to use the QEMU "simulation" path that we implemented to test
performance with different delays, etc., but for these numbers to make
sense we need to put them in contrast to the simulated NAND speed, etc.


This maynot be the best test to see the cost, and I am happy to
conduct more and improvise.

As for the volume of the code - it is same as SCSI emulation. And I
can make efforts to reduce that by moving common code to blk-zone, and
reuse in SCSI/NVMe emulation.
In the patch I tried to isolate append-emulation by keeping everything
into "nvme_za_emul". It contains nothing nvme'ish except nvme_ns,
which can be turned into "driver_data".

+#ifdef CONFIG_BLK_DEV_ZONED
+struct nvme_za_emul {
+ unsigned int nr_zones;
+ spinlock_t zones_wp_offset_lock;
+ u32 *zones_wp_offset;
+ u32 *rev_wp_offset;
+ struct work_struct zone_wp_offset_work;
+ char *zone_wp_update_buf;
+ struct mutex rev_mutex;
+ struct nvme_ns *ns;
+};
+#endif

Will that be an acceptable line of further work/discussions?

I know we spent time enabling this path, but I don't think that moving
the discussion to the block layer will have much more benefit.

Let's keep the support for these non-append devices in xNVMe and focus
on the support for the append-enabled ones in Linux. We have a lot of
good stuff in the backlog that we can start pushing.