[PATCH 0/9] RFC: NVME VFIO mediated device

From: Maxim Levitsky
Date: Tue Mar 19 2019 - 10:58:27 EST


Oops, I placed the subject in the wrong place.

Best regards,
Maxim Levitsky

On Tue, 2019-03-19 at 16:41 +0200, Maxim Levitsky wrote:
> Date: Tue, 19 Mar 2019 14:45:45 +0200
> Subject: [PATCH 0/9] RFC: NVME VFIO mediated device
>
> Hi everyone!
>
> In this patch series I would like to introduce my take on the problem of
> virtualizing storage as fast as possible, with an emphasis on low latency.
>
> It implements a kernel, VFIO-based mediated device that allows the user to
> pass through a partition and/or a whole namespace to a guest.
>
> The idea behind this driver is based on the paper you can find at
> https://www.usenix.org/conference/atc18/presentation/peng,
>
> although note that I started the development independently, prior to
> reading this paper.
>
> In addition, the implementation is not based on the code used in the paper,
> as I was not able to obtain its source code at that time.
>
> ***Key points about the implementation:***
>
> * A polling kernel thread is used. Polling is stopped after a
> predefined timeout (1/2 sec by default).
> Support for a fully interrupt-driven mode is planned; a proof of concept
> already shows promising results.
> (A rough sketch of the polling loop is shown after this list.)
>
> * The guest sees a standard NVMe device - this allows running guests with
> unmodified drivers, for example Windows guests.
>
> * The NVMe device is shared between host and guest.
> That means that even a single namespace can be split between host
> and guest based on different partitions.
>
> * Simple configuration
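>
> As a rough sketch (not code from the patches; mdev_poll_queues(),
> mdev_has_work() and the vctrl structure are illustrative names only),
> the per-device IO thread behaves roughly like this:
>
> /*
>  * Simplified sketch of the polling IO thread: poll the queues until no
>  * work has been seen for the configured timeout (1/2 sec by default),
>  * then sleep until the guest rings a doorbell again.
>  */
> static int mdev_io_thread(void *data)
> {
> 	struct mdev_vctrl *vctrl = data;      /* illustrative per-device state */
> 	unsigned long idle_timeout = HZ / 2;  /* 1/2 sec of inactivity */
> 	unsigned long last_work = jiffies;
>
> 	while (!kthread_should_stop()) {
> 		/* mdev_poll_queues() is a stand-in for the real polling code */
> 		if (mdev_poll_queues(vctrl))
> 			last_work = jiffies;
>
> 		if (time_after(jiffies, last_work + idle_timeout)) {
> 			/* idle for too long: stop polling, wait for a doorbell */
> 			wait_event_interruptible(vctrl->wait,
> 						 mdev_has_work(vctrl) ||
> 						 kthread_should_stop());
> 			last_work = jiffies;
> 		} else {
> 			cond_resched();
> 		}
> 	}
> 	return 0;
> }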
>
> *** Performance ***
>
> Performance was tested on an Intel DC P3700 with a Xeon E5-2620 v2;
> both latency and throughput are very similar to SPDK.
>
> Soon I will test this on a better server and NVMe device and provide
> more formal performance numbers.
>
> Latency numbers:
> ~80ms - spdk with fio plugin on the host.
> ~84ms - nvme driver on the host
> ~87ms - mdev-nvme + nvme driver in the guest
>
> Throughput followed a similar pattern.
>
> *** Configuration example ***
> $ modprobe nvme mdev_queues=4
> $ modprobe nvme-mdev
>
> $ UUID=$(uuidgen)
> $ DEVICE='device pci address'
> $ echo $UUID > /sys/bus/pci/devices/$DEVICE/mdev_supported_types/nvme-2Q_V1/create
> $ echo n1p3 > /sys/bus/mdev/devices/$UUID/namespaces/add_namespace # attach host namespace 1, partition 3
> $ echo 11 > /sys/bus/mdev/devices/$UUID/settings/iothread_cpu # pin the IO thread to CPU 11
>
> Afterwards, boot QEMU with
> -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID
>
> Zero configuration on the guest.
>
> *** FAQ ***
>
> * Why do this in the kernel? Why is this better than SPDK?
>
> -> Reuse the existing nvme kernel driver in the host. No new drivers in the
> guest.
>
> -> Share the NVMe device between host and guest.
> Even in fully virtualized configurations,
> some partitions of an NVMe device could be exposed to guests as emulated
> block devices while others are passed through with nvme-mdev, balancing
> the full feature set of IO stack emulation against raw performance.
>
> -> NVMe-mdev is a bit faster because the in-kernel driver
> can send interrupts to the guest directly, without a context
> switch that can be expensive due to the Meltdown mitigations.
>
> -> It is able to use interrupts to get reasonable performance.
> This is currently only implemented as a proof of concept and is not
> included in the patches, but the interrupt-driven mode already shows
> reasonable performance.
>
> -> This is a framework that can later be used to support NVMe devices
> with more of the IO virtualization built in
> (an IOMMU with PASID support coupled with a device that supports it).
>
> * Why attach directly to the nvme-pci driver and not use block layer IO?
> -> The direct attachment allows for better performance, but I will
> look into the possibility of using block layer IO, especially for fabrics drivers.
>
> *** Implementation notes ***
>
> * All guest memory is mapped into the physical NVMe device,
> but not 1:1 as vfio-pci would do it.
> This allows very efficient DMA.
> To support this, patch 2 adds the ability for an mdev device to listen to
> the guest's memory map events.
> Any such memory is immediately pinned and then DMA mapped.
> (Support for fabric drivers, where this is not possible, exists too;
> in that case the fabric driver will do its own DMA mapping.)
>
> * The nvme core driver is modified to announce the appearance
> and disappearance of NVMe controllers and namespaces;
> the nvme-mdev driver subscribes to these events.
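>
> The shape of that interface is roughly the following (illustrative names
> only, not the exact symbols from the patches):
>
> /*
>  * Illustrative sketch: nvme-mdev registers a set of callbacks with the
>  * nvme core and gets notified when controllers and namespaces appear
>  * and disappear.
>  */
> struct nvme_mdev_subscriber {
> 	void (*ctrl_online)(struct nvme_ctrl *ctrl);
> 	void (*ctrl_offline)(struct nvme_ctrl *ctrl);
> 	void (*ns_added)(struct nvme_ctrl *ctrl, struct nvme_ns *ns);
> 	void (*ns_removed)(struct nvme_ctrl *ctrl, struct nvme_ns *ns);
> };
>
> int nvme_mdev_register_subscriber(struct nvme_mdev_subscriber *sub);
> void nvme_mdev_unregister_subscriber(struct nvme_mdev_subscriber *sub);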
>
> * The nvme-pci driver is modified to expose a raw interface for attaching to
> and submitting/polling the IO queues.
> This allows the mdev driver to submit and poll for IO very efficiently.
> By default, one host queue is used per mediated device.
> (Support for other, fabric-based host drivers is planned.)
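>
> Conceptually, the raw queue interface looks something like this (again,
> illustrative names and signatures, not the exact ones from patch 9):
>
> /*
>  * Illustrative sketch: the mdev driver allocates a dedicated host IO
>  * queue, submits raw NVMe commands on it and polls for completions,
>  * bypassing the block layer entirely.
>  */
> struct nvme_ext_queue_ops {
> 	/* allocate/free a dedicated host IO queue for a mediated device */
> 	int  (*ext_queue_alloc)(struct nvme_ctrl *ctrl, unsigned int *qid);
> 	void (*ext_queue_free)(struct nvme_ctrl *ctrl, unsigned int qid);
>
> 	/* submit a raw NVMe command on that queue */
> 	int  (*ext_queue_submit)(struct nvme_ctrl *ctrl, unsigned int qid,
> 				 struct nvme_command *cmd);
>
> 	/* poll the completion queue; returns the number of completions reaped */
> 	int  (*ext_queue_poll)(struct nvme_ctrl *ctrl, unsigned int qid,
> 			       struct nvme_completion *cqes, int max);
> };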
>
> * nvme-mdev doesn't assume the presence of KVM, so any VFIO user, including
> SPDK, or a QEMU running with TCG, ... can use this virtual device.
>
> *** Testing ***
>
> The device was tested with stock QEMU 3.0 on the host;
> the host was running a 5.0 kernel with nvme-mdev added, on the following
> hardware:
> * QEMU nvme virtual device (with nested guest)
> * Intel DC P3700 on Xeon E5-2620 v2 server
> * Samsung SM981 (in a Thunderbolt enclosure, with my laptop)
> * Lenovo NVME device found in my laptop
>
> The guest was tested with kernels 4.16, 4.18, 4.20 and
> the same custom-compiled 5.0 kernel.
> A Windows 10 guest was tested too, with both Microsoft's inbox driver and
> the open source community NVMe driver
> (https://lists.openfabrics.org/pipermail/nvmewin/2016-December/001420.html)
>
> Testing was mostly done on x86_64, but a 32-bit host/guest combination
> was lightly tested too.
>
> In addition to that, the virtual device was tested with a nested guest,
> by passing the virtual device to it
> using PCI passthrough, the QEMU userspace NVMe driver, and SPDK.
>
>
> PS: I used to contribute to the kernel as a hobby using the
> maximlevitsky@xxxxxxxxx address
>
> Maxim Levitsky (9):
> vfio/mdev: add .request callback
> nvme/core: add some more values from the spec
> nvme/core: add NVME_CTRL_SUSPENDED controller state
> nvme/pci: use the NVME_CTRL_SUSPENDED state
> nvme/pci: add known admin effects to augument admin effects log page
> nvme/pci: init shadow doorbell after each reset
> nvme/core: add mdev interfaces
> nvme/core: add nvme-mdev core driver
> nvme/pci: implement the mdev external queue allocation interface
>
> MAINTAINERS | 5 +
> drivers/nvme/Kconfig | 1 +
> drivers/nvme/Makefile | 1 +
> drivers/nvme/host/core.c | 149 +++++-
> drivers/nvme/host/nvme.h | 55 ++-
> drivers/nvme/host/pci.c | 385 ++++++++++++++-
> drivers/nvme/mdev/Kconfig | 16 +
> drivers/nvme/mdev/Makefile | 5 +
> drivers/nvme/mdev/adm.c | 873 ++++++++++++++++++++++++++++++++++
> drivers/nvme/mdev/events.c | 142 ++++++
> drivers/nvme/mdev/host.c | 491 +++++++++++++++++++
> drivers/nvme/mdev/instance.c | 802 +++++++++++++++++++++++++++++++
> drivers/nvme/mdev/io.c | 563 ++++++++++++++++++++++
> drivers/nvme/mdev/irq.c | 264 ++++++++++
> drivers/nvme/mdev/mdev.h | 56 +++
> drivers/nvme/mdev/mmio.c | 591 +++++++++++++++++++++++
> drivers/nvme/mdev/pci.c | 247 ++++++++++
> drivers/nvme/mdev/priv.h | 700 +++++++++++++++++++++++++++
> drivers/nvme/mdev/udata.c | 390 +++++++++++++++
> drivers/nvme/mdev/vcq.c | 207 ++++++++
> drivers/nvme/mdev/vctrl.c | 514 ++++++++++++++++++++
> drivers/nvme/mdev/viommu.c | 322 +++++++++++++
> drivers/nvme/mdev/vns.c | 356 ++++++++++++++
> drivers/nvme/mdev/vsq.c | 178 +++++++
> drivers/vfio/mdev/vfio_mdev.c | 11 +
> include/linux/mdev.h | 4 +
> include/linux/nvme.h | 88 +++-
> 27 files changed, 7375 insertions(+), 41 deletions(-)
> create mode 100644 drivers/nvme/mdev/Kconfig
> create mode 100644 drivers/nvme/mdev/Makefile
> create mode 100644 drivers/nvme/mdev/adm.c
> create mode 100644 drivers/nvme/mdev/events.c
> create mode 100644 drivers/nvme/mdev/host.c
> create mode 100644 drivers/nvme/mdev/instance.c
> create mode 100644 drivers/nvme/mdev/io.c
> create mode 100644 drivers/nvme/mdev/irq.c
> create mode 100644 drivers/nvme/mdev/mdev.h
> create mode 100644 drivers/nvme/mdev/mmio.c
> create mode 100644 drivers/nvme/mdev/pci.c
> create mode 100644 drivers/nvme/mdev/priv.h
> create mode 100644 drivers/nvme/mdev/udata.c
> create mode 100644 drivers/nvme/mdev/vcq.c
> create mode 100644 drivers/nvme/mdev/vctrl.c
> create mode 100644 drivers/nvme/mdev/viommu.c
> create mode 100644 drivers/nvme/mdev/vns.c
> create mode 100644 drivers/nvme/mdev/vsq.c
>