Re: [PATCH V3 1/1] ublk: add io_uring based userspace block driver

From: Ming Lei
Date: Tue Jul 05 2022 - 04:12:43 EST


On Tue, Jul 05, 2022 at 12:16:07PM +0800, Ziyang Zhang wrote:
> On 2022/7/5 06:10, Gabriel Krisman Bertazi wrote:
> > Ming Lei <ming.lei@xxxxxxxxxx> writes:
> >
> >> This is the driver part of userspace block driver(ublk driver), the other
> >> part is userspace daemon part(ublksrv)[1].
> >>
> >> The two parts communicate by io_uring's IORING_OP_URING_CMD with one
> >> shared cmd buffer for storing io command, and the buffer is read only for
> >> ublksrv, each io command is indexed by io request tag directly, and
> >> is written by ublk driver.
> >>
> >> For example, when one READ io request is submitted to ublk block driver, ublk
> >> driver stores the io command into cmd buffer first, then completes one
> >> IORING_OP_URING_CMD for notifying ublksrv, and the URING_CMD is issued to
> >> ublk driver beforehand by ublksrv for getting notification of any new io request,
> >> and each URING_CMD is associated with one io request by tag.
> >>
> >> After ublksrv gets the io command, it translates and handles the ublk io
> >> request, such as, for the ublk-loop target, ublksrv translates the request
> >> into same request on another file or disk, like the kernel loop block
> >> driver. In ublksrv's implementation, the io is still handled by io_uring,
> >> and share same ring with IORING_OP_URING_CMD command. When the target io
> >> request is done, the same IORING_OP_URING_CMD is issued to ublk driver for
> >> both committing io request result and getting future notification of new
> >> io request.
> >>
> >> Another thing done by ublk driver is to copy data between kernel io
> >> request and ublksrv's io buffer:
> >>
> >> 1) before ubsrv handles WRITE request, copy the request's data into
> >> ublksrv's userspace io buffer, so that ublksrv can handle the write
> >> request
> >>
> >> 2) after ubsrv handles READ request, copy ublksrv's userspace io buffer
> >> into this READ request, then ublk driver can complete the READ request
> >>
> >> Zero copy may be switched if mm is ready to support it.
> >>
> >> ublk driver doesn't handle any logic of the specific user space driver,
> >> so it should be small/simple enough.
> >>
> >> [1] ublksrv
> >>
> >> https://github.com/ming1/ubdsrv
> >>
> >> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> >
> > Hi Ming,
> >
> > A few comments inline:
> >
> >
> >> +#define UBLK_MINORS (1U << MINORBITS)
> >> +
> >> +struct ublk_rq_data {
> >> + struct callback_head work;
> >> +};
> >> +
> >> +/* io cmd is active: sqe cmd is received, and its cqe isn't done */
> >> +#define UBLK_IO_FLAG_ACTIVE 0x01
> >> +
> >> +/*
> >> + * FETCH io cmd is completed via cqe, and the io cmd is being handled by
> >> + * ublksrv, and not committed yet
> >> + */
> >> +#define UBLK_IO_FLAG_OWNED_BY_SRV 0x02
> >> +
> >
> > Minor nit: I wonder if the IO life cycle isn't better represented as a
> > state machine than flags:
> >
> > enum {
> > UBLK_IO_FREE,
> > UBLK_IO_QUEUED
> > UBLK_IO_OWNED_BY_SRV
> > UBLK_IO_COMPLETED,
> > UBLK_IO_ABORTED,
> > }
> >
> > Since currently, IO_FLAG_ACTIVE and IO_OWNED_BY_SRV should (almost) be
> > mutually exclusive.
> >
> >
> >> +
> >> +static int ublk_ctrl_stop_dev(struct ublk_device *ub)
> >> +{
> >> + ublk_stop_dev(ub);
> >> + cancel_work_sync(&ub->stop_work);
> >> + return 0;
> >> +}
> >> +
> >> +static inline bool ublk_queue_ready(struct ublk_queue *ubq)
> >> +{
> >> + return ubq->nr_io_ready == ubq->q_depth;
> >> +}
> >> +
> >> +/* device can only be started after all IOs are ready */
> >> +static void ublk_mark_io_ready(struct ublk_device *ub, struct ublk_queue *ubq)
> >> +{
> >> + mutex_lock(&ub->mutex);
> >> + ubq->nr_io_ready++;
> >
> > I think this is still problematic for the case where a FETCH_IO is sent
> > from a different thread than the one originally set in ubq_daemon
> > (i.e. a userspace bug). Since ubq_daemon is used to decide what task
> > context will do the data copy, If an IO_FETCH_RQ is sent to the same queue
> > from two threads, the data copy can happen in the context of the wrong
> > task. I'd suggest something like the check below at the beginning of
> > mark_io_ready and a similar on for IO_COMMIT_AND_FETCH_RQ
> >
> > mutex_lock(&ub->mutex);
> > if (ub->ubq_daemon && ub->ubq_daemon != current) {
> > mutex_unlock(&ub->mutex);
> > return -EINVAL;
> > }
> > ubq->nr_io_ready++;
> > ...
> >> + if (ublk_queue_ready(ubq)) {
> >> + ubq->ubq_daemon = current;
> >> + get_task_struct(ubq->ubq_daemon);
> >> + ub->nr_queues_ready++;
> >> + }
> >> + if (ub->nr_queues_ready == ub->dev_info.nr_hw_queues)
> >> + complete_all(&ub->completion);
> >> + mutex_unlock(&ub->mutex);
> >> +}
> >> +
> >> +static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
> >> +{
> >> + struct ublksrv_io_cmd *ub_cmd = (struct ublksrv_io_cmd *)cmd->cmd;
> >> + struct ublk_device *ub = cmd->file->private_data;
> >> + struct ublk_queue *ubq;
> >> + struct ublk_io *io;
> >> + u32 cmd_op = cmd->cmd_op;
> >> + unsigned tag = ub_cmd->tag;
> >> + int ret = -EINVAL;
> >> +
> >> + pr_devel("%s: receieved: cmd op %d queue %d tag %d result %d\n",
> >
> > ^^^
> > received
> >
> >
> >> + __func__, cmd->cmd_op, ub_cmd->q_id, tag,
> >> + ub_cmd->result);
> >> +
> >> + if (!(issue_flags & IO_URING_F_SQE128))
> >> + goto out;
> >> +
> >> + ubq = ublk_get_queue(ub, ub_cmd->q_id);
> >> + if (!ubq || ub_cmd->q_id != ubq->q_id)
> >
> > q_id is coming from userspace and is used to access an array inside
> > ublk_get_queue(). I think you need to ensure qid < ub->dev_info.nr_hw_queues
> > before calling ublk_get_queue() to protect from a kernel bad memory
> > access triggered by userspace.
> >
> >> + goto out;
> >> +
> >> + if (WARN_ON_ONCE(tag >= ubq->q_depth))
> >
> > Userspace shouldn't be able to easily trigger a WARN_ON.
> >
> >> + goto out;
> >> +
> >> + io = &ubq->ios[tag];
> >> +
> >> + /* there is pending io cmd, something must be wrong */
> >> + if (io->flags & UBLK_IO_FLAG_ACTIVE) {b
> >> + ret = -EBUSY;
> >> + goto out;
> >> + }
> >> +
> >> + switch (cmd_op) {
> >> + case UBLK_IO_FETCH_REQ:
> >> + /* UBLK_IO_FETCH_REQ is only allowed before queue is setup */
> >> + if (WARN_ON_ONCE(ublk_queue_ready(ubq))) {
> >
> > Likewise, this shouldn't trigger a WARN_ON, IMO.
> >
> >> + ret = -EBUSY;
> >> + goto out;
> >> + }
> >> + /*
> >> + * The io is being handled by server, so COMMIT_RQ is expected
> >> + * instead of FETCH_REQ
> >> + */
> >> + if (io->flags & UBLK_IO_FLAG_OWNED_BY_SRV)
> >> + goto out;
> >> + /* FETCH_RQ has to provide IO buffer */
> >> + if (!ub_cmd->addr)
> >> + goto out;
> >> + io->cmd = cmd;
> >> + io->flags |= UBLK_IO_FLAG_ACTIVE;
> >> + io->addr = ub_cmd->addr;
> >> +
> >> + ublk_mark_io_ready(ub, ubq);
> >> + break;
> >> + case UBLK_IO_COMMIT_AND_FETCH_REQ:
> >> + /* FETCH_RQ has to provide IO buffer */
> >> + if (!ub_cmd->addr)
> >> + goto out;
> >> + io->addr = ub_cmd->addr;
> >> + io->flags |= UBLK_IO_FLAG_ACTIVE;
> >> + fallthrough;
> >> + case UBLK_IO_COMMIT_REQ:
> >> + io->cmd = cmd;
> >> + if (!(io->flags & UBLK_IO_FLAG_OWNED_BY_SRV))
> >> + goto out;
> >> + ublk_commit_completion(ub, ub_cmd);
> >> +
> >> + /* COMMIT_REQ is supposed to not fetch req */
> >
> > I wonder if we could make it without IO_COMMIT_REQ. Is it useful to be
> > able to commit without fetching a new request?
>
> UBLK_IO_COMMIT_REQ is not necessary, IMO.
> In current version of ubd_drv.c I find UBLK_IO_COMMIT_REQ is sent by ublksrv
> after it gets one UBD_IO_RES_ABORT beacuse ubd_drv wants to abort IOs and let
> the ublk daemon exit.
>
> We can use UBLK_IO_COMMIT_AND_FETCH_REQ to replace UBLK_IO_COMMIT_REQ.
> The data flow could be:
>
> 1) UBLK_IO_COMMIT_AND_FETCH_REQ from ublksrv
>
> 2) ubd_drv receives IO's sqe with UBLK_IO_COMMIT_AND_FETCH_REQ
> and sets the IO's status to UBLK_IO_QUEUED

Not see where UBLK_IO_QUEUED is? :-)

>
> 3) ubd_drv wants to abort IOs so it just completes
> this IO's cqe(UBD_IO_RES_ABORT)
>
> I successfully removed UBLK_IO_COMMIT_REQ when developing libubd
> although I choose the earliest version of ubd_drv.c(v5.17-ubd-dev)
> which may be a buggy version.

You can verify it with latest ublksrv(libublksrv has been in master)
too by the following patch, then 'make test T=generic' can run
successfully.

diff --git a/lib/ublksrv.c b/lib/ublksrv.c
index c4bb2f4..aee71a0 100644
--- a/lib/ublksrv.c
+++ b/lib/ublksrv.c
@@ -150,7 +150,7 @@ static inline int ublksrv_queue_io_cmd(struct ublksrv_queue *q,
else
cmd_op = UBLK_IO_FETCH_REQ;
} else if (io->flags & UBLKSRV_NEED_COMMIT_RQ_COMP) {
- cmd_op = UBLK_IO_COMMIT_REQ;
+ cmd_op = UBLK_IO_COMMIT_AND_FETCH_REQ;
} else {
syslog(LOG_ERR, "io flags is zero, tag %d\n",
(int)cmd->tag);

ublk_cancel_queue() will cancel all these io commands.


Thanks,
Ming