Re: [RFC 3/8] nvmet: Use p2pmem in nvme target

From: Sagi Grimberg
Date: Thu Apr 06 2017 - 01:47:59 EST



I hadn't done this yet but I think a simple closest device in the tree
would solve the issue sufficiently. However, I originally had it so the
user has to pick the device and I prefer that approach. But if the user
picks the device, then why bother restricting what he picks?

Because the user can get it wrong, and it's our job to do what we can in
order to prevent the user from screwing things up.

Per the
thread with Sinan, I'd prefer to use what the user picks. You were one
of the biggest opponents to that so I'd like to hear your opinion on
removing the restrictions.

I wasn't against it that much, I'm all for making things "just work"
with minimal configuration steps, but I'm not sure we can get it
right without it.

Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
save an extra PCI transfer as the NVME card could just take the data
out of its own memory. However, at this time, cards with CMB buffers
don't seem to be available.

Even if they were available, it would be hard to make real use of this
given that we wouldn't know how to pre-post recv buffers (for in-capsule
data). But let's leave this out of the scope entirely...

I don't understand what you're referring to. We'd simply use the CMB
buffer as a p2pmem device, why does that change anything?

I'm referring to the in-capsule data buffer pre-posts that we do.
Because we prepare a buffer that will contain in-capsule data before the
command arrives, we have no knowledge of which device the incoming I/O
is directed to, which means we can (and will) have I/O where the data
lies in the CMB of device A but is really targeted at device B - which
sorta defeats the purpose of what we're trying to optimize here...

Why do you need this? You have a reference to the
queue itself.

This keeps track of whether the response was actually allocated with
p2pmem or not. It's needed for when we free the SGL because the queue
may have a p2pmem device assigned to it but, if the alloc failed and it
fell back on system memory then we need to know how to free it. I'm
currently looking at giving SGLs an iomem flag, in which case this
would no longer be needed as the flag in the SGL could be used.

That would be better, maybe...
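
To make the tracking point above concrete, here is a minimal userspace sketch, not the patch's code. Every name in it (demo_p2pmem, demo_rsp, from_p2pmem and the helpers) is hypothetical, and the "p2pmem device" is just a small bump pool. It only shows why the free path needs a record of which allocator actually supplied the buffer once a fallback to system memory is possible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a p2pmem device: a tiny bump pool. */
struct demo_p2pmem {
	unsigned char pool[4096];
	size_t used;	/* bytes handed out so far */
	size_t live;	/* outstanding allocations */
};

/* Hypothetical stand-in for the per-response state discussed above. */
struct demo_rsp {
	void *buf;
	size_t len;
	bool from_p2pmem;	/* which allocator actually supplied buf */
};

static void *demo_p2pmem_alloc(struct demo_p2pmem *p, size_t len)
{
	void *ptr;

	if (!p || len > sizeof(p->pool) - p->used)
		return NULL;	/* no device, or pool exhausted */
	ptr = p->pool + p->used;
	p->used += len;
	p->live++;
	return ptr;
}

static void demo_p2pmem_free(struct demo_p2pmem *p, void *ptr)
{
	assert(p != NULL && ptr != NULL);
	p->live--;	/* bump pool: only the accounting is undone */
}

/* Try p2pmem first and fall back to system memory if that fails. */
static int demo_rsp_alloc(struct demo_rsp *rsp, struct demo_p2pmem *p,
			  size_t len)
{
	rsp->buf = demo_p2pmem_alloc(p, len);
	rsp->from_p2pmem = rsp->buf != NULL;
	if (!rsp->buf)
		rsp->buf = malloc(len);
	if (!rsp->buf)
		return -1;
	rsp->len = len;
	return 0;
}

/* The queue may have a p2pmem device, yet this buffer may not be from it. */
static void demo_rsp_free(struct demo_rsp *rsp, struct demo_p2pmem *p)
{
	if (!rsp->buf)
		return;
	if (rsp->from_p2pmem)
		demo_p2pmem_free(p, rsp->buf);
	else
		free(rsp->buf);
	rsp->buf = NULL;
	rsp->from_p2pmem = false;
}
```

With a flag carried in the SGL itself, as suggested above, from_p2pmem would presumably move out of the response and into the scatterlist; the dispatch in the free path would look the same.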

[...]

This is a problem. Namespaces can be added at any point in time. No one
guarantees that dma_devs covers all the namespaces we'll ever see.

Yeah, well restricting p2pmem based on all the devices in use is hard.
So we'd need a call into the transport every time an ns is added and
we'd have to drop the p2pmem if they add one that isn't supported. This
complexity is just one of the reasons I prefer just letting the user choose.

Still, the user can get it wrong. Not sure we can get away without
keeping track of this as new devices join the subsystem.
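
A rough userspace sketch of the tracking described above: re-check the chosen p2pmem device every time a namespace is added, and drop back to system memory if the new backing device is not compatible. All names are hypothetical, and the "same switch" compatibility rule is only an assumption standing in for whatever check the real code would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_NS 8

/* Hypothetical: a namespace's backing device, keyed by its PCI switch. */
struct demo_ns_dev {
	int switch_id;
};

/* Hypothetical p2pmem device, also keyed by the switch it sits behind. */
struct demo_p2pmem {
	int switch_id;
};

struct demo_subsys {
	struct demo_ns_dev ns[DEMO_MAX_NS];
	size_t nr_ns;
	const struct demo_p2pmem *p2pmem;	/* NULL: use system memory */
};

/* Assumed rule for this sketch: usable only behind the same switch. */
static bool demo_p2pmem_compatible(const struct demo_p2pmem *p,
				   const struct demo_ns_dev *dev)
{
	return p->switch_id == dev->switch_id;
}

/*
 * Called whenever a namespace joins the subsystem. Returns 0 on success
 * or -1 if the table is full. An incompatible device drops p2pmem.
 */
static int demo_subsys_add_ns(struct demo_subsys *s, struct demo_ns_dev dev)
{
	assert(s != NULL);
	if (s->nr_ns == DEMO_MAX_NS)
		return -1;
	s->ns[s->nr_ns++] = dev;
	if (s->p2pmem && !demo_p2pmem_compatible(s->p2pmem, &dev))
		s->p2pmem = NULL;	/* fall back for all future I/O */
	return 0;
}
```

The sketch ignores I/O already in flight on p2pmem buffers when the device is dropped, which is presumably where most of the real complexity would be.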

+
+ if (queue->p2pmem)
+ pr_debug("using %s for rdma nvme target queue",
+ dev_name(&queue->p2pmem->dev));
+
+ kfree(dma_devs);
+}
+
static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
struct rdma_cm_event *event)
{
@@ -1199,6 +1271,8 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
}
queue->port = cm_id->context;

+ nvmet_rdma_queue_setup_p2pmem(queue);
+

Why is all this done for each queue? Looks completely redundant to me.

A little bit. Where would you put it?

I think we'll need a representation of a controller in nvmet-rdma for
that. We sort of got away without it so far, but I don't think we can
anymore with this.
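
As a sketch of the idea only: if there were a per-controller object, the p2pmem lookup could be done once and every queue of that controller could reuse the result. The types and functions below are hypothetical userspace stand-ins (nvmet-rdma has no such object today, which is the point above), and the sketch assumes the queue already knows its controller when it connects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical p2pmem device handle. */
struct demo_p2pmem {
	const char *name;
};

/* Hypothetical per-controller state that caches the selection. */
struct demo_ctrl {
	const struct demo_p2pmem *p2pmem;
	bool p2pmem_resolved;
	unsigned int lookups;	/* counts searches, for demonstration */
};

struct demo_queue {
	struct demo_ctrl *ctrl;
	const struct demo_p2pmem *p2pmem;
};

static const struct demo_p2pmem demo_dev = { "p2pmem0" };

/* Stand-in for the device search that the patch runs for every queue. */
static const struct demo_p2pmem *demo_find_p2pmem(struct demo_ctrl *ctrl)
{
	ctrl->lookups++;
	return &demo_dev;
}

/* Resolve once per controller; later queues just copy the result. */
static void demo_queue_connect(struct demo_queue *q, struct demo_ctrl *ctrl)
{
	assert(q != NULL && ctrl != NULL);
	if (!ctrl->p2pmem_resolved) {
		ctrl->p2pmem = demo_find_p2pmem(ctrl);
		ctrl->p2pmem_resolved = true;
	}
	q->ctrl = ctrl;
	q->p2pmem = ctrl->p2pmem;
}
```

Connecting four queues to the same demo_ctrl runs the search once instead of four times.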

ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
if (ret)
goto release_queue;

You seemed to skip the in-capsule buffers for p2pmem (inline_page), I'm
curious why?

Yes, the thinking was that these transfers were small anyway so there
would not be significant benefit to pushing them through p2pmem. There's
really no reason why we couldn't do that if it made sense to though.

I don't see an urgent reason for it either. I was just curious...