Re: [PATCH RFC] Persistent grant maps for xen blk drivers

From: Roger Pau Monné
Date: Tue Oct 23 2012 - 12:07:31 EST


On 22/10/12 15:47, Konrad Rzeszutek Wilk wrote:
> On Thu, Oct 18, 2012 at 01:22:01PM +0200, Roger Pau Monne wrote:
>> This patch implements persistent grants for the xen-blk{front,back}
>> mechanism. The effect of this change is to reduce the number of unmap
>> operations performed, since they cause a (costly) TLB shootdown. This
>> allows the I/O performance to scale better when a large number of VMs
>> are performing I/O.
>>
>> Previously, the blkfront driver was supplied a bvec[] from the request
>> queue. This was granted to dom0; dom0 performed the I/O and wrote
>> directly into the grant-mapped memory and unmapped it; blkfront then
>> removed foreign access for that grant. The cost of unmapping scales
>> badly with the number of CPUs in Dom0. An experiment showed that when
>> Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
>> ramdisk, the IPIs from performing unmap's is a bottleneck at 5 guests
>> (at which point 650,000 IOPS are being performed in total). If more
>> than 5 guests are used, the performance declines. By 10 guests, only
>> 400,000 IOPS are being performed.
>>
>> This patch improves performance by only unmapping when the connection
>> between blkfront and back is broken.
>>
>> On startup blkfront notifies blkback that it is using persistent
>> grants, and blkback will do the same. If blkback is not capable of
>> persistent mapping, blkfront will still use the same grants, since it
>> is compatible with the previous protocol, and simplifies the code
>> complexity in blkfront.
>>
>> To perform a read, in persistent mode, blkfront uses a separate pool
>> of pages that it maps to dom0. When a request comes in, blkfront
>> transmutes the request so that blkback will write into one of these
>> free pages. Blkback keeps note of which grefs it has already
>> mapped. When a new ring request comes to blkback, it looks to see if
>> it has already mapped that page. If so, it will not map it again. If
>> the page hasn't been previously mapped, it is mapped now, and a record
>> is kept of this mapping. Blkback proceeds as usual. When blkfront is
>> notified that blkback has completed a request, it memcpy's from the
>> shared memory, into the bvec supplied. A record that the {gref, page}
>> tuple is mapped, and not inflight is kept.
>>
>> Writes are similar, except that the memcpy is peformed from the
>> supplied bvecs, into the shared pages, before the request is put onto
>> the ring.
>>
>> Blkback stores a mapping of grefs=>{page mapped to by gref} in
>> a red-black tree. As the grefs are not known apriori, and provide no
>> guarantees on their ordering, we have to perform a search
>> through this tree to find the page, for every gref we receive. This
>> operation takes O(log n) time in the worst case.
>
> Might want to mention how blkfront stores it as well. It looks
> to be using 'llist' instead of 'list'? Any particular reason?

Since we are just pushing and poping grant references, I went for what I
think is the simplest one, a single linked list (list is a doubly linked
list). Oliver in the previous version was using something similar, but
custom made. I think it's best to use the data structures provided by
the kernel itself.

>
>>
>> The maximum number of grants that blkback will persistenly map is
>> currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to
>> prevent a malicios guest from attempting a DoS, by supplying fresh
>> grefs, causing the Dom0 kernel to map excessively. If a guest
>> is using persistent grants and exceeds the maximum number of grants to
>> map persistenly the newly passed grefs will be mapped and unmaped.
>> Using this approach, we can have requests that mix persistent and
>> non-persistent grants, and we need to handle them correctly.
>> This allows us to set the maximum number of persistent grants to a
>> lower value than RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although
>> setting it will lead to unpredictable performance.
>>
>> In writing this patch, the question arrises as to if the additional
>> cost of performing memcpys in the guest (to/from the pool of granted
>> pages) outweigh the gains of not performing TLB shootdowns. The answer
>> to that question is `no'. There appears to be very little, if any
>> additional cost to the guest of using persistent grants. There is
>> perhaps a small saving, from the reduced number of hypercalls
>> performed in granting, and ending foreign access.
>>
>> Signed-off-by: Oliver Chick <oliver.chick@xxxxxxxxxx>
>> Signed-off-by: Roger Pau Monne <roger.pau@xxxxxxxxxx>
>> Cc: <konrad.wilk@xxxxxxxxxx>
>> Cc: <linux-kernel@xxxxxxxxxxxxxxx>
>> ---
>> Benchmarks showing the impact of this patch in blk performance can be
>> found at:
>>
>> http://xenbits.xensource.com/people/royger/persistent_grants/
>> ---
>> drivers/block/xen-blkback/blkback.c | 279 +++++++++++++++++++++++++++++++---
>> drivers/block/xen-blkback/common.h | 17 ++
>> drivers/block/xen-blkback/xenbus.c | 16 ++-
>> drivers/block/xen-blkfront.c | 183 ++++++++++++++++++++----
>> 4 files changed, 442 insertions(+), 53 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
>> index c6decb9..2b982b2 100644
>> --- a/drivers/block/xen-blkback/blkback.c
>> +++ b/drivers/block/xen-blkback/blkback.c
>> @@ -78,6 +78,7 @@ struct pending_req {
>> unsigned short operation;
>> int status;
>> struct list_head free_list;
>> + unsigned int unmap_seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> };
>>
>> #define BLKBACK_INVALID_HANDLE (~0)
>> @@ -98,6 +99,30 @@ struct xen_blkbk {
>> static struct xen_blkbk *blkbk;
>>
>> /*
>> + * Maximum number of grant pages that can be mapped in blkback.
>> + * BLKIF_MAX_SEGMENTS_PER_REQUEST * RING_SIZE is the maximum number of
>> + * pages that blkback will persistently map.
>> + */
>> +static inline unsigned int max_mapped_grant_pages(enum blkif_protocol protocol)
>> +{
>> + switch (protocol) {
>> + case BLKIF_PROTOCOL_NATIVE:
>> + return __CONST_RING_SIZE(blkif, PAGE_SIZE) *
>> + BLKIF_MAX_SEGMENTS_PER_REQUEST;
>> + case BLKIF_PROTOCOL_X86_32:
>> + return __CONST_RING_SIZE(blkif_x86_32, PAGE_SIZE) *
>> + BLKIF_MAX_SEGMENTS_PER_REQUEST;
>> + case BLKIF_PROTOCOL_X86_64:
>> + return __CONST_RING_SIZE(blkif_x86_64, PAGE_SIZE) *
>> + BLKIF_MAX_SEGMENTS_PER_REQUEST;
>
> Could you include in the comments what the size (bytes) you expect it to be?
> If that would require you re-doing some tests - then don't worry - but
> I figured you have some notes scribbled away that have the exact values
> down.

As far as I know and remember (I've checked the ring size in the past),
all ring types have a size of 32, BLKIF_MAX_SEGMENTS_PER_REQUEST is
always 11, and sizeof(struct persistent_gnt) is 48, so that's 32 * 11 *
48 = 16896 bytes. I will add a comment with this calculation.

>
>> + default:
>> + BUG();
>> + }
>> + return 0;
>> +}
>> +
>> +
>> +/*
>> * Little helpful macro to figure out the index and virtual address of the
>> * pending_pages[..]. For each 'pending_req' we have have up to
>> * BLKIF_MAX_SEGMENTS_PER_REQUEST (11) pages. The seg would be from 0 through
>> @@ -128,6 +153,57 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>> static void make_response(struct xen_blkif *blkif, u64 id,
>> unsigned short op, int st);
>>
>> +#define foreach_grant(pos, rbtree, node) \
>> + for ((pos) = container_of(rb_first((rbtree)), typeof(*(pos)), node); \
>> + &(pos)->node != NULL; \
>> + (pos) = container_of(rb_next(&(pos)->node), typeof(*(pos)), node))
>> +
>> +
>> +static void add_persistent_gnt(struct rb_root *root,
>> + struct persistent_gnt *persistent_gnt)
>> +{
>> + struct rb_node **new = &(root->rb_node), *parent = NULL;
>> + struct persistent_gnt *this;
>> +
>> + /* Figure out where to put new node */
>> + while (*new) {
>> + this = container_of(*new, struct persistent_gnt, node);
>> +
>> + parent = *new;
>> + if (persistent_gnt->gnt < this->gnt)
>> + new = &((*new)->rb_left);
>> + else if (persistent_gnt->gnt > this->gnt)
>> + new = &((*new)->rb_right);
>> + else {
>> + pr_alert(DRV_PFX " trying to add a gref that's already in the tree\n");
>> + BUG();
>> + }
>> + }
>> +
>> + /* Add new node and rebalance tree. */
>> + rb_link_node(&(persistent_gnt->node), parent, new);
>> + rb_insert_color(&(persistent_gnt->node), root);
>> +}
>> +
>> +static struct persistent_gnt *get_persistent_gnt(struct rb_root *root,
>> + grant_ref_t gref)
>> +{
>> + struct persistent_gnt *data;
>> + struct rb_node *node = root->rb_node;
>> +
>> + while (node) {
>> + data = container_of(node, struct persistent_gnt, node);
>> +
>> + if (gref < data->gnt)
>> + node = node->rb_left;
>> + else if (gref > data->gnt)
>> + node = node->rb_right;
>> + else
>> + return data;
>> + }
>> + return NULL;
>> +}
>> +
>> /*
>> * Retrieve from the 'pending_reqs' a free pending_req structure to be used.
>> */
>> @@ -274,6 +350,11 @@ int xen_blkif_schedule(void *arg)
>> {
>> struct xen_blkif *blkif = arg;
>> struct xen_vbd *vbd = &blkif->vbd;
>> + struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> + struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> + struct persistent_gnt *persistent_gnt;
>> + int ret = 0;
>> + int segs_to_unmap = 0;
>>
>> xen_blkif_get(blkif);
>>
>> @@ -301,6 +382,32 @@ int xen_blkif_schedule(void *arg)
>> print_stats(blkif);
>> }
>>
>> + /* Free all persistent grant pages */
>> + foreach_grant(persistent_gnt, &blkif->persistent_gnts, node) {
>
> Just for sanity - you did check this with blkfronts that did not have
> persistent grants enabled, right?

Yes, it doesn't crash, but looking at foreach_grant it seems like it
should. I've added a check before trying to iterate over the tree.

>
>> + BUG_ON(persistent_gnt->handle == BLKBACK_INVALID_HANDLE);
>> + gnttab_set_unmap_op(&unmap[segs_to_unmap],
>> + (unsigned long) pfn_to_kaddr(page_to_pfn(
>> + persistent_gnt->page)),
>> + GNTMAP_host_map,
>> + persistent_gnt->handle);
>> +
>> + pages[segs_to_unmap] = persistent_gnt->page;
>> + rb_erase(&persistent_gnt->node, &blkif->persistent_gnts);
>> + kfree(persistent_gnt);
>> + blkif->persistent_gnt_c--;
>> +
>> + if (++segs_to_unmap == BLKIF_MAX_SEGMENTS_PER_REQUEST ||
>> + !rb_next(&persistent_gnt->node)) {
>> + ret = gnttab_unmap_refs(unmap, NULL, pages,
>> + segs_to_unmap);
>> + BUG_ON(ret);
>> + segs_to_unmap = 0;
>> + }
>> + }
>> +
>> + BUG_ON(blkif->persistent_gnt_c != 0);
>> + BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts));
>> +
>> if (log_stats)
>> print_stats(blkif);
>>
>> @@ -327,6 +434,8 @@ static void xen_blkbk_unmap(struct pending_req *req)
>> int ret;
>>
>> for (i = 0; i < req->nr_pages; i++) {
>> + if (!req->unmap_seg[i])
>> + continue;
>
> Perhaps there should be a #define for that array..

Do you mean something like:

#define unmap(req, i) req->unmap_seg[i]

>> handle = pending_handle(req, i);
>> if (handle == BLKBACK_INVALID_HANDLE)
>> continue;
>> @@ -343,12 +452,26 @@ static void xen_blkbk_unmap(struct pending_req *req)
>>
>> static int xen_blkbk_map(struct blkif_request *req,
>> struct pending_req *pending_req,
>> - struct seg_buf seg[])
>> + struct seg_buf seg[],
>> + struct page *pages[])
>> {
>> struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> - int i;
>> + struct persistent_gnt *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> + struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> + struct persistent_gnt *persistent_gnt = NULL;
>> + struct xen_blkif *blkif = pending_req->blkif;
>> + phys_addr_t addr = 0;
>> + int i, j;
>> + int new_map;
>
> Just use a bool for this.

Done

>> int nseg = req->u.rw.nr_segments;
>> + int segs_to_map = 0;
>> int ret = 0;
>> + int use_persistent_gnts;
>> +
>> + use_persistent_gnts = (blkif->vbd.feature_gnt_persistent);
>> +
>> + BUG_ON(blkif->persistent_gnt_c >
>> + max_mapped_grant_pages(pending_req->blkif->blk_protocol));
>>
>> /*
>> * Fill out preq.nr_sects with proper amount of sectors, and setup
>> @@ -358,36 +481,141 @@ static int xen_blkbk_map(struct blkif_request *req,
>> for (i = 0; i < nseg; i++) {
>> uint32_t flags;
>>
>> - flags = GNTMAP_host_map;
>> - if (pending_req->operation != BLKIF_OP_READ)
>> - flags |= GNTMAP_readonly;
>> - gnttab_set_map_op(&map[i], vaddr(pending_req, i), flags,
>> - req->u.rw.seg[i].gref,
>> - pending_req->blkif->domid);
>> + if (use_persistent_gnts)
>> + persistent_gnt = get_persistent_gnt(
>> + &blkif->persistent_gnts,
>> + req->u.rw.seg[i].gref);
>> +
>> + if (persistent_gnt) {
>> + /*
>> + * We are using persistent grants and
>> + * the grant is already mapped
>> + */
>> + new_map = 0;
>> + } else if (use_persistent_gnts &&
>> + blkif->persistent_gnt_c <
>> + max_mapped_grant_pages(blkif->blk_protocol)) {
>> + /*
>> + * We are using persistent grants, the grant is
>> + * not mapped but we have room for it
>> + */
>> + new_map = 1;
>> + persistent_gnt = kzalloc(
>> + sizeof(struct persistent_gnt),
>> + GFP_KERNEL);
>> + if (!persistent_gnt)
>> + return -ENOMEM;
>> + persistent_gnt->page = alloc_page(GFP_KERNEL);
>> + if (!persistent_gnt->page) {
>> + kfree(persistent_gnt);
>> + return -ENOMEM;
>> + }
>> + persistent_gnt->gnt = req->u.rw.seg[i].gref;
>> +
>> + pages_to_gnt[segs_to_map] =
>> + persistent_gnt->page;
>> + addr = (unsigned long) pfn_to_kaddr(
>> + page_to_pfn(persistent_gnt->page));
>> +
>> + add_persistent_gnt(&blkif->persistent_gnts,
>> + persistent_gnt);
>> + blkif->persistent_gnt_c++;
>> + pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
>> + persistent_gnt->gnt, blkif->persistent_gnt_c,
>> + max_mapped_grant_pages(blkif->blk_protocol));
>> + } else {
>> + /*
>> + * We are either using persistent grants and
>> + * hit the maximum limit of grants mapped,
>> + * or we are not using persistent grants.
>> + */
>> + if (use_persistent_gnts &&
>> + !blkif->vbd.overflow_max_grants) {
>> + blkif->vbd.overflow_max_grants = 1;
>> + pr_alert(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
>> + blkif->domid, blkif->vbd.handle);
>> + }
>> + new_map = 1;
>> + pages[i] = blkbk->pending_page(pending_req, i);
>> + addr = vaddr(pending_req, i);
>> + pages_to_gnt[segs_to_map] =
>> + blkbk->pending_page(pending_req, i);
>> + }
>> +
>> + if (persistent_gnt) {
>> + pages[i] = persistent_gnt->page;
>> + persistent_gnts[i] = persistent_gnt;
>> + } else {
>> + persistent_gnts[i] = NULL;
>> + }
>> +
>> + if (new_map) {
>> + flags = GNTMAP_host_map;
>> + if (!persistent_gnt &&
>> + (pending_req->operation != BLKIF_OP_READ))
>> + flags |= GNTMAP_readonly;
>> + gnttab_set_map_op(&map[segs_to_map++], addr,
>> + flags, req->u.rw.seg[i].gref,
>> + blkif->domid);
>> + }
>> }
>>
>> - ret = gnttab_map_refs(map, NULL, &blkbk->pending_page(pending_req, 0), nseg);
>> - BUG_ON(ret);
>> + if (segs_to_map) {
>> + ret = gnttab_map_refs(map, NULL, pages_to_gnt, segs_to_map);
>> + BUG_ON(ret);
>> + }
>>
>> /*
>> * Now swizzle the MFN in our domain with the MFN from the other domain
>> * so that when we access vaddr(pending_req,i) it has the contents of
>> * the page from the other domain.
>> */
>> - for (i = 0; i < nseg; i++) {
>> - if (unlikely(map[i].status != 0)) {
>> - pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
>> - map[i].handle = BLKBACK_INVALID_HANDLE;
>> - ret |= 1;
>> + for (i = 0, j = 0; i < nseg; i++) {
>> + if (!persistent_gnts[i] || !persistent_gnts[i]->handle) {
>> + /* This is a newly mapped grant */
>> + BUG_ON(j >= segs_to_map);
>> + if (unlikely(map[j].status != 0)) {
>
> I am not seeing j being incremented anywhere? Should it?

Yes, it should be incremented, but not here. See the comment below to
see what I've changed.

>
>> + pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
>> + map[j].handle = BLKBACK_INVALID_HANDLE;
>> + ret |= 1;
>> + if (persistent_gnts[i]) {
>> + rb_erase(&persistent_gnts[i]->node,
>> + &blkif->persistent_gnts);
>> + blkif->persistent_gnt_c--;
>> + kfree(persistent_gnts[i]);
>> + persistent_gnts[i] = NULL;
>> + }
>> + }
>> + }
>> + if (persistent_gnts[i]) {
>> + if (!persistent_gnts[i]->handle) {
>> + /*
>> + * If this is a new persistent grant
>> + * save the handler
>> + */
>> + persistent_gnts[i]->handle = map[j].handle;
>> + persistent_gnts[i]->dev_bus_addr =
>> + map[j++].dev_bus_addr;
>> + }
>> + pending_handle(pending_req, i) =
>> + persistent_gnts[i]->handle;
>> + pending_req->unmap_seg[i] = 0;
>
> Could we have a #define for that?

Sure.

>> +
>> + if (ret)
>> + continue;

This should be

if (ret) {
j++;
continue;
}

>> +
>> + seg[i].buf = persistent_gnts[i]->dev_bus_addr |
>> + (req->u.rw.seg[i].first_sect << 9);
>> + } else {
>> + pending_handle(pending_req, i) = map[j].handle;
>> + pending_req->unmap_seg[i] = 1;
>
> And here as well?

Done.

>> +
>> + if (ret)
>> + continue;
>> +
>> + seg[i].buf = map[j++].dev_bus_addr |
>> + (req->u.rw.seg[i].first_sect << 9);
>> }
>> -
>> - pending_handle(pending_req, i) = map[i].handle;
>> -
>> - if (ret)
>> - continue;
>> -
>> - seg[i].buf = map[i].dev_bus_addr |
>> - (req->u.rw.seg[i].first_sect << 9);
>> }
>> return ret;
>> }
>> @@ -590,6 +818,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>> int operation;
>> struct blk_plug plug;
>> bool drain = false;
>> + struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>>
>> switch (req->operation) {
>> case BLKIF_OP_READ:
>> @@ -676,7 +905,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>> * the hypercall to unmap the grants - that is all done in
>> * xen_blkbk_unmap.
>> */
>> - if (xen_blkbk_map(req, pending_req, seg))
>> + if (xen_blkbk_map(req, pending_req, seg, pages))
>> goto fail_flush;
>>
>> /*
>> @@ -688,7 +917,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>> for (i = 0; i < nseg; i++) {
>> while ((bio == NULL) ||
>> (bio_add_page(bio,
>> - blkbk->pending_page(pending_req, i),
>> + pages[i],
>> seg[i].nsec << 9,
>> seg[i].buf & ~PAGE_MASK) == 0)) {
>>
>> diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
>> index 9ad3b5e..6c08ee9 100644
>> --- a/drivers/block/xen-blkback/common.h
>> +++ b/drivers/block/xen-blkback/common.h
>> @@ -34,6 +34,7 @@
>> #include <linux/vmalloc.h>
>> #include <linux/wait.h>
>> #include <linux/io.h>
>> +#include <linux/rbtree.h>
>> #include <asm/setup.h>
>> #include <asm/pgalloc.h>
>> #include <asm/hypervisor.h>
>> @@ -160,10 +161,22 @@ struct xen_vbd {
>> sector_t size;
>> bool flush_support;
>> bool discard_secure;
>> +
>> + unsigned int feature_gnt_persistent:1;
>> + unsigned int overflow_max_grants:1;
>
> I think the v3.7-rc1 has this structure changed just a tiny bit, so you
> might want to rebase it on top of that.

I've done the rebase on top of your linux-next branch, commit
ad502612c173fff239250c9fe9bdfaaef70b9901.

>
>> };
>>
>> struct backend_info;
>>
>> +
>> +struct persistent_gnt {
>> + struct page *page;
>> + grant_ref_t gnt;
>> + grant_handle_t handle;
>> + uint64_t dev_bus_addr;
>> + struct rb_node node;
>> +};
>> +
>> struct xen_blkif {
>> /* Unique identifier for this interface. */
>> domid_t domid;
>> @@ -190,6 +203,10 @@ struct xen_blkif {
>> struct task_struct *xenblkd;
>> unsigned int waiting_reqs;
>>
>> + /* frontend feature information */
>
> Huh?

Changed it to:

/* tree to store persistent grants */

>> + struct rb_root persistent_gnts;
>> + unsigned int persistent_gnt_c;
>> +
>> /* statistics */
>> unsigned long st_print;
>> int st_rd_req;
>> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
>> index 4f66171..9f88b4e 100644
>> --- a/drivers/block/xen-blkback/xenbus.c
>> +++ b/drivers/block/xen-blkback/xenbus.c
>> @@ -118,6 +118,7 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
>> atomic_set(&blkif->drain, 0);
>> blkif->st_print = jiffies;
>> init_waitqueue_head(&blkif->waiting_to_free);
>> + blkif->persistent_gnts.rb_node = NULL;
>>
>> return blkif;
>> }
>> @@ -721,6 +722,7 @@ static int connect_ring(struct backend_info *be)
>> struct xenbus_device *dev = be->dev;
>> unsigned long ring_ref;
>> unsigned int evtchn;
>> + unsigned int pers_grants;
>> char protocol[64] = "";
>> int err;
>>
>> @@ -750,8 +752,18 @@ static int connect_ring(struct backend_info *be)
>> xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol);
>> return -1;
>> }
>> - pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s)\n",
>> - ring_ref, evtchn, be->blkif->blk_protocol, protocol);
>> + err = xenbus_gather(XBT_NIL, dev->otherend,
>> + "feature-persistent-grants", "%u",
>> + &pers_grants, NULL);
>> + if (err)
>> + pers_grants = 0;
>> +
>> + be->blkif->vbd.feature_gnt_persistent = pers_grants;
>> + be->blkif->vbd.overflow_max_grants = 0;
>> +
>> + pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) persistent %d\n",
>> + ring_ref, evtchn, be->blkif->blk_protocol, protocol,
>> + pers_grants);
>
> Can you make that a string? So it is
> pers_grants ? "persistent grants" : ""
>
> instead of a zero of one value pls?

Yes, done.

>>
>> /* Map the shared frame, irq etc. */
>> err = xen_blkif_map(be->blkif, ring_ref, evtchn);
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index 2c2d2e5..206d422 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -44,6 +44,7 @@
>> #include <linux/mutex.h>
>> #include <linux/scatterlist.h>
>> #include <linux/bitmap.h>
>> +#include <linux/llist.h>
>>
>> #include <xen/xen.h>
>> #include <xen/xenbus.h>
>> @@ -64,10 +65,17 @@ enum blkif_state {
>> BLKIF_STATE_SUSPENDED,
>> };
>>
>> +struct grant {
>> + grant_ref_t gref;
>> + unsigned long pfn;
>> + struct llist_node node;
>> +};
>> +
>> struct blk_shadow {
>> struct blkif_request req;
>> struct request *request;
>> unsigned long frame[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> + struct grant *grants_used[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> };
>>
>> static DEFINE_MUTEX(blkfront_mutex);
>> @@ -97,6 +105,8 @@ struct blkfront_info
>> struct work_struct work;
>> struct gnttab_free_callback callback;
>> struct blk_shadow shadow[BLK_RING_SIZE];
>> + struct llist_head persistent_gnts;
>> + unsigned int persistent_gnts_c;
>> unsigned long shadow_free;
>> unsigned int feature_flush;
>> unsigned int flush_op;
>> @@ -287,21 +297,36 @@ static int blkif_queue_request(struct request *req)
>> unsigned long id;
>> unsigned int fsect, lsect;
>> int i, ref;
>> +
>> + /*
>> + * Used to store if we are able to queue the request by just using
>> + * existing persistent grants (0), or if we have to get new grants,
>
> What does the zero mean?

Frankly, no idea, I guess it was in Oliver's patch and I missed to spot it.

>> + * as there are not sufficiently many free.
>> + */
>> + int new_persistent_gnts;
>
> I think this can be a bool?

I agree.

>> grant_ref_t gref_head;
>> + struct page *granted_page;
>> + struct grant *gnt_list_entry = NULL;
>> struct scatterlist *sg;
>>
>> if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
>> return 1;
>>
>> - if (gnttab_alloc_grant_references(
>> - BLKIF_MAX_SEGMENTS_PER_REQUEST, &gref_head) < 0) {
>> - gnttab_request_free_callback(
>> - &info->callback,
>> - blkif_restart_queue_callback,
>> - info,
>> - BLKIF_MAX_SEGMENTS_PER_REQUEST);
>> - return 1;
>> - }
>> + /* Check if we have enought grants to allocate a requests */
>> + if (info->persistent_gnts_c < BLKIF_MAX_SEGMENTS_PER_REQUEST) {
>> + new_persistent_gnts = 1;
>> + if (gnttab_alloc_grant_references(
>> + BLKIF_MAX_SEGMENTS_PER_REQUEST - info->persistent_gnts_c,
>> + &gref_head) < 0) {
>> + gnttab_request_free_callback(
>> + &info->callback,
>> + blkif_restart_queue_callback,
>> + info,
>> + BLKIF_MAX_SEGMENTS_PER_REQUEST);
>> + return 1;
>> + }
>> + } else
>> + new_persistent_gnts = 0;
>>
>> /* Fill out a communications ring structure. */
>> ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
>> @@ -341,18 +366,73 @@ static int blkif_queue_request(struct request *req)
>> BLKIF_MAX_SEGMENTS_PER_REQUEST);
>>
>> for_each_sg(info->sg, sg, ring_req->u.rw.nr_segments, i) {
>> - buffer_mfn = pfn_to_mfn(page_to_pfn(sg_page(sg)));
>> fsect = sg->offset >> 9;
>> lsect = fsect + (sg->length >> 9) - 1;
>> - /* install a grant reference. */
>> - ref = gnttab_claim_grant_reference(&gref_head);
>> - BUG_ON(ref == -ENOSPC);
>>
>> - gnttab_grant_foreign_access_ref(
>> - ref,
>> + if (info->persistent_gnts_c) {
>> + BUG_ON(llist_empty(&info->persistent_gnts));
>> + gnt_list_entry = llist_entry(
>> + llist_del_first(&info->persistent_gnts),
>> + struct grant, node);
>> +
>> + ref = gnt_list_entry->gref;
>> + buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
>> + info->persistent_gnts_c--;
>> + } else {
>> + ref = gnttab_claim_grant_reference(&gref_head);
>> + BUG_ON(ref == -ENOSPC);
>> +
>> + gnt_list_entry =
>> + kmalloc(sizeof(struct grant),
>> + GFP_ATOMIC);
>> + if (!gnt_list_entry)
>> + return -ENOMEM;
>> +
>> + granted_page = alloc_page(GFP_ATOMIC);
>> + if (!granted_page) {
>> + kfree(gnt_list_entry);
>> + return -ENOMEM;
>> + }
>> +
>> + gnt_list_entry->pfn =
>> + page_to_pfn(granted_page);
>> + gnt_list_entry->gref = ref;
>> +
>> + buffer_mfn = pfn_to_mfn(page_to_pfn(
>> + granted_page));
>> + gnttab_grant_foreign_access_ref(ref,
>> info->xbdev->otherend_id,
>> - buffer_mfn,
>> - rq_data_dir(req));
>> + buffer_mfn, 0);
>> + }
>> +
>> + info->shadow[id].grants_used[i] = gnt_list_entry;
>> +
>> + if (rq_data_dir(req)) {
>> + char *bvec_data;
>> + void *shared_data;
>> +
>> + BUG_ON(sg->offset + sg->length > PAGE_SIZE);
>> +
>> + shared_data = kmap_atomic(
>> + pfn_to_page(gnt_list_entry->pfn));
>> + bvec_data = kmap_atomic(sg_page(sg));
>> +
>> + /*
>> + * this does not wipe data stored outside the
>> + * range sg->offset..sg->offset+sg->length.
>> + * Therefore, blkback *could* see data from
>> + * previous requests. This is OK as long as
>> + * persistent grants are shared with just one
>> + * domain. It may need refactoring if This
> .. this (lowercase it pls)

Done.

>
>> + * changes
>> + */
>> + memcpy(shared_data + sg->offset,
>> + bvec_data + sg->offset,
>> + sg->length);
>> +
>> + kunmap_atomic(bvec_data);
>> + kunmap_atomic(shared_data);
>> + }
>>
>> info->shadow[id].frame[i] = mfn_to_pfn(buffer_mfn);
>> ring_req->u.rw.seg[i] =
>> @@ -368,7 +448,8 @@ static int blkif_queue_request(struct request *req)
>> /* Keep a private copy so we can reissue requests when recovering. */
>> info->shadow[id].req = *ring_req;
>>
>> - gnttab_free_grant_references(gref_head);
>> + if (new_persistent_gnts)
>> + gnttab_free_grant_references(gref_head);
>>
>> return 0;
>> }
>> @@ -480,7 +561,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
>> static void xlvbd_flush(struct blkfront_info *info)
>> {
>> blk_queue_flush(info->rq, info->feature_flush);
>> - printk(KERN_INFO "blkfront: %s: %s: %s\n",
>> + printk(KERN_INFO "blkfront: %s: %s: %s, using persistent grants\n",
>
> HA! By default, eh?

Yes, you caught me, there's a paragraph in the commit message that
explains that we are using persistent grants in the frontend
unconditionally, since the protocol is compatible (you can have a
persistent blkfront and a non-persistent blkback). It simplifies the
logic in blkfront. Are you OK with it?

>> info->gd->disk_name,
>> info->flush_op == BLKIF_OP_WRITE_BARRIER ?
>> "barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ?
>> @@ -707,6 +788,9 @@ static void blkif_restart_queue(struct work_struct *work)
>>
>> static void blkif_free(struct blkfront_info *info, int suspend)
>> {
>> + struct llist_node *all_gnts;
>> + struct grant *persistent_gnt;
>> +
>> /* Prevent new requests being issued until we fix things up. */
>> spin_lock_irq(&info->io_lock);
>> info->connected = suspend ?
>> @@ -714,6 +798,17 @@ static void blkif_free(struct blkfront_info *info, int suspend)
>> /* No more blkif_request(). */
>> if (info->rq)
>> blk_stop_queue(info->rq);
>> +
>> + /* Remove all persistent grants */
>> + if (info->persistent_gnts_c) {
>> + all_gnts = llist_del_all(&info->persistent_gnts);
>> + llist_for_each_entry(persistent_gnt, all_gnts, node) {
>> + gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
>> + kfree(persistent_gnt);
>> + }
>> + info->persistent_gnts_c = 0;
>> + }
>> +
>> /* No more gnttab callback work. */
>> gnttab_cancel_free_callback(&info->callback);
>> spin_unlock_irq(&info->io_lock);
>> @@ -734,13 +829,42 @@ static void blkif_free(struct blkfront_info *info, int suspend)
>>
>> }
>>
>> -static void blkif_completion(struct blk_shadow *s)
>> +static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
>> + struct blkif_response *bret)
>> {
>> int i;
>> - /* Do not let BLKIF_OP_DISCARD as nr_segment is in the same place
>> - * flag. */
>> - for (i = 0; i < s->req.u.rw.nr_segments; i++)
>> - gnttab_end_foreign_access(s->req.u.rw.seg[i].gref, 0, 0UL);
>> + struct bio_vec *bvec;
>> + struct req_iterator iter;
>> + unsigned long flags;
>> + char *bvec_data;
>> + void *shared_data;
>> + unsigned int offset = 0;
>> +
>> + if (bret->operation == BLKIF_OP_READ) {
>> + /*
>> + * Copy the data received from the backend into the bvec.
>> + * Since bv_len can be different from PAGE_SIZE, we need to
>> + * be sure we are actually copying the data from the right
>> + * shared page.
>> + */
>> + rq_for_each_segment(bvec, s->request, iter) {
>> + BUG_ON((bvec->bv_offset + bvec->bv_len) > PAGE_SIZE);
>> + i = offset >> PAGE_SHIFT;
>
> Could you also include a comment about the bug you found here, pls?

There's a comment before the rq_for_each_segment loop, that tries to
explain that, do you think the following is better?

/*
* Copy the data received from the backend into the bvec.
* Since bv_offset can be different than 0, and bv_len different
* than PAGE_SIZE, we have to keep track of the current offset,
* to be sure we are copying the data from the right shared page.
*/

>> + shared_data = kmap_atomic(
>> + pfn_to_page(s->grants_used[i]->pfn));
>> + bvec_data = bvec_kmap_irq(bvec, &flags);
>> + memcpy(bvec_data, shared_data + bvec->bv_offset,
>> + bvec->bv_len);
>> + bvec_kunmap_irq(bvec_data, &flags);
>> + kunmap_atomic(shared_data);
>> + offset += bvec->bv_len;
>> + }
>> + }
>> + /* Add the persistent grant into the list of free grants */
>> + for (i = 0; i < s->req.u.rw.nr_segments; i++) {
>> + llist_add(&s->grants_used[i]->node, &info->persistent_gnts);
>> + info->persistent_gnts_c++;
>> + }
>> }
>>
>> static irqreturn_t blkif_interrupt(int irq, void *dev_id)
>> @@ -783,7 +907,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
>> req = info->shadow[id].request;
>>
>> if (bret->operation != BLKIF_OP_DISCARD)
>> - blkif_completion(&info->shadow[id]);
>> + blkif_completion(&info->shadow[id], info, bret);
>>
>> if (add_id_to_freelist(info, id)) {
>> WARN(1, "%s: response to %s (id %ld) couldn't be recycled!\n",
>> @@ -942,6 +1066,11 @@ again:
>> message = "writing protocol";
>> goto abort_transaction;
>> }
>> + err = xenbus_printf(xbt, dev->nodename,
>> + "feature-persistent-grants", "%d", 1);
>
> So its %u in blkback, but %d in here? Which one should it be?

%u in both places.

>> + if (err)
>> + dev_warn(&dev->dev,
>> + "writing persistent grants feature to xenbus");
>>
>> err = xenbus_transaction_end(xbt, 0);
>> if (err) {
>> @@ -1029,6 +1158,8 @@ static int blkfront_probe(struct xenbus_device *dev,
>> spin_lock_init(&info->io_lock);
>> info->xbdev = dev;
>> info->vdevice = vdevice;
>> + init_llist_head(&info->persistent_gnts);
>> + info->persistent_gnts_c = 0;
>> info->connected = BLKIF_STATE_DISCONNECTED;
>> INIT_WORK(&info->work, blkif_restart_queue);
>>
>> @@ -1093,7 +1224,7 @@ static int blkif_recover(struct blkfront_info *info)
>> req->u.rw.seg[j].gref,
>> info->xbdev->otherend_id,
>> pfn_to_mfn(info->shadow[req->u.rw.id].frame[j]),
>> - rq_data_dir(info->shadow[req->u.rw.id].request));
>> + 0);
>> }
>> info->shadow[req->u.rw.id].req = *req;
>>
>> --
>> 1.7.7.5 (Apple Git-26)
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/