Re: [PATCH v3] zram: support REQ_DISCARD

From: Andrew Morton
Date: Wed Mar 12 2014 - 16:33:33 EST


On Wed, 12 Mar 2014 17:01:09 +0900 Joonsoo Kim <iamjoonsoo.kim@xxxxxxx> wrote:

> zram is a ram-based block device and can be used as the backing device of
> a filesystem. When a filesystem deletes a file, it normally doesn't touch
> the file's data blocks; it just updates the file's metadata. That
> behaviour is fine on a disk-based block device, but it is a problem on a
> ram-based block device, since we can't free the memory used for the data
> blocks. To overcome this disadvantage, there is the REQ_DISCARD
> functionality: if the block device supports REQ_DISCARD and the
> filesystem is mounted with the discard option, the filesystem sends
> REQ_DISCARD to the block device whenever data blocks are discarded. All
> we have to do is handle this request.
>
> This patch sets QUEUE_FLAG_DISCARD on the queue and handles REQ_DISCARD
> requests. With it, we can free the memory used by zram when it is no
> longer needed.
>
> ...
>
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -541,6 +541,33 @@ static int zram_bvec_rw(struct zram *zram, struct bio_vec *bvec, u32 index,
>  	return ret;
>  }
> 
> +static void zram_bio_discard(struct zram *zram, u32 index,
> +			     int offset, struct bio *bio)

A little bit of documentation here wouldn't hurt. "index" and "offset"
are pretty vague identifiers. What do these args represent, and what
are their units?
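
Something like this, perhaps - the units are my guess from reading
__zram_make_request(), so please fix it up if I've misread:

	/**
	 * zram_bio_discard - handle a REQ_DISCARD bio
	 * @zram: zram device the request is for
	 * @index: index of the first page touched by the discard, in
	 *	units of PAGE_SIZE-sized zram pages
	 * @offset: byte offset of the start of the discard within that page
	 * @bio: the discard bio itself
	 */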

> +{
> +	size_t n = bio->bi_iter.bi_size;
> +
> +	/*
> +	 * On some architectures PAGE_SIZE is larger than the 4096-byte
> +	 * logical block size, so a logical-block-aligned request may not
> +	 * be PAGE_SIZE-aligned.  Handle such a misaligned head of the
> +	 * range here by skipping the partial first page.
> +	 */
> +	if (offset) {
> +		if (n <= (PAGE_SIZE - offset))
> +			return;
> +
> +		n -= (PAGE_SIZE - offset);
> +		index++;
> +	}
> +
> +	while (n >= PAGE_SIZE) {
> +		write_lock(&zram->meta->tb_lock);
> +		zram_free_page(zram, index);
> +		write_unlock(&zram->meta->tb_lock);
> +		index++;
> +		n -= PAGE_SIZE;
> +	}

We could take the lock a single time rather than once per page. Was
there a reason for doing it this way? If so, that should be documented
as well, please - there is no way a reader can know the reason from this
code.
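
ie, something like this (untested; and if the per-page locking was chosen
deliberately, say to bound lock hold times on large discards, a comment
saying so would be enough):

	write_lock(&zram->meta->tb_lock);
	while (n >= PAGE_SIZE) {
		zram_free_page(zram, index);
		index++;
		n -= PAGE_SIZE;
	}
	write_unlock(&zram->meta->tb_lock);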


> +}
> +
>  static void zram_reset_device(struct zram *zram, bool reset_capacity)
>  {
>  	size_t index;
> @@ -676,6 +703,12 @@ static void __zram_make_request(struct zram *zram, struct bio *bio)
>  	offset = (bio->bi_iter.bi_sector &
>  		  (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT;
>
> +	if (unlikely(bio->bi_rw & REQ_DISCARD)) {
> +		zram_bio_discard(zram, index, offset, bio);
> +		bio_endio(bio, 0);
> +		return;
> +	}
> +
>  	bio_for_each_segment(bvec, bio, iter) {
>  		int max_transfer_size = PAGE_SIZE - offset;
>
> @@ -845,6 +878,17 @@ static int create_device(struct zram *zram, int device_id)
>  					ZRAM_LOGICAL_BLOCK_SIZE);
>  	blk_queue_io_min(zram->disk->queue, PAGE_SIZE);
>  	blk_queue_io_opt(zram->disk->queue, PAGE_SIZE);
> +	zram->disk->queue->limits.discard_granularity = PAGE_SIZE;
> +	zram->disk->queue->limits.max_discard_sectors = UINT_MAX;
> +	/*
> +	 * We will skip discarding mis-aligned ranges, so we can't ensure
> +	 * that a discarded region reads back as zeroes.
> +	 */

That's a bit hard to follow. What is it that is misaligned, relative
to what?

And where does this skipping occur? zram_bio_discard() avoids
discarding partial pages at the start and end of the bio (I think). Is
that what we're referring to here? If so, what about the complete
pages between the two partial pages - they are zeroed on read. Will
the code end up having to rezero those?

As you can tell, I'm struggling to understand what's going on here ;)
Some additional description of how it all works would be nice, preferably
as code comments so the information is permanent.
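
If I've understood it correctly, a comment along these lines (corrected as
needed) would cover it:

	/*
	 * zram_bio_discard() frees only whole pages and leaves partial
	 * pages at either end of the range untouched.  When
	 * ZRAM_LOGICAL_BLOCK_SIZE == PAGE_SIZE every discard is
	 * page-aligned, all covered pages are freed and read back as
	 * zeroes, so discard_zeroes_data can be set.  With larger pages
	 * (e.g. 64K pages and 4K logical blocks) a block-aligned discard
	 * can cover only part of a page; such partial pages are skipped,
	 * so no zeroing guarantee can be made.
	 */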

> +	if (ZRAM_LOGICAL_BLOCK_SIZE == PAGE_SIZE)
> +		zram->disk->queue->limits.discard_zeroes_data = 1;
> +	else
> +		zram->disk->queue->limits.discard_zeroes_data = 0;
> +	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, zram->disk->queue);
> 
>  	add_disk(zram->disk);

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/