Re: [PATCH] mm/zswap: add writethrough option

From: Dan Streetman
Date: Wed Jan 22 2014 - 09:20:27 EST


On Fri, Jan 17, 2014 at 12:41 AM, Dan Streetman <ddstreet@xxxxxxxx> wrote:
> On Wed, Jan 15, 2014 at 12:42 AM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
>> Hello,
>>
>> On Tue, Jan 14, 2014 at 10:10:44AM -0500, Dan Streetman wrote:
>>> On Mon, Jan 13, 2014 at 7:11 PM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
>>> > Hello Dan,
>>> >
>>> > Sorry for the late response and I didn't look at the code yet
>>> > because I am not convinced. :(
>>> >
>>> > On Thu, Dec 19, 2013 at 08:23:27AM -0500, Dan Streetman wrote:
>>> >> Currently, zswap is a writeback cache; stored pages are not sent
>>> >> to the swap disk, and when zswap wants to evict old pages it must
>>> >> first write them back to the swap cache/disk manually. This avoids
>>> >> swap-out disk I/O up front, but only moves that disk I/O to the
>>> >> writeback case (for pages that are evicted), and adds the overhead
>>> >> of having to uncompress the evicted pages and the need for an
>>> >> additional free page (to store the uncompressed page).
>>> >>
>>> >> This optionally changes zswap to a writethrough cache by enabling
>>> >> frontswap_writethrough() before registering, so that any
>>> >> successful page store will also be written to the swap disk. The
>>> >> default remains writeback. To enable writethrough, the param
>>> >> zswap.writethrough=1 must be used at boot.
>>> >>
>>> >> Whether writeback or writethrough will provide better performance
>>> >> depends on many factors including disk I/O speed/throughput,
>>> >> CPU speed(s), system load, etc. In most cases it is likely
>>> >> that writeback has better performance than writethrough before
>>> >> zswap is full, but after zswap fills up writethrough has
>>> >> better performance than writeback.
>>> >
>>> > So you claim we should use writeback by default but writethrough
>>> > after the memory limit is full?
>>> > But that would break LRU ordering, and I think a better idea is to
>>> > handle it in a more generic way rather than changing the entire policy.
>>>
>>> This patch only adds the option of using writethrough. That's all.
>>
>> The point is: please explain what problem you are seeing now, and
>> show that adding a new option is the best way to solve it.
>> Just "optionally having it is better" is not a good approach to merge/maintain.
>
> You may have missed the earlier emails discussing all that, so to
> recap it appears that writeback is (usually) faster before zswap is
> full, while writethrough appears (usually) slightly faster after zswap
> has filled up. It's highly dependent on the actual system details
> (cpu speed, IO speed, load, etc) though.
>
>>
>>>
>>> > Now, zswap evicts just *a* page at a time rather than a bunch of
>>> > pages, so it stalls on every store if many swap writes happen
>>> > continuously. That's not efficient, so how about adding kswapd's
>>> > threshold concept of min/low/high watermarks? Then it could start
>>> > evicting early, before the zswap memory pool is full, and stop once
>>> > it drops back below the high watermark. I guess that could be better
>>> > than what we have now.
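>>> >
>>> > (A rough sketch of the idea; these watermarks and helpers are purely
>>> > illustrative, nothing like this exists in zswap today:)
>>> >
>>> > /* hypothetical watermarks, as a percentage of the pool limit */
>>> > static unsigned int zswap_wmark_high_pct = 90; /* start evicting */
>>> > static unsigned int zswap_wmark_low_pct = 75;  /* evict down to here */
>>> >
>>> > static bool zswap_pool_above(unsigned long pool_pages,
>>> >                              unsigned long pool_limit_pages,
>>> >                              unsigned int wmark_pct)
>>> > {
>>> >         return pool_pages * 100 > pool_limit_pages * wmark_pct;
>>> > }
>>> >
>>> > /* a store would then wake a background evictor above the high
>>> >  * watermark and let it evict batches until below the low watermark,
>>> >  * instead of synchronously reclaiming one page per store */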
>>>
>>> Well, I don't think that's related to this patch, but certainly a good idea to
>>> investigate.
>>
>> The reason I suggested it is that, from your description, I get the
>> feeling writeback is slower than writethrough simply because zswap's
>> memory is a limited pool.
>
> Evicting pages early doesn't avoid the overhead of having to
> decompress the pages, nor does it avoid having to write them to disk,
> so I don't think it relates directly to this patch adding the
> writethrough option. To make that concrete, see the sketch below.
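>
> (A toy userspace C model, not kernel code, of where the disk I/O and
> decompression costs land under each policy; the workload counts are
> arbitrary:)
>
> #include <stdio.h>
>
> enum mode { WRITEBACK, WRITETHROUGH };
>
> struct stats { long store_io, evict_io, decompressions; };
>
> static void store_page(enum mode m, struct stats *s)
> {
>         /* both modes compress into the pool; no I/O modeled for that */
>         if (m == WRITETHROUGH)
>                 s->store_io++;  /* page is also written to swap disk now */
> }
>
> static void evict_page(enum mode m, struct stats *s)
> {
>         if (m == WRITEBACK) {
>                 s->decompressions++; /* needs a free page to decompress into */
>                 s->evict_io++;       /* then the page is written to swap */
>         }
>         /* writethrough: the on-disk copy is already valid, just drop */
> }
>
> int main(void)
> {
>         long stores = 1000, evictions = 400; /* arbitrary workload */
>         const char *names[] = { "writeback", "writethrough" };
>
>         for (int m = WRITEBACK; m <= WRITETHROUGH; m++) {
>                 struct stats s = { 0 };
>                 for (long i = 0; i < stores; i++)
>                         store_page(m, &s);
>                 for (long i = 0; i < evictions; i++)
>                         evict_page(m, &s);
>                 printf("%s: store I/O=%ld, evict I/O=%ld, decompressions=%ld\n",
>                        names[m], s.store_io, s.evict_io, s.decompressions);
>         }
>         return 0;
> }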
>
>>
>>>
>>> >
>>> > Other point: as I read device-mapper/cache.txt, the cache operating
>>> > mode already supports writethrough. That means zram can support
>>> > writeback/writethrough with dm-cache.
>>> > Have you tried it? Is there any problem?
>>>
>>> zswap isn't a block device though, so that doesn't apply (unless I'm
>>> missing something).
>>
>> zram is a block device, so you are free to use it as a swap block
>> device, and binding it with dm-cache will give you what you want.
>> The whole point is that we may already be able to do what you want
>> without adding something new, so I hope you show what the problem is
>> with the existing solution, so that we can judge it and try to solve
>> the pain point with a more ideal approach.
>
> Sorry, it seems like you're saying "you can drop zswap and start using
> zram, so this patch isn't needed", which doesn't really address this
> patch, I don't think. zswap vs. zram isn't an argument I'm trying to
> get into right now.
>
>>
>>>
>>> >
>>> > Actually, I really don't know how much benefit there is in
>>> > in-memory swap overflowing to real storage, but if you want it,
>>> > zram with dm-cache is another option, rather than reinventing the
>>> > wheel because "just having it is better".
>>>
>>> I'm not sure this patch is related to the zswap vs. zram discussions.
>>> It only adds the option of using writethrough to zswap. It's a first
>>> step toward possibly making zswap work more efficiently using
>>> writeback and/or writethrough depending on the system and conditions.
>>
>> The patch is small. Okay, I don't want to be a party-pooper, but at
>> the least I should state my thoughts so Andrew can judge.
>
> Sure, I'm glad to have your suggestions.

To give this a bump - Andrew do you have any concerns about this
patch? Or can you pick this up?

>
>>
>>>
>>> >
>>> > Thanks.
>>> >
>>> >>
>>> >> Signed-off-by: Dan Streetman <ddstreet@xxxxxxxx>
>>> >>
>>> >> ---
>>> >>
>>> >> Based on specjbb testing on my laptop, the results for both writeback
>>> >> and writethrough are better than not using zswap at all, but writeback
>>> >> does seem to be better than writethrough while zswap isn't full. Once
>>> >> it fills up, writethrough performance is essentially the same as not
>>> >> using zswap, while writeback seems to be worse than not using zswap.
>>> >> However, I think more testing on a wider span of systems and conditions
>>> >> is needed. Additionally, I'm not sure that specjbb is measuring true
>>> >> performance under fully loaded cpu conditions, so additional cpu load
>>> >> might need to be added or specjbb parameters modified (I took the
>>> >> values from the 4 "warehouses" test run).
>>> >>
>>> >> In any case though, I think having writethrough as an option is still
>>> >> useful. More changes could be made, such as switching from writeback
>>> >> to writethrough based on how full zswap is (a sketch of that idea
>>> >> follows below). And the patch doesn't change default behavior -
>>> >> writethrough must be specifically enabled.
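>>> >>
>>> >> (A hypothetical sketch of that dynamic policy - none of these names
>>> >> exist in zswap. Note the transition needs care: entries stored while
>>> >> in writeback mode have no disk copy yet, so they couldn't simply be
>>> >> dropped on eviction after switching:)
>>> >>
>>> >> /* illustrative cutoff: act as writethrough above 90% pool usage */
>>> >> static bool zswap_prefer_writethrough(unsigned long pool_pages,
>>> >>                                       unsigned long pool_limit_pages)
>>> >> {
>>> >>         return pool_pages * 100 > pool_limit_pages * 90;
>>> >> }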
>>> >>
>>> >> The percentage-normalized numbers I got from specjbb on average, using
>>> >> the default 20% max_pool_percent and varying the amount of heap used:
>>> >>
>>> >> heap (% of ram) | no zswap | writeback | writethrough
>>> >>              75 |    93.08 |    100.00 |        96.90
>>> >>              87 |    96.58 |     95.58 |        96.72
>>> >>             100 |    92.29 |     89.73 |        86.75
>>> >>             112 |    63.80 |     38.66 |        19.66
>>> >>             125 |     4.79 |     29.90 |        15.75
>>> >>             137 |     4.99 |      4.50 |         4.75
>>> >>             150 |     4.28 |      4.62 |         5.01
>>> >>             162 |     5.20 |      2.94 |         4.66
>>> >>             175 |     5.71 |      2.11 |         4.84
>>> >>
>>> >>
>>> >>
>>> >> mm/zswap.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
>>> >> 1 file changed, 64 insertions(+), 4 deletions(-)
>>> >>
>>> >> diff --git a/mm/zswap.c b/mm/zswap.c
>>> >> index e55bab9..2f919db 100644
>>> >> --- a/mm/zswap.c
>>> >> +++ b/mm/zswap.c
>>> >> @@ -61,6 +61,8 @@ static atomic_t zswap_stored_pages = ATOMIC_INIT(0);
>>> >> static u64 zswap_pool_limit_hit;
>>> >> /* Pages written back when pool limit was reached */
>>> >> static u64 zswap_written_back_pages;
>>> >> +/* Pages evicted when pool limit was reached */
>>> >> +static u64 zswap_evicted_pages;
>>> >> /* Store failed due to a reclaim failure after pool limit was reached */
>>> >> static u64 zswap_reject_reclaim_fail;
>>> >> /* Compressed page was too big for the allocator to (optimally) store */
>>> >> @@ -89,6 +91,10 @@ static unsigned int zswap_max_pool_percent = 20;
>>> >> module_param_named(max_pool_percent,
>>> >> zswap_max_pool_percent, uint, 0644);
>>> >>
>>> >> +/* Writeback/writethrough mode (fixed at boot for now) */
>>> >> +static bool zswap_writethrough;
>>> >> +module_param_named(writethrough, zswap_writethrough, bool, 0444);
>>> >> +
>>> >> /*********************************
>>> >> * compression functions
>>> >> **********************************/
>>> >> @@ -629,6 +635,48 @@ end:
>>> >> }
>>> >>
>>> >> /*********************************
>>> >> +* evict code
>>> >> +**********************************/
>>> >> +
>>> >> +/*
>>> >> + * This evicts pages that have already been written through to swap.
>>> >> + */
>>> >> +static int zswap_evict_entry(struct zbud_pool *pool, unsigned long handle)
>>> >> +{
>>> >> + struct zswap_header *zhdr;
>>> >> + swp_entry_t swpentry;
>>> >> + struct zswap_tree *tree;
>>> >> + pgoff_t offset;
>>> >> + struct zswap_entry *entry;
>>> >> +
>>> >> + /* extract swpentry from data */
>>> >> + zhdr = zbud_map(pool, handle);
>>> >> + swpentry = zhdr->swpentry; /* copied while handle is mapped */
>>> >> + zbud_unmap(pool, handle);
>>> >> + tree = zswap_trees[swp_type(swpentry)];
>>> >> + offset = swp_offset(swpentry);
>>> >> + BUG_ON(pool != tree->pool);
>>> >> +
>>> >> + /* find and ref zswap entry */
>>> >> + spin_lock(&tree->lock);
>>> >> + entry = zswap_rb_search(&tree->rbroot, offset);
>>> >> + if (!entry) {
>>> >> + /* entry was invalidated */
>>> >> + spin_unlock(&tree->lock);
>>> >> + return 0;
>>> >> + }
>>> >> +
>>> >> + zswap_evicted_pages++;
>>> >> +
>>> >> + zswap_rb_erase(&tree->rbroot, entry);
>>> >> + zswap_entry_put(tree, entry);
>>> >> +
>>> >> + spin_unlock(&tree->lock);
>>> >> +
>>> >> + return 0;
>>> >> +}
>>> >> +
>>> >> +/*********************************
>>> >> * frontswap hooks
>>> >> **********************************/
>>> >> /* attempts to compress and store an single page */
>>> >> @@ -744,7 +792,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
>>> >> spin_lock(&tree->lock);
>>> >> entry = zswap_entry_find_get(&tree->rbroot, offset);
>>> >> if (!entry) {
>>> >> - /* entry was written back */
>>> >> + /* entry was written back or evicted */
>>> >> spin_unlock(&tree->lock);
>>> >> return -1;
>>> >> }
>>> >> @@ -778,7 +826,7 @@ static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset)
>>> >> spin_lock(&tree->lock);
>>> >> entry = zswap_rb_search(&tree->rbroot, offset);
>>> >> if (!entry) {
>>> >> - /* entry was written back */
>>> >> + /* entry was written back or evicted */
>>> >> spin_unlock(&tree->lock);
>>> >> return;
>>> >> }
>>> >> @@ -813,18 +861,26 @@ static void zswap_frontswap_invalidate_area(unsigned type)
>>> >> zswap_trees[type] = NULL;
>>> >> }
>>> >>
>>> >> -static struct zbud_ops zswap_zbud_ops = {
>>> >> +static struct zbud_ops zswap_zbud_writeback_ops = {
>>> >> .evict = zswap_writeback_entry
>>> >> };
>>> >> +static struct zbud_ops zswap_zbud_writethrough_ops = {
>>> >> + .evict = zswap_evict_entry
>>> >> +};
>>> >>
>>> >> static void zswap_frontswap_init(unsigned type)
>>> >> {
>>> >> struct zswap_tree *tree;
>>> >> + struct zbud_ops *ops;
>>> >>
>>> >> tree = kzalloc(sizeof(struct zswap_tree), GFP_KERNEL);
>>> >> if (!tree)
>>> >> goto err;
>>> >> - tree->pool = zbud_create_pool(GFP_KERNEL, &zswap_zbud_ops);
>>> >> + if (zswap_writethrough)
>>> >> + ops = &zswap_zbud_writethrough_ops;
>>> >> + else
>>> >> + ops = &zswap_zbud_writeback_ops;
>>> >> + tree->pool = zbud_create_pool(GFP_KERNEL, ops);
>>> >> if (!tree->pool)
>>> >> goto freetree;
>>> >> tree->rbroot = RB_ROOT;
>>> >> @@ -875,6 +931,8 @@ static int __init zswap_debugfs_init(void)
>>> >> zswap_debugfs_root, &zswap_reject_compress_poor);
>>> >> debugfs_create_u64("written_back_pages", S_IRUGO,
>>> >> zswap_debugfs_root, &zswap_written_back_pages);
>>> >> + debugfs_create_u64("evicted_pages", S_IRUGO,
>>> >> + zswap_debugfs_root, &zswap_evicted_pages);
>>> >> debugfs_create_u64("duplicate_entry", S_IRUGO,
>>> >> zswap_debugfs_root, &zswap_duplicate_entry);
>>> >> debugfs_create_u64("pool_pages", S_IRUGO,
>>> >> @@ -919,6 +977,8 @@ static int __init init_zswap(void)
>>> >> pr_err("per-cpu initialization failed\n");
>>> >> goto pcpufail;
>>> >> }
>>> >> + if (zswap_writethrough)
>>> >> + frontswap_writethrough(true);
>>> >> frontswap_register_ops(&zswap_frontswap_ops);
>>> >> if (zswap_debugfs_init())
>>> >> pr_warn("debugfs initialization failed\n");
>>> >> --
>>> >> 1.8.3.1
>>> >>
>>> >
>>> > --
>>> > Kind regards,
>>> > Minchan Kim
>>>
>>
>> --
>> Kind regards,
>> Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/