Re: [RFC PATCH] f2fs: add extent cache base on rb-tree

From: Jaegeuk Kim
Date: Tue Jan 06 2015 - 18:00:49 EST


Hi Chao,

On Sun, Jan 04, 2015 at 04:24:10PM +0800, Chao Yu wrote:
> Hi Jaegeuk,
>
> > -----Original Message-----
> > From: Jaegeuk Kim [mailto:jaegeuk@xxxxxxxxxx]
> > Sent: Wednesday, December 31, 2014 4:26 PM
> > To: Chao Yu
> > Cc: 'Changman Lee'; linux-f2fs-devel@xxxxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> > Subject: Re: [RFC PATCH] f2fs: add extent cache base on rb-tree
> >
> > Hi Chao,
> >
> > On Tue, Dec 30, 2014 at 06:10:21PM +0800, Chao Yu wrote:
> > > Hi Jaegeuk,
> > >
> > > > -----Original Message-----
> > > > From: Jaegeuk Kim [mailto:jaegeuk@xxxxxxxxxx]
> > > > Sent: Tuesday, December 30, 2014 5:23 AM
> > > > To: Chao Yu
> > > > Cc: 'Changman Lee'; linux-f2fs-devel@xxxxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> > > > Subject: Re: [RFC PATCH] f2fs: add extent cache base on rb-tree
> > > >
> > > > Hi Chao,
> > > >
> > > > On Mon, Dec 29, 2014 at 03:19:18PM +0800, Chao Yu wrote:
> > > >
> > > > [snip]
> > > >
> > > > Nice draft. :)
> > >
> > > Thanks! :)
> > >
> > > >
> > > > >
> > > > > Please see the draft below.
> > > > >
> > > > > 1) Extent management:
> > > > > If we use global management that managing all extents which are from different
> > > > > inodes in sbi, we will face with serious lock contention when we access these
> > > > > extents belong to different inodes concurrently, the loss may outweights the
> > > > > gain.
> > > >
> > > > Agreed.
> > > >
> > > > > So we choose a local management for extent which means all extents are
> > > > > managed by inode itself to avoid above lock contention. Addtionlly, we manage
> > > > > all extents globally by linking all inode into a global lru list for extent
> > > > > cache shrinker.
> > > > > Approach:
> > > > > a) build extent tree/rwlock/lru list/extent count in each inode.
> > > > > *extent tree: link all extent in rb-tree;
> > > > > *rwlock: protect fields when accessing extent cache concurrently;
> > > > > *lru list: sort all extents in accessing time order;
> > > > > *extent count: record total count of extents in cache.
> > > > > b) use lru shrink list in sbi to manage all inode which cached extents.
> > > > > *inode will be added or repostioned in this global list whenever
> > > > > extent is being access in this inode.
> > > > > *use spinlock to protect this shrink list.
> > > >
> > > > 1. How about adding a data structure with inode number instead of referring
> > > > inode pointer?
> > > >
> > > > 2. How about managing extent entries globally and setting an upper bound to
> > > > the number of extent entries instead of limiting them per each inode?
> > > > (The rb-tree will handle many extents per inode.)
> > >
> > > [assumption]
> > > Approach A: global lru list;
> > > Approach B: lru list per inode.
> > >
> > > I have considered Approach A before writing the draft, although Approach A could
> > > provide us higher ratio hit due to global lru, it may make lock contention more
> > > intensively and has more memory overhead descripted below. So I choose more
> > > conservative Approach B.
> > > 1) as extent_entry will split in Cow, we may lock extent_list more times when
> > > move these separated extent entries in extent_list, unless we have batch mode
> > > interface.
> > > 2) lock overhead between shrinker and lookuper and updater.
> > > e.g. extent_list: (LRU) A -> B -> C -> D -> E -> F -> G (MRU)
> > > (#1) has A, C, E, G in rb-tree
> > > (#2) has B, D, F in rb-tree
> > > shrinker op: lock list -> trylock #1 -> unlock list -> free A -> unlock #1
> > > lock list -> trylock #2 -> unlock list -> free B -> unlock #2
> > > lock list -> trylock #1 -> unlock list -> free C -> unlock #1
> > > ...
> >
> > lock_list -> list_del(A) -> unlock_list -> lock #1 -> remove_tree(A) -> unlock #1
> > -> free A.
>
> If unlock_list before lock #1, (A) could be modified by other thread between ->unlock_list
> and lock #1, so one more "lookup_tree(A)" is needed under lock #1:
> lock #1 -> lookup_tree(A) -> remove_tree(A) -> unlock #1
>
> >
> > Or,
> > lock_list -> list_del(A, B, C) -> unlock_list -> lock #1 -> remove_tree(A, C) ->
> > unlock #1 -> free A, C -> lock #2 -> remove_tree(B) -> unlock #2
>
> ditto.
> To avoid unneeded lookup_tree(A), maybe we can use this series:
>
> lock list -> get A from head -> trylock #1 -> try to get more node(C, E, G) of #1 in list
> -> unlock list -> remove_tree&free(A, C, E, G) -> unlock #1
>
> >
> > I think a couple of list operations do not cause severe lock contention.
>
> Yeah, batch operation is better.
> But sometimes random write break extent into everywhere in shrinker's list,
> so batch operation can gain less in this condition.
>
> > And, if we assing minimum length for extent caches, the contention would be
> > distributed.
> >
> > This means that, it needs to manage each of all the extents in the cache should
> > have the length larger than minimum value.
>
> If I do not misunderstand, what you mean is that if extent number in one inode cache
> is bigger than minimum value, then, we will start to add all extents of this inode
> into shrinker's list, and reposition them in the list when f2fs_{lookup,update}_extent_cache?
>
> >
> > > *trylock/unlock* for all free extent entries belong to one inode will be separated
> > > to lots of parts, resulting in more contention.
> > > 3) it has more memory overhead in each extent entry:
> > > Approach A: need add inode number for backref and list_head *
> >
> > It doesn't need to add ino. Just remain them, and shrinker can remove empty
> > ino_entry_for_extents periodically.
>
> I do not understand why we don't need ino, :(.
> Without ino, how can we get the rb-tree root pointer for rb_erase(node, root)?

Let me describe in more detail.

For example,

- global radix tree in sbi
--> ino #1 --> rb_tree (rb_tree, rb_tree_lock, refcount=0)
--> ino #2 --> rb_tree
--> ino #3 --> rb_tree
`-> extent A
`-> extent B (fofs, blkaddr, length, *list_head=empty)

- global extent list in sbi

extent_list: (LRU) A -> B -> C -> D -> E -> F -> G (MRU)
(#1) has A, C, E, G in rb-tree
(#2) has B, D, F in rb-tree

1) update extents (#1)

- lock_radix_tree
if (notfound #1)
allocate #1;
#1's refcount++;
- unlock_radix_tree

- lock_rb_tree (#1)
update A
rb_delete E, if E->length is smaller than MIN_LEN.
allocate & rb_add K, if K->length is larget than MIN_LEN.

list_lock
list_move(A)
list_del(E)
list_add_tail(K)
list_unlock
- unlock_rb_tree (#1)

- #1's rafcount--;

2) shrinker

- list_lock
list_del A, B, C, ...
- list_unlock

- gang_look_up_radix_tree_with_lock {

#N's refcount++;

lock_rb_tree (#N)
for_each_rb_node {
if (list_empty(extent->list_head)) {
remove_rb_node(extent);
free(extent);
}
}
unlock_rb_tree (#N)

#N's refcount--;
- }

- for_each_radix_tree_with_lock {
if (radix_node->refcount == 0 && rb_tree_is_empty)
free(radix_node);
- }

3) lookup extents (#1)

- lock_radix_tree
if (notfound #1)
goto out;
#1's refcount++;
- unlock_radix_tree

- lock_rb_tree (#1)
found extent A

list_lock
list_move(A)
list_unlock
- unlock_rb_tree (#1)

- #1's rafcount--;


>
> >
> > > Approach B: need add list_head *
> > >
> > > Or maybe it's not a problem because we can afford all these above.
> > >
> > > Anyway, I'm not against.
> > >
> > > >
> > > > 3. It needs to set a minimum length for the candidate of extent cache.
> > > > (e.g., 64)
> > >
> > > Is this used for shrinker? Can you please explain more about this minimum length?
> > >
> > > >
> > > > So, for example,
> > > > struct ino_entry_for_extents {
> > > > inode number;
> > > > rb_tree for extent_entry objects;
> > > > rwlock;
> > > > };
> > > >
> > > > struct extent_entry {
> > > > blkaddr, len;
> > > > list_head *;
> > >
> > > We should add one other field 'inode number' here, as through it we can get
> > > rb-tree root from ino_entry_for_extents for rb_erase when we free the extent
> > > entry in shrinker.
> > >
> > > > };
> > > >
> > > > Something like this.
> > > >
> > > > [A, B, C, ... are extent entry]
> > > >
> > > > The sbi has
> > > > 1. an extent_list: (LRU) A -> B -> C -> D -> E -> F -> G (MRU)
> > > > 2. radix_tree: ino_entry_for_extents (#10) has D, B in rb-tree
> > > > ` ino_entry_for_extents (#11) has A, C in rb-tree
> > > > ` ino_entry_for_extents (#12) has F in rb-tree
> > > > ` ino_entry_for_extents (#13) has G, E in rb-tree
> > > >
> > > > In f2fs_update_extent_cache and __get_data_block for #10,
> > > > ino_entry_for_extents (#10) was founded and updated D or B.
> > > > Then, updated entries are moved to MRU.
> > > >
> > > > In f2fs_evict_inode for #11, A and C are moved to LRU.
> > > > But, if this inode is unlinked, all the A, C, and ino_entry_for_extens (#11)
> > > > should be released.
> > > >
> > > > In f2fs_balance_fs_bg, some LRU extents are released according to the amount
> > > > of consumed memory. Then, it frees any ino_entry_for_extents having no extent.
> > > >
> > > > IMO, we don't need to consider readahead for this, since get_data_block will
> > > > be called by VFS readahead.
> > >
> > > In get_data_block we can cache only one blkaddr once after get_dnode_of_data
> > > returned, it seems less efficient. So why not readahead selectively more
> > > contiguous valid blkaddr in get_dnode_of_data once?
> >
> > Why do you think only one blkaddr?
> > get_data_block searches extents as many as possible.
>
> Surely not one blkaddr.
>
> Actually, what I mean is that:
> ->get_dnode_of_data
> ->f2fs_ra_extent_cache()
>
> f2fs_ra_extent_cache(node_page, offset) {
> for (i = offset; i < max; i++) {
> if (is_valid() && is_contiguous())
> len++;
> }
> f2fs_update_extent_cache(inode, fofs, blk_addr, len)
> }
>
> But not:
> ->__get_data_block
> ->get_dnode_of_data
> ->f2fs_update_extent_cache(inode, fofs, blk_addr, 1);
>
> Thanks,
> Yu
>
> >
> > Thanks,
> >
> > >
> > > >
> > > > Furthermore, we need to think about whether LRU is really best or not.
> > > > IMO, the extent cache aims to improve second access speed, rather than initial
> > > > cold misses. So, maybe MRU or another algorithms would be better.
> > >
> > > Agree, let's rethink about this. :)
> > >
> > > Thanks,
> > >
> > > >
> > > > Thanks,
> > > >
> > > > >
> > > > > 2) Limitation:
> > > > > In one inode, as we split or add extent in extent cache when read/write, extent
> > > > > number will enlarge, so memory and CPU overhead will increase.
> > > > > In order to control the overhead of memory and CPU, we try to set a upper bound
> > > > > number to limit total extent number in each inode, This number is global
> > > > > configuration which is visable to all inode. This number will be exported to
> > > > > sysfs for configuring according to requirement of user. By default, designed
> > > > > number is 8.
> > > > >
> > > > > 3) Shrinker:
> > > > > There are two shrink paths:
> > > > > a) one is triggered when extent count has exceed the upper bound of
> > > > > inode's extent cache. We will try to release extent(s) from head of
> > > > > inode's inner extent lru list until extent count is equal to upper bound.
> > > > > This operation could be in f2fs_update_extent_cache().
> > > > > b) the other one is triggered when memory util exceed threshold, we try
> > > > > get inode from head of global lru list(s), and release extent(s) with
> > > > > fixed number (by default: 64 extents) in inode one by one.
> > > > > This operation could be in f2fs_balance_fs_bg().
> > > > >
> > > > > 4) Revalidation:
> > > > > In ->evict(), extent cache will be released, in order to reuse extent cache
> > > > > of inode when reopen for high hit ratio, a proper way is to add cached extent
> > > > > tree into a hash table (could be radix tree), revalidate extent tree and recover
> > > > > it to inode when reopen.
> > > > > Besides, a global lru list is needed here for shrinker.
> > > > >
> > > > > 5) Readahead:
> > > > > Expand extent cache by readaheading when we call get_dnode_of_data in non-emergency
> > > > > path. Note, ra can affect lock contention for both ->rwlock in extent cache and
> > > > > ->lock of global shrink list.
> > > > >
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/