Re: [PATCH v2 2/6] maple_tree: Introduce interfaces __mt_dup() and mtree_dup()

From: Liam R. Howlett
Date: Fri Sep 08 2023 - 12:06:44 EST


* Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx> [230908 05:26]:
>
>
> 在 2023/9/8 04:13, Liam R. Howlett 写道:
> > * Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx> [230830 08:57]:
> > > Introduce interfaces __mt_dup() and mtree_dup(), which are used to
> > > duplicate a maple tree. Compared with traversing the source tree and
> > > reinserting entry by entry in the new tree, it has better performance.
> > > The difference between __mt_dup() and mtree_dup() is that mtree_dup()
> > > handles locks internally.
> >
> > __mt_dup() should be called mas_dup() to indicate the advanced interface
> > which requires users to handle their own locks.
> Ok, I'll change __mt_dup() to mas_dup().
> >
> > >
> > > Signed-off-by: Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx>
> > > ---
> > > include/linux/maple_tree.h | 3 +
> > > lib/maple_tree.c | 265 +++++++++++++++++++++++++++++++++++++
> > > 2 files changed, 268 insertions(+)
> > >
> > > diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
> > > index e41c70ac7744..44fe8a57ecbd 100644
> > > --- a/include/linux/maple_tree.h
> > > +++ b/include/linux/maple_tree.h
> > > @@ -327,6 +327,9 @@ int mtree_store(struct maple_tree *mt, unsigned long index,
> > > void *entry, gfp_t gfp);
> > > void *mtree_erase(struct maple_tree *mt, unsigned long index);
> > > +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
> > > +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
> > > +
> > > void mtree_destroy(struct maple_tree *mt);
> > > void __mt_destroy(struct maple_tree *mt);
> > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> > > index ef234cf02e3e..8f841682269c 100644
> > > --- a/lib/maple_tree.c
> > > +++ b/lib/maple_tree.c
> > > @@ -6370,6 +6370,271 @@ void *mtree_erase(struct maple_tree *mt, unsigned long index)
> > > }
> > > EXPORT_SYMBOL(mtree_erase);
> > > +/*
> > > + * mas_dup_free() - Free a half-constructed tree.
> >
> > Maybe "Free an incomplete duplication of a tree" ?
> >
> > > + * @mas: Points to the last node of the half-constructed tree.
> >
> > Your use of "Points to" seems to indicate someone knows you are talking
> > about a "maple state that has a node pointing to". Can this be made
> > more clear?
> > @mas: The maple state of a incomplete tree.
> >
> > Then add a note that @mas->node points to the last successfully
> > allocated node?
> >
> > Or something along those lines.
> Ok, I'll revise the comment.
> >
> > > + *
> > > + * This function frees all nodes starting from @mas->node in the reverse order
> > > + * of mas_dup_build(). There is no need to hold the source tree lock at this
> > > + * time.
> > > + */
> > > +static void mas_dup_free(struct ma_state *mas)
> > > +{
> > > + struct maple_node *node;
> > > + enum maple_type type;
> > > + void __rcu **slots;
> > > + unsigned char count, i;
> > > +
> > > + /* Maybe the first node allocation failed. */
> > > + if (!mas->node)
> > > + return;
> > > +
> > > + while (!mte_is_root(mas->node)) {
> > > + mas_ascend(mas);
> > > +
> > > + if (mas->offset) {
> > > + mas->offset--;
> > > + do {
> > > + mas_descend(mas);
> > > + mas->offset = mas_data_end(mas);
> > > + } while (!mte_is_leaf(mas->node));
> >
> > Can you blindly descend and check !mte_is_leaf()? What happens when the
> > tree duplication fails at random internal nodes? Maybe I missed how
> > this cannot happen?
> This cannot happen. Note the mas_ascend(mas) at the beginning of the
> outermost loop.
>
> >
> > > +
> > > + mas_ascend(mas);
> > > + }
> > > +
> > > + node = mte_to_node(mas->node);
> > > + type = mte_node_type(mas->node);
> > > + slots = (void **)ma_slots(node, type);
> > > + count = mas_data_end(mas) + 1;
> > > + for (i = 0; i < count; i++)
> > > + ((unsigned long *)slots)[i] &= ~MAPLE_NODE_MASK;
> > > +
> > > + mt_free_bulk(count, slots);
> > > + }
> >
> >
> > > +
> > > + node = mte_to_node(mas->node);
> > > + mt_free_one(node);
> > > +}
> > > +
> > > +/*
> > > + * mas_copy_node() - Copy a maple node and allocate child nodes.
> >
> > if required. "..and allocate child nodes if required."
> >
> > > + * @mas: Points to the source node.
> > > + * @new_mas: Points to the new node.
> > > + * @parent: The parent node of the new node.
> > > + * @gfp: The GFP_FLAGS to use for allocations.
> > > + *
> > > + * Copy @mas->node to @new_mas->node, set @parent to be the parent of
> > > + * @new_mas->node and allocate new child nodes for @new_mas->node.
> > > + * If memory allocation fails, @mas is set to -ENOMEM.
> > > + */
> > > +static inline void mas_copy_node(struct ma_state *mas, struct ma_state *new_mas,
> > > + struct maple_node *parent, gfp_t gfp)
> > > +{
> > > + struct maple_node *node = mte_to_node(mas->node);
> > > + struct maple_node *new_node = mte_to_node(new_mas->node);
> > > + enum maple_type type;
> > > + unsigned long val;
> > > + unsigned char request, count, i;
> > > + void __rcu **slots;
> > > + void __rcu **new_slots;
> > > +
> > > + /* Copy the node completely. */
> > > + memcpy(new_node, node, sizeof(struct maple_node));
> > > +
> > > + /* Update the parent node pointer. */
> > > + if (unlikely(ma_is_root(node)))
> > > + val = MA_ROOT_PARENT;
> > > + else
> > > + val = (unsigned long)node->parent & MAPLE_NODE_MASK;
> >
> > If you treat the root as special and outside the loop, then you can
> > avoid the check for root for every non-root node. For root, you just
> > need to copy and do this special parent thing before the main loop in
> > mas_dup_build(). This will avoid an extra branch for each VMA over 14,
> > so that would add up to a lot of instructions.
> I'll handle the root node outside.
> However, do you think it makes sense to have the parent of the root node
> point to the struct maple_tree? I don't see it used anywhere.

I'm not sure. It needs to not point to itself (indicating it is dead),
and we need to tell it's the root node, but I'm not entirely sure it is
necessary to point to the maple_tree.. although it is useful in dumps
sometimes.

>
> >
> > > +
> > > + new_node->parent = ma_parent_ptr(val | (unsigned long)parent);
> > > +
> > > + if (mte_is_leaf(mas->node))
> > > + return;
> >
> > You are checking here and in mas_dup_build() for the leaf, splitting the
> > function into parent assignment and allocate would allow you to check
> > once. Copy could be moved to the main loop or with the parent setting,
> > depending on how you handle the root suggestion above.
> I'll try to reduce some checks.
> >
> > > +
> > > + /* Allocate memory for child nodes. */
> > > + type = mte_node_type(mas->node);
> > > + new_slots = ma_slots(new_node, type);
> > > + request = mas_data_end(mas) + 1;
> > > + count = mt_alloc_bulk(gfp, request, new_slots);
> > > + if (unlikely(count < request)) {
> > > + if (count)
> > > + mt_free_bulk(count, new_slots);
> >
> > The new_slots will still contain the addresses of the freed nodes.
> > Don't you need to clear it here to avoid a double free? Is there a
> > test case for this in your testing? Again, I may have missed how this
> > is not possible..
> It's impossible, because in mt_free_bulk(), the first thing to do with
> the incoming node is to go up. We free all child nodes at the parent
> node.
>
> We guarantee that the node passed to mas_dup_free() is "clean".

You mean there are no allocations below?

> mas_dup_free() also follows this so will not free children of this node.
>
> The child nodes of this node cannot be freed in mt_free_bulk() because
> the node is not completely constructed and data_end cannot be obtained.
> data_end cannot be set on this node because the number of successfully
> allocated child nodes can be 0.

It still seems unwise to keep pointers pointing to unallocated memory
here, can we just clear the slots?

It's a bit odd because our choice is to leave it with pointers to nodes
in another tree or potential unallocated memory that isn't anywhere.
Both are a bit unnerving passing into another function that cleans
things up. Since it's the error path, we won't have a performance
penalty in wiping the slots and it doesn't really matter that the node
isn't valid. It seems more likely we would catch the error in a more
identifiable place if we set the slots to NULL.

> >
> > > + mas_set_err(mas, -ENOMEM);
> > > + return;
> > > + }
> > > +
> > > + /* Restore node type information in slots. */
> > > + slots = ma_slots(node, type);
> > > + for (i = 0; i < count; i++)
> > > + ((unsigned long *)new_slots)[i] |=
> > > + ((unsigned long)mt_slot_locked(mas->tree, slots, i) &
> > > + MAPLE_NODE_MASK);
> >
> > Can you expand this to multiple lines to make it more clear what is
> > going on?
> I will try to do that.
>
> >
> > > +}
> > > +
> > > +/*
> > > + * mas_dup_build() - Build a new maple tree from a source tree
> > > + * @mas: The maple state of source tree.
> > > + * @new_mas: The maple state of new tree.
> > > + * @gfp: The GFP_FLAGS to use for allocations.
> > > + *
> > > + * This function builds a new tree in DFS preorder. If the memory allocation
> > > + * fails, the error code -ENOMEM will be set in @mas, and @new_mas points to the
> > > + * last node. mas_dup_free() will free the half-constructed tree.
> > > + *
> > > + * Note that the attributes of the two trees must be exactly the same, and the
> > > + * new tree must be empty, otherwise -EINVAL will be returned.
> > > + */
> > > +static inline void mas_dup_build(struct ma_state *mas, struct ma_state *new_mas,
> > > + gfp_t gfp)
> > > +{
> > > + struct maple_node *node, *parent;
> >
> > Could parent be struct maple_pnode?
> I'll rename it.
>
> >
> > > + struct maple_enode *root;
> > > + enum maple_type type;
> > > +
> > > + if (unlikely(mt_attr(mas->tree) != mt_attr(new_mas->tree)) ||
> > > + unlikely(!mtree_empty(new_mas->tree))) {
> > > + mas_set_err(mas, -EINVAL);
> > > + return;
> > > + }
> > > +
> > > + mas_start(mas);
> > > + if (mas_is_ptr(mas) || mas_is_none(mas)) {
> > > + /*
> > > + * The attributes of the two trees must be the same before this.
> > > + * The following assignment makes them the same height.
> > > + */
> > > + new_mas->tree->ma_flags = mas->tree->ma_flags;
> > > + rcu_assign_pointer(new_mas->tree->ma_root, mas->tree->ma_root);
> > > + return;
> > > + }
> > > +
> > > + node = mt_alloc_one(gfp);
> > > + if (!node) {
> > > + new_mas->node = NULL;
> >
> > We don't have checks around for node == NULL, MAS_NONE would be a safer
> > choice. It is unlikely that someone would dup the tree and fail then
> > call something else, but I avoid setting node to NULL.
> I will set it to MAS_NONE in the next version.
>
> >
> > > + mas_set_err(mas, -ENOMEM);
> > > + return;
> > > + }
> > > +
> > > + type = mte_node_type(mas->node);
> > > + root = mt_mk_node(node, type);
> > > + new_mas->node = root;
> > > + new_mas->min = 0;
> > > + new_mas->max = ULONG_MAX;
> > > + parent = ma_mnode_ptr(new_mas->tree);
> > > +
> > > + while (1) {
> > > + mas_copy_node(mas, new_mas, parent, gfp);
> > > +
> > > + if (unlikely(mas_is_err(mas)))
> > > + return;
> > > +
> > > + /* Once we reach a leaf, we need to ascend, or end the loop. */
> > > + if (mte_is_leaf(mas->node)) {
> > > + if (mas->max == ULONG_MAX) {
> > > + new_mas->tree->ma_flags = mas->tree->ma_flags;
> > > + rcu_assign_pointer(new_mas->tree->ma_root,
> > > + mte_mk_root(root));
> > > + break;
> >
> > If you move this to the end of the function, you can replace the same
> > block above with a goto. That will avoid breaking the line up.
> I can do this, but it doesn't seem to make a difference.

Thanks, Just for clarity of keeping it all on one line and sine there's
going to be a respin of the set anyways..

> >
> > > + }
> > > +
> > > + do {
> > > + /*
> > > + * Must not at the root node, because we've
> > > + * already end the loop when we reach the last
> > > + * leaf.
> > > + */
> >
> > I'm not sure what the comment above is trying to say. Do you mean "This
> > won't reach the root node because the loop will break when the last leaf
> > is hit"? I don't think that is accurate.. it will hit the root node but
> > not the end of the root node, right? Anyways, the comment isn't clear
> > so please have a look.
> Yes, it will hit the root node but not the end of the root node. I'll
> fix this comment. Thanks.
>
> >
> > > + mas_ascend(mas);
> > > + mas_ascend(new_mas);
> > > + } while (mas->offset == mas_data_end(mas));
> > > +
> > > + mas->offset++;
> > > + new_mas->offset++;
> > > + }
> > > +
> > > + mas_descend(mas);
> > > + parent = mte_to_node(new_mas->node);
> > > + mas_descend(new_mas);
> > > + mas->offset = 0;
> > > + new_mas->offset = 0;
> > > + }
> > > +}
> > > +
> > > +/**
> > > + * __mt_dup(): Duplicate a maple tree
> > > + * @mt: The source maple tree
> > > + * @new: The new maple tree
> > > + * @gfp: The GFP_FLAGS to use for allocations
> > > + *
> > > + * This function duplicates a maple tree using a faster method than traversing
> > > + * the source tree and inserting entries into the new tree one by one.
> >
> > Can you make this comment more about what your code does instead of the
> > "one by one" description?
> >
> > > + * The user needs to ensure that the attributes of the source tree and the new
> > > + * tree are the same, and the new tree needs to be an empty tree, otherwise
> > > + * -EINVAL will be returned.
> > > + * Note that the user needs to manually lock the source tree and the new tree.
> > > + *
> > > + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
> > > + * the attributes of the two trees are different or the new tree is not an empty
> > > + * tree.
> > > + */
> > > +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
> > > +{
> > > + int ret = 0;
> > > + MA_STATE(mas, mt, 0, 0);
> > > + MA_STATE(new_mas, new, 0, 0);
> > > +
> > > + mas_dup_build(&mas, &new_mas, gfp);
> > > +
> > > + if (unlikely(mas_is_err(&mas))) {
> > > + ret = xa_err(mas.node);
> > > + if (ret == -ENOMEM)
> > > + mas_dup_free(&new_mas);
> > > + }
> > > +
> > > + return ret;
> > > +}
> > > +EXPORT_SYMBOL(__mt_dup);
> > > +
> > > +/**
> > > + * mtree_dup(): Duplicate a maple tree
> > > + * @mt: The source maple tree
> > > + * @new: The new maple tree
> > > + * @gfp: The GFP_FLAGS to use for allocations
> > > + *
> > > + * This function duplicates a maple tree using a faster method than traversing
> > > + * the source tree and inserting entries into the new tree one by one.
> >
> > Again, it's more interesting to state it uses the DFS preorder copy.
> >
> > It is also worth mentioning the superior allocation behaviour since that
> > is a desirable trait for many. In fact, you should add the allocation
> > behaviour in your cover letter.
> Okay, I will describe more in the next version.
>
> >
> > > + * The user needs to ensure that the attributes of the source tree and the new
> > > + * tree are the same, and the new tree needs to be an empty tree, otherwise
> > > + * -EINVAL will be returned.
> > > + *
> > > + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
> > > + * the attributes of the two trees are different or the new tree is not an empty
> > > + * tree.
> > > + */
> > > +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
> > > +{
> > > + int ret = 0;
> > > + MA_STATE(mas, mt, 0, 0);
> > > + MA_STATE(new_mas, new, 0, 0);
> > > +
> > > + mas_lock(&new_mas);
> > > + mas_lock(&mas);
> > > +
> > > + mas_dup_build(&mas, &new_mas, gfp);
> > > + mas_unlock(&mas);
> > > +
> > > + if (unlikely(mas_is_err(&mas))) {
> > > + ret = xa_err(mas.node);
> > > + if (ret == -ENOMEM)
> > > + mas_dup_free(&new_mas);
> > > + }
> > > +
> > > + mas_unlock(&new_mas);
> > > +
> > > + return ret;
> > > +}
> > > +EXPORT_SYMBOL(mtree_dup);
> > > +
> > > /**
> > > * __mt_destroy() - Walk and free all nodes of a locked maple tree.
> > > * @mt: The maple tree
> > > --
> > > 2.20.1
> > >