Re: Questions on prio_tree

From: Rajesh Venkatasubramanian
Date: Sun Apr 18 2004 - 18:48:24 EST

Hi Kevin,

Thanks for you analysis and comments.

> 1 - Why does prio_tree_expand use the max_heap_index to determine the
> height of the radix tree? The bits of the radix_index is only ever used
> (never the bits in the heap_index), so shouldn't max_radix_index be used?

In some cases heap_index is also used. As you noted below sometimes
size = heap_index - radix_index is used for indexing the tree. If you
consider radix_index = 0UL, then size == heap_index.

> I understand that using the max_heap_index isn't necessary broken (it is
> guaranteed to be greater than or equal to the max_radix_index), but it
> seems inefficient. As an example, consider the case where 100 vmas are
> created all starting at address 0 and ending at addresses 0 through 100.
> The current code will force the address part of the tree to a height of 7,
> but since there is only one start address (0) the tree really could have a
> height of just 1.

Good example. Actually, your example is the exact reason why we use
max_heap_index rather than max_radix_index.

It is clear that we cannot store 101 vmas if the height of the tree is
just 1. We need at least 7 levels to store 101 vmas. Yes. The first 7
levels of the tree will be skewed because radix_index (all 0) is used
for indexing first 7 (root->index_bits) levels. From level 8 to 14 the
size (== heap_index because radix_index is always 0) is used for indexing
the tree. So from level 8 to 14 it will be a fully packed (dense) radix
tree where size (== heap_index) is used as the index.

Yes. This is bit inefficient. But, it provides a clean solution for a
corner case where all vmas start from the same pgoff (radix_index)
with different end addresses. This is a rare corner case, so we don't
have to worry much about the performance here.

> 2 - When traversing the tree, after the first root->index_bits levels of
> the tree have been walked, the code starts inspecting the "size" portion of
> the radix key. Why does the code set the start bit for the size portion to
> root->index_bits? Shouldn't the code set the start bit to (BITS_PER_LONG -
> Consider the case where 100 vmas are created all starting at address 0 and
> ending at addresses 0 through 100, followed by the creation of a vma from
> address 100,000 - 100,001. After the first 100 vmas, the tree will have a
> height of 7 (reasons sited above) for the start address and an additional
> height of 7 for the size portion of the index.


> This will cause the tree to
> have 7 records in the start_address portion of the tree, and the remaining
> 93 will be distributed pretty evenly in the size portion of the radix tree
> that hangs off the zero address.


> When the 101st vma is added, the start
> address portion of the tree will be expanded to a height of 17. The
> current code will expand the tree to contain 17 records in the size portion
I think it is typo: 17 records in the start_address portion

> of the radix tree and leave the remaining 84 records in the size portion of
> the tree hanging off the zero address.


> The problem, as I see it, is that new inserts into the tree that start at
> address zero will not be inspecting the same bits that were used when the
> first 100 were inserted. (When the first 100 were inserted, bits 1-7 were
> used to branch in the size portion of the tree, but inserts 102 and on will
> use bits 1-17.)

Thanks for your analysis. This is the reason why the expand algorithm is
O((log n)^2) algorithm.

Check the prio_tree_remove in prio_tree_expand. If the tree is expanded
more than 1 index_bit, then we start removing nodes from the current tree
and form a new skewed tree.

In your example, the index_bits changes from 7 to 17. So, (17 - 7) - 1
= 9 nodes will be removed from the current tree before adding the 101st
vma. So, at the end of the removal of 9 nodes from the current tree,
there are 91 vmas (7 in the skewed start_address portion and the
remaining 84 in the size portion that hangs of the radix_index zero).
Now, the 9 removed vmas and the newly added vma forms a skewed tree of
height 10. The current tree with 91 vmas is appened at the end of this
10 vma skewed tree forming a 17 vma skewed tree in the start_address
portion and 84 vma dense tree in the size portion.

> I don't think this can cause a "crash", but I think it
> skews the tree ordering and could disturb the O(log n + m) behavior of the
> algorithm.

Even when the tree is skewed we guarantee O(log n + m). Yeah, we hide
the multiplicative constant (2). But then, that is the beauty of big-O
notation, right? Hiding the practical performance issues.

The tree gets skewed only in very rare corner cases. So the performance
issues should not matter much, IMHO.


To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at