Re: Mercurial 0.4b vs git patchbomb benchmark

From: Matt Mackall
Date: Fri Apr 29 2005 - 15:28:48 EST

On Fri, Apr 29, 2005 at 12:50:55PM -0700, Linus Torvalds wrote:
> On Fri, 29 Apr 2005, Matt Mackall wrote:
> >
> > Here's an excerpt from on how
> > the back-end works.
> Any notes on how you maintain repository-level information?
> For example, the expense in BK wasn't the single-file history, it was the
> _repository_ history, ie the "ChangeSet" file. Which grows quite slowly,
> but because it _always_ grows, it ends up being quite big and expensive to
> parse after three years.
> Ie do you have the git kind of "independent trees/commits", or do you
> create a revision history of those too?

The changeset log (and everything else) has an external index. The
index is basically an array of (base, offset, length, parent1-hash,
parent2-hash, my-hash). This has everything you need to reconstruct a
given file revision with one seek/read into the data stream itself,
and also everything you need for doing graph merging.

This is small enough (68 bytes, currently) that the index for a
million changesets can be read into memory in a couple seconds or so,
even in Python. It can also be mmapped and random accessed since the
index entries are fixed-sized. (And it's already stored big-endian.)

So you never have to read all the data. You also never need more than
a few indices in memory at once. And you never have to rewrite the
data (it's all append-only), except to do a bulk copy when you break a

Mathematics is the supreme nostalgia of our time.
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at