Re: Tux3 Report: Initial fsck has landed

From: Daniel Phillips
Date: Fri Mar 22 2013 - 01:42:10 EST


Hi Dave,

Thank you for your insightful post. The answer to the riddle is that
the PHTree scheme as described in the link you cited has already
become "last gen" and that, after roughly ten years of searching, I am
cautiously optimistic that I have discovered a satisfactory next gen
indexing scheme with the properties I was seeking. This is what
Hirofumi and I have been busy prototyping and testing for the last few
weeks. More below...

On Thu, Mar 21, 2013 at 6:57 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Wed, Mar 20, 2013 at 06:49:49PM -0700, Daniel Phillips wrote:
>> At the moment we're being pretty quiet because of being in the middle
>> of developing the next-gen directory index. Not such a small task, as
>> you might imagine.
>
> The "next-gen directory index" comment made me curious. I wanted to
> know if there's anything I could learn from what you are doing and
> whether anything of your new algorithms could be applied to, say,
> the XFS directory structure to improve it.
>
> I went looking for design docs and found this:
>
> http://phunq.net/pipermail/tux3/2013-January/001938.html
>
> In a word: Disappointment.

Me too. While I had convinced myself that the PHTree scheme would
scale significantly better than HTree, and be only modestly slower in
the smaller range (millions of files), even the new scheme ran into
significant difficulties in the form of write multiplication in the
larger range (billions of files). Most probably,
you discovered the same thing. The problem is not so much about
randomly thrashing the index, because these days even a cheap desktop
can cache the entire index, but rather getting the index onto disk
with proper atomic update at reasonable intervals. We can't accept a
situation where crashing on the 999,999,999th file create requires the
entire index to be rebuilt, or even a significant portion of it. That
means we need ACID commit at normal intervals all the way through the
heavy create load, and unfortunately, that's where the write
multiplication issue rears its ugly head. It turned out that most of
the PHTree index blocks would end up being written to disk hundreds of
times each, effectively stretching out what should be a 10-minute test
to hours.
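
To make that concrete, here is a toy model of the commit pattern, in
C. Everything in it is a stand-in assumption rather than a measurement
from our tests: index blocks holding 200 entries each, an atomic
commit of all dirty blocks every 10,000 creates, and hash order
modeled as a uniformly random block choice per create. With those
numbers it comes out to nearly a hundred writes per index block
instead of one:

/* Toy model of write multiplication under periodic commit.
 * All parameters are purely illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CREATES      1000000   /* file creates in the run           */
#define PER_BLOCK    200       /* directory entries per index block */
#define COMMIT_EVERY 10000     /* creates between atomic commits    */
#define NBLOCKS      (CREATES / PER_BLOCK)

int main(void)
{
        static char dirty[NBLOCKS];
        long i, j, writes = 0;

        srand(1);
        for (i = 1; i <= CREATES; i++) {
                /* hash order: the new entry lands in a random block */
                dirty[rand() % NBLOCKS] = 1;
                if (i % COMMIT_EVERY == 0) {
                        /* commit writes out every dirty block */
                        for (j = 0; j < NBLOCKS; j++)
                                writes += dirty[j];
                        memset(dirty, 0, sizeof dirty);
                }
        }
        printf("%ld block writes vs %d blocks written once each\n",
               writes, NBLOCKS);
        return 0;
}

The only point is that periodic atomic commit plus hash-random
insertion dirties nearly every index block in nearly every interval,
so each block gets rewritten over and over while it slowly fills.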

To solve this, I eventually came up with a secondary indexing scheme
that would kick in under heavy file create load, to take care of
committing enough state to disk at regular intervals that remounting
after a crash would only lose a few seconds of work. With this, PHTree
would satisfy all the performance goals we set out for it, which can
be summarized as: scale smoothly all the way from one file per
directory to one billion files per directory.

The only really distasteful element remaining was the little matter of
having two different directory indexes, the PHTree and the temporary
secondary index. That seems like one index too many. Then the big aha
landed out of the blue: we can actually throw away the primary BTree
and the secondary index will work fine all on its own. So the
secondary index was suddenly promoted to a "next gen" primary index.
This new index (which does not yet have a name) is based on hashing
and sharding and has nothing whatsoever to do with BTrees. It
currently exists in prototype with enough implementation in place to
get some early benchmarks, but we are not quite ready to provide those
details yet. You are completely welcome to call me a tease or be
sceptical, which is just what I would do coming from the other side,
but just now we're in the thick of the heavy work, and the key as we
see it is to keep on concentrating until the time is right. After all,
this amounts to the end of a ten-year search that began around the
time HTree went into service in Ext3. Another couple of weeks hardly
seems worth worrying about.
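
Since "hashing and sharding" is doing a lot of work in that sentence,
here is a deliberately generic sketch of what the phrase means for a
directory index. To be clear, this is not our new index, whose details
are still to come; it is only the textbook idea that a name hashes to
one of many small, independently updatable shards, so a create or
lookup touches exactly one shard and no global tree needs rebalancing.
The names (shard_insert, shard_lookup, name_hash) are made up for the
example, and collision verification and shard overflow are omitted:

#include <stdio.h>
#include <stdint.h>

#define NSHARDS       256       /* illustrative shard count    */
#define SHARD_ENTRIES 1024      /* illustrative shard capacity */

struct entry { uint32_t hash; uint32_t dirent_block; };
struct shard { int count; struct entry e[SHARD_ENTRIES]; };
static struct shard shards[NSHARDS];

static uint32_t name_hash(const char *name)     /* 32-bit FNV-1a */
{
        uint32_t h = 2166136261u;
        while (*name) {
                h ^= (unsigned char)*name++;
                h *= 16777619u;
        }
        return h;
}

static void shard_insert(const char *name, uint32_t dirent_block)
{
        uint32_t h = name_hash(name);
        struct shard *s = &shards[h % NSHARDS];  /* shard chosen by hash */

        s->e[s->count].hash = h;
        s->e[s->count].dirent_block = dirent_block;
        s->count++;
}

static int shard_lookup(const char *name, uint32_t *dirent_block)
{
        uint32_t h = name_hash(name);
        struct shard *s = &shards[h % NSHARDS];  /* only one shard read */
        int i;

        for (i = 0; i < s->count; i++)
                if (s->e[i].hash == h) {         /* caller verifies name */
                        *dirent_block = s->e[i].dirent_block;
                        return 1;
                }
        return 0;
}

int main(void)
{
        uint32_t blk;

        shard_insert("hello.c", 42);
        if (shard_lookup("hello.c", &blk))
                printf("hello.c -> dirent block %u\n", blk);
        return 0;
}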

> Compared to the XFS directory structure, the most striking
> architectural similarity that I see is this:
>
> "the file bteee[sic] effectively is a second directory index
> that imposes a stable ordering on directory blocks".
>
> That was the key architectural innovation in the XFS directory
> structure that allowed it to provide the correct seekdir/telldir/NFS
> readdir semantics and still scale. i.e. virtually mapped directory
> entries. I explained this layout recently here:
>
> http://marc.info/?l=linux-ext4&m=136081996316453&w=2
> http://marc.info/?l=linux-ext4&m=136082221117399&w=2
> http://marc.info/?l=linux-ext4&m=136089526928538&w=2
>
> We could swap the relevant portions of your PHTree design doc with
> my comments (and vice versa) and both sets of references would still
> make perfect sense. :P
>
> Further, the PHTree description of tag based freespace tracking is
> rather close to how XFS uses tags to track free space regions,
> including the fact that XFS can be lazy at updating global free
> space indexes. The global freespace tree indexing is slightly
> different to the XFS method - it's closer to the original V1 dir
> code in XFS (that didn't scale at all well) than the current code.
> However, that's really a fine detail compared to all the major
> structural and algorithmic similarities.
>
> Hence it appears to me that at a fundamental level PHTree is just a
> re-implementation of the XFS directory architecture. It's definitely
> a *major* step forward from HTree, but it can hardly be considered
> revolutionary or "next-gen". It's not even state of the art. Hence:
> disappointment.

Insightful indeed, and right on the money. I had no idea we were
reinventing XFS to that extent and would love to spend some time later
dwelling on the details. But at this point, I am willing to cautiously
characterize all that as history, based on the performance numbers we
are seeing from the next gen prototype. We plan to publish details
fairly soon. I will apologize again for the lag. I can only plead that
this kind of work just seems to take more time than it reasonably
should.

Regards,

Daniel