New version of the ext3 indexed-directory patch

From: tytso@mit.edu
Date: Fri Sep 20 2002 - 11:41:12 EST


I've done a bunch of hacking on the ext3 indexed directory patch, and I
believe it's just about ready for integration with the 2.5 tree.
Testing and comments are appreciated.

The code can be found either via bitkeeper, at:

        bk://extfs.bkbits.net/ext3-htree-2.4
and
        bk://extfs.bkbits.net/ext3-htree-2.5

Or for those people who want straight diffs, patches against 2.4
and 2.5 can be found at:

        http://thunk.org/tytso/linux/ext3-dxdir

                                                - Ted

Here are my Release Notes for my changes:
        (i.e., the changes from Christopher Li's port of Daniel
        Phillip's hashtree code)

*) The Ext3 hash-tree code uses the new TEA and half-MD4 hash that ships
        with e2fsprogs 1.28

*) The code has been massively reorganized and rewritten to make it more
        maintainable. In particular, the horrible mess which had been
        ext3_find_entry and ext3_add_entry has been broken up into
        multiple functions. This eliminated a lot of goto's, and more
        importantly, found some memory leaks which have been eliminated.
        Some were only in the error paths, but some were in normal code
        paths that would be executed after a split, for example.

*) As a side effect of the reorganization, the ext3_find_entry() and
        ext3_readdir() paths can now support arbitrary numbers of
        indirect levels in the tree. (The resulting code was simpler,
        too!) The one code path that still could use some work is the
        split handling on the insert path. Once that is generalized, we
        will be able to remove the depth restriction on the indexed tree.

*) The error handling has been cleaned up, so that they are appropriately
        reported back up to userspace, where approach. Ext3_std_error()
        is now called when the journal routines report an error, in line
        with all of the other journalling calls in the rest of the ext3
        codebase.

*) Ext3_readdir now traverses the directory in hash sort order. This
        (mostly) solves the requirement that readdir not return doubled
        file entries, or skip a block of files, which could happen if
        another process managed to trigger a tree split while the
        readdir operation was going on. Unfortunately, on 32 bit
        machines, we have to use a 31-bit sort under all conditions,
        even if telldir64() is used in userspace. This is because the
        VFS uses a single path for sys_lseek() and sys_llseek(), and so
        the ext3 code can't tell whether to send back a 32-bit value or
        a 64-bit value. And, if we send a 64-bit positional value when
        the 32-bit lseek() was called, lseek() will bomb out.

        This is unfortunate, but the chances of our hitting this failure
        more are relatively small. Since readdir (well, sys_getdents64)
        returns 4k of data at at time, that means a directory with
        400,000 entries have approximately 3,000 getdents() system
        calls, with each getdents() returning approximately 133 entries.
        In order to trigger the problem, the 31-bit hash collision
        must be positioned such that it straddles one of the 4k getdents
        blocks --- while a split operation is taking place.

        (If we were willing to modify the VFS layer, so that
        ext3_readdir() knew how much space was left in the filldir
        buffer, we could be even smarter about things by stopping if
        there is only space for less than all of the entries that have
        the same hash value. I'm not entirely convinced this is
        necessary, though. The risk is low and in the worst case, one
        or two filenames will simply be returned twice by readdir, which
        shouldn't cause problems for most applications.)

*) Ext3_readdir(), ext3_add_entry(), and ext3_find_entry() now
        automatically fall back to the old linear directory code if they
        find a indexed directory format they don't understand. This
        allows us to add new hash versions, and/or allow the depth of
        the tree to be increased, without losing backwards compatiblity
        with deployed kernels.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Mon Sep 23 2002 - 22:00:31 EST