Re: Is any file system on Linux appropriate for very large directories

Kai Henningsen (kai@khms.westfalen.de)
27 Jul 1996 22:35:00 +0200


alan@lxorguk.ukuu.org.uk (Alan Cox) wrote on 27.07.96 in <m0ukD1n-0005FbC@lightning.swansea.linux.org.uk>:

> > Anyway, here is my current "perfect hashing" db source. As I explained
> > before, it's currently used to keep track of message ids, not file names;
> > and it's performing quite well with an amount of data I didn't expect to
> > handle when I coded it.
>
> Very interesting from a different angle. What's its worst-case add/delete
> time like? We have another "mostly lookup", performance-critical search
> problem - socket demultiplexing

Well, as I said, it doesn't shrink (just like typical Unix directories),
so delete is constant at one lookup + one bucket write.

As for insert, you again have one lookup, one or (on a split) two bucket
writes, one item write, and possibly (very seldom) one main directory
duplication. Main directory duplication does exactly that: it appends a
copy of the main directory to the end of itself.

Oh yeah, and when you split the bucket, you have to update the main
directory.

Numbers again - this is my current database:

Volume in drive I is NEWS     Serial number is 26BB:3415
Directory of i:\xp\dupes\*.*

.            <DIR>           5.07.96  18.47
..           <DIR>           5.07.96  18.47
msgdb.bkt         20135936  27.07.96  21.45
msgdb.dir            16386  27.07.96  21.48
msgdb.rec         98994132  27.07.96  21.45

119.146.454 bytes in 3 files and 2 dirs    119.177.216 bytes allocated
349.323.264 bytes free

That is, the main directory is currently about 0.014% of the database
size. (Hmm. And I don't believe the "allocated" number; this is HPFS,
which allocates in multiples of sectors. Oh well.)

As for socket demultiplexing - hmm. I wonder whether you'd need a hash
there at all, that is, whether the values are sufficiently random per se.
Probably not ...
And of course, the timing is seriously different for an in-memory
algorithm; the linear search in the buckets may well become a problem
there.

I suspect this one is not ideal for in-memory searches. It is designed for
a situation where I/O dominates the timing.

Anyway, I can definitely recommend the book I got this from. It does
exactly what I bought it for: tell me about lots of different algorithms,
in useful detail. It's surprising how many algorithm books document, for
example, one or two tree variants in usable detail and are content with
that (they may mention more, but not in enough detail to let me implement
them, let alone find out whether it's a good idea!) :-(

It's _Handbook_of_Algorithms_and_Data_Structures_ In Pascal and C, Second
Edition (1991), G. H. Gonnet, R. Baeza-Yates, Addison-Wesley, ISBN
0-201-41607-7. [While I don't see an explicit mention, it looks like it's
typeset with TeX. This impression is reinforced by their offer of $4 for
each first report of an error.] It has extensive bibliographies (36
textbooks, 1350 (no typo) papers), and lots of run-time calculations and
simulation timings. 424 pages including the index, hard cover. Cost me DM
89,- and is definitely worth it.

MfG Kai