Re: Filesystem optimization..

Eric W. Biederman (ebiederm+eric@npwt.net)
29 Dec 1997 22:08:45 -0600


>>>>> "MR" == Michael O'Reilly <michael@metal.iinet.net.au> writes:

MR> People also have pointed out things like btree-based directory trees etc,
MR> but btree directories are a win when you have large directories, as
MR> opposed to lots of directories.

Actually a good implementation holds _all_ directories in a single
btree. This is a win because most directories are smaller than a single
disk block, so multiple directories can share storage. And since the
entries are sorted, you don't need to search an entire directory on a
lookup.
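
For illustration, here is a minimal sketch of the keying such an
implementation might use (the struct and comparison function are
hypothetical, not taken from any real filesystem): keying every entry
by (parent inode, name) keeps one directory's entries adjacent in the
tree, lets several small directories share a tree block, and makes a
lookup a sorted search instead of a linear scan.

    /* Hypothetical composite key for a single btree holding every
       directory entry in the filesystem.  Entries sort first by the
       inode number of the containing directory, then by name, so one
       directory's entries are adjacent in the tree. */
    #include <string.h>

    struct dirent_key {
        unsigned long parent_ino;   /* inode of containing directory */
        char          name[256];    /* entry name, NUL-terminated */
    };

    /* Ordering function the btree would use. */
    static int dirent_key_cmp(const struct dirent_key *a,
                              const struct dirent_key *b)
    {
        if (a->parent_ino != b->parent_ino)
            return a->parent_ino < b->parent_ino ? -1 : 1;
        return strcmp(a->name, b->name);
    }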

But this appears to be academic for the moment.

MR> The critical function I'm trying to optimize is the latency of the
MR> open() system call.
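
As an aside, that latency is easy to measure directly from userland.
A rough sketch (just gettimeofday() around each call; run it on paths
that are cold in the cache to see the seek cost being discussed):

    /* Rough open() latency measurement: time each open() and report
       the average in microseconds. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    int main(int argc, char **argv)
    {
        struct timeval t0, t1;
        long total = 0;
        int i;

        for (i = 1; i < argc; i++) {
            int fd;
            gettimeofday(&t0, NULL);
            fd = open(argv[i], O_RDONLY);
            gettimeofday(&t1, NULL);
            if (fd >= 0)
                close(fd);
            total += (t1.tv_sec - t0.tv_sec) * 1000000L
                   + (t1.tv_usec - t0.tv_usec);
        }
        if (argc > 1)
            printf("avg open() latency: %ld us\n", total / (argc - 1));
        return 0;
    }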

MR> In practice, on a large server, it's rare to get a very high level of
MR> cache hits (a 3-million-file filesystem would need 384M of RAM just to
MR> hold the inode tables in the best case, ignoring all the directories,
MR> the other meta-data, and the on-going disk activity).
>>
>> Perhaps the directory cache is too small for your machine?

MR> There are around 390,000 directories holding those files. Just how big
MR> did you want the directory cache to get!?

The default size is 128 entries per level, making for a total of 256
entries in the two-level cache in the stable kernels. It might be
worth increasing DCACHE_SIZE some. The development series seems to
increase this to about 1024 entries, extended with chaining.
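
Roughly, a two-level scheme of that sort behaves like the sketch below
(much simplified, and not the actual fs/dcache.c code -- the real
cache hashes names rather than scanning arrays):

    /* Much-simplified two-level name cache.  New names go into
       level 1; a name that is looked up again is promoted into
       level 2, so a burst of one-shot lookups cannot flush the
       frequently used entries.  Illustrative only. */
    #include <string.h>

    #define DCACHE_SIZE 128

    struct entry { char name[64]; unsigned long ino; };

    static struct entry level1[DCACHE_SIZE], level2[DCACHE_SIZE];
    static int next1, next2;    /* round-robin replacement cursors */

    static unsigned long cache_lookup(const char *name)
    {
        int i;
        for (i = 0; i < DCACHE_SIZE; i++)
            if (strcmp(level2[i].name, name) == 0)
                return level2[i].ino;       /* hot entry */
        for (i = 0; i < DCACHE_SIZE; i++)
            if (strcmp(level1[i].name, name) == 0) {
                level2[next2] = level1[i];  /* promote on re-use */
                next2 = (next2 + 1) % DCACHE_SIZE;
                return level1[i].ino;
            }
        return 0;                           /* miss (0 = no inode) */
    }

    static void cache_add(const char *name, unsigned long ino)
    {
        strncpy(level1[next1].name, name, sizeof(level1[0].name) - 1);
        level1[next1].name[sizeof(level1[0].name) - 1] = '\0';
        level1[next1].ino = ino;
        next1 = (next1 + 1) % DCACHE_SIZE;
    }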

MR> The point is that caching simply won't work. This is something very
MR> close to random open()s over the entire filesystem. Unless the cache
MR> size is greater than the meta-data, the cache locality will always be
MR> very poor.

MR> So: Given that you _are_ going to get a cache miss, how do you speed
MR> it up? The obvious way is to try and eliminate the separate inode
MR> seek.

Another thing in the area of seeking that may be worth doing is
checking whether the kernel actually uses an elevator algorithm. I
got the impression a while back that it does first-come, first-served
disk access. A little optimizing of the request order (for the
accesses that miss the cache) might help.
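
The elevator itself is just sorted insertion into the request queue.
A sketch of the idea (illustrative only, not the actual drivers/block
code):

    /* Elevator (one-way scan) sketch: keep pending requests sorted
       by block number, so the head sweeps across the disk instead of
       bouncing back and forth FCFS-style. */
    #include <stddef.h>

    struct request {
        unsigned long   block;      /* starting block of the request */
        struct request *next;
    };

    /* Insert rq into the queue, keeping it sorted by block number. */
    static void elevator_insert(struct request **queue,
                                struct request *rq)
    {
        struct request **p = queue;
        while (*p && (*p)->block < rq->block)
            p = &(*p)->next;
        rq->next = *p;
        *p = rq;
    }

Serviced in ascending block order, the head makes one sweep across the
disk per pass, which bounds the total seek distance.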

But the truth is you should aim to have as much of your directory
structure and inodes in RAM as you can. With only 3 million files and
1GB of RAM you could allocate about 350 bytes per file. Of course
something quite that simple would be silly, but at 3 million files you
are not out of the range (except perhaps pocketbook-wise) of using RAM
for a considerable cache, even on the Intel architecture. With 390,000
directories you could probably keep them all in RAM: at 2K each that
would be roughly 760M, a large but perhaps doable number.
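
Spelling that arithmetic out:

    3,000,000 files * ~350 bytes each = ~1GB   (cache everything)
      390,000 dirs  *    2K each      = ~760M  (all directories in core)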

If you want to play around, you could run my shmfs filesystem as a test.
It has the deficiency that it loses everything at shutdown, but in
every test I have run it has been as fast as or faster than ext2. And
it keeps all of its inodes in RAM, and all of the page information.
It's at http://www.npwt.net/~ebiederm/files/shmfs-0.0.020.tar.gz.
And that's my shameless plug for beta testers :)

MR> The filenames are all 8 letters long. The issue isn't the directory
MR> cache. The issue is the (IMHO) large number of seeks needed to read
MR> the first block of a file.

I used the filename example only because it is easy to see. The point
is that, in the worst case, ext2 isn't very good, as you know.

Eric