Re: [STUPID TESTCASE] ext3 htree vs. reiserfs on 2.5.40-mm1

From: Andreas Dilger (adilger@clusterfs.com)
Date: Tue Oct 01 2002 - 15:43:30 EST


On Oct 01, 2002 23:59 +0400, Paul P Komkoff Jr wrote:
> This is the stupidest testcase I've done but it's worth seeing (maybe)
>
> We create 300000 files named from 00000000 to 000493E0 in one
> directory, then delete them in order.
>
> Tests were run on ext3+htree and reiserfs. ext3 w/o htree wasn't
> evaluated because it would take a long, long time ...
>
> Both filesystems were mounted with noatime,nodiratime, and ext3 was
> mounted with data=writeback to be somewhat fair ...

Why do you think data=writeback is better than data=journal? If the
files have no data then it should not make a difference.
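
For reference, my reading of the testcase (the 8-digit hex names and
zero-length files are my assumptions, since the script itself wasn't
posted) boils down to roughly:

/* create NFILES empty files with 8-digit hex names in the current
 * directory, then unlink them in the same (creation) order */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define NFILES 300000

int main(void)
{
        char name[16];
        int i, fd;

        for (i = 0; i < NFILES; i++) {
                snprintf(name, sizeof(name), "%08X", i);
                fd = open(name, O_CREAT | O_WRONLY, 0644);
                if (fd < 0) {
                        perror(name);
                        exit(1);
                }
                close(fd);
        }

        for (i = 0; i < NFILES; i++) {
                snprintf(name, sizeof(name), "%08X", i);
                if (unlink(name) < 0)
                        perror(name);
        }
        return 0;
}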

>                  real       user         sys
> reiserfs:
> Creating:   3m13.208s   0m4.412s   2m54.404s
> Deleting:   4m41.250s   0m4.206s   4m17.926s
>
> Ext3:
> Creating:    4m9.331s   0m3.927s   2m21.757s
> Deleting:   9m14.838s   0m3.446s   1m39.508s
>
> htree improved this a lot, but it's still beaten by reiserfs. Seems odd
> to me - deleting taking twice as long as creating ...

This is a known issue with the current htree code (not the algorithm
or the on-disk format, luckily). The problem is that inodes are being
allocated essentially sequentially on disk. If you are deleting in
creation order (as you are) then you are randomly dirtying directory
leaf blocks, and if you are deleting in readdir() order, then you are
randomly dirtying inode blocks.

As a result, if the size of the directory + inode table blocks is larger
than memory, and also larger than 1/4 of the journal, you are essentially
seek-bound because of random block dirtying.
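
As a toy illustration (the hash function and the leaf-block count below
are made up, they are not the real ext3 directory hash or geometry):
unlinking in creation order means consecutive unlinks almost always land
in a different hash-ordered leaf block than the previous one:

/* Toy model only: names are placed into htree leaf blocks by hash,
 * but we unlink in creation (sequential-name) order, so consecutive
 * unlinks keep jumping between leaf blocks and dirtying new ones. */
#include <stdio.h>

#define NFILES      300000
#define LEAF_BLOCKS 512        /* assumed directory size in leaf blocks */

/* stand-in hash, not the real ext3 directory hash */
static unsigned toy_hash(const char *s)
{
        unsigned h = 5381;

        while (*s)
                h = h * 33 + (unsigned char)*s++;
        return h;
}

int main(void)
{
        char name[16];
        unsigned prev = ~0u, leaf;
        long switches = 0;
        int i;

        for (i = 0; i < NFILES; i++) {
                snprintf(name, sizeof(name), "%08X", i);
                leaf = toy_hash(name) % LEAF_BLOCKS;
                if (leaf != prev)
                        switches++;   /* touched a different leaf block */
                prev = leaf;
        }
        printf("%d unlinks switched leaf blocks %ld times\n",
               NFILES, switches);
        return 0;
}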

This can be fixed by changing the inode allocation routines to allocate
inodes in "chunks" which correspond to the leaf page for which the
dirent is being allocated. This will try to keep the inodes for a given
directory block relatively close together on disk and greatly improve
delete performance.
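
Conceptually something like the sketch below (this is not the actual
ext3 allocator interface; the function name and chunk size are invented
just to show the idea):

/* Conceptual sketch only -- NOT the real ext3 inode allocator.
 * Idea: derive the inode allocation goal from the directory leaf
 * block the new dirent lands in, so the inodes referenced from one
 * leaf block stay clustered in the inode table. */
#include <stdio.h>

#define INODES_PER_CHUNK 64UL    /* made-up clustering factor */

static unsigned long choose_inode_goal(unsigned long dir_first_ino,
                                       unsigned long leaf_block_index)
{
        /* every dirent in leaf block N aims at the same small
         * window of the inode table */
        return dir_first_ino + leaf_block_index * INODES_PER_CHUNK;
}

int main(void)
{
        unsigned long leaf;

        for (leaf = 0; leaf < 4; leaf++)
                printf("leaf block %lu -> inode goal %lu\n",
                       leaf, choose_inode_goal(1000, leaf));
        return 0;
}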

You should see what the size of the directory is at its peak (probably
16 bytes * 300k ~= 5MB), add in the size of the inode table blocks
(128 bytes * 300k ~= 38MB), and make the journal 4x as large as that,
so 192MB (mke2fs -j -J size=192), then re-run the test (I assume you
have at least 256MB of RAM on the test system).
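
Spelling out that arithmetic (16-byte dirents for 8-character names and
128-byte on-disk inodes are the usual defaults, so this is only
back-of-envelope):

/* back-of-envelope version of the journal sizing above */
#include <stdio.h>

int main(void)
{
        long nfiles   = 300000;
        long dir_size = nfiles * 16;    /* directory entries, ~5MB   */
        long itable   = nfiles * 128;   /* inode table blocks, ~38MB */
        long working  = dir_size + itable;

        printf("working set ~%ld MB, journal should be >= %ld MB (4x)\n",
               working / 1000000, 4 * working / 1000000);
        return 0;
}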

What is very interesting from the above results is that the CPU usage
is _much_ smaller for ext3+htree than for reiserfs. It looks like
reiserfs is nearly CPU-bound in these tests, so it is unlikely that it
can run much faster. In theory, ext3+htree could run at close to its
CPU time if we fixed the allocation and/or seeking issues.

Cheers, Andreas

--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
