Re: If not readdir() then what?

From: Neil Brown
Date: Wed Apr 11 2007 - 18:32:55 EST


On Wednesday April 11, tytso@xxxxxxx wrote:
> On Wed, Apr 11, 2007 at 07:15:57AM +1000, Neil Brown wrote:
> > > Neil, is this correct?
> >
> > It's just NFS. nfsd does open/getdents/close on every readdir
> > request.
>
> That's... unfortunate.

I like to think of it as "challenging" :-)

>
>
> So the problem with nfsd doing an open/getdents/close on every readdir
> request is two-fold. First of all, it means that you are incurring
> extra overhead by forcing ext3 to allocate, create, and free the
> red-black tree on every single NFS readdir request. Secondly, it
> increases the chances of problems with the hash collision logic.

I guess we need a two-fold solution then :-)

For the first:
You are storing an internal tree representation of part of the
directory attached to the 'struct file'.
Would it be possible to store this attached to the page via the
->private pointer? What would avoid the allocate/create/free
overhead on every request.

You suggest caching the open files in nfsd. While that is probably
possible (I've thought of it a number of times) is would also be
quite complex, e.g. requiring some sort of call-back to close all
those files when the filesystem is unexported. And it is very easy
to get caching heuristics wrong. Leveraging the page-cache which is
a very mature cache seems to make a lot of sense.

The cursor would still be in the 'struct file', but if the second
second fold works, that should be much less critical.

For the second.
You say that you " would need at least 96 bits in order to make that
guarantee; 64 bits of hash, plus a 32-bit count value in the hash
collision chain". I think 96 is a bit greedy. Surely 48 bits of
hash and 16 bits of collision-chain-position would plenty. You would
need 65537 entries before a collision was even possible, and
billions before it was at all likely. (How big does a set of 48bit
numbers have to get before the probability that "No subset of 65536
numbers are all the same" drops below 0.95?)

This would really require that the collision-chain-index was stable
across create/delete. Doing that while you have the tree in the
page cache is probably easy enough. Doing it across reboots is
probably not possible without on-disk changes.

So while my two-fold solution is not perfect(*), I think it is
substantially better than what we have now, and is extremely unlike to
ever show an observable problem.
(*)And perfection can only be achieved we two independent indexes.


This requires 64bit cookies which NFSv2 cannot handle. I think it is
OK to say "tough". We should probably make sure that READDIRv2 *always*
fails on a directory which is indexed, and possibly on a filesystem
which allows index directories. People can always disable indexing if
they want a v2 export.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/