Re: autofs vs. Sun automount -- new fs proposal

Alexander Viro (viro@math.psu.edu)
Thu, 17 Dec 1998 05:04:14 -0500 (EST)


On Thu, 17 Dec 1998, Peter Miller wrote:

[snip]
> By "API" I mean a calling interface. E.g. the clients of the VFS
> are littered with
>
> if (inode->i_op && inode->i_op->method)
> err = inode->i_op->method(inode, ...)
> else
> err = -ESOMETHING;
>
> and it would be tidier, and more maintainable, if they said
>
> err = vfs_method(inode, ...);
>
> instead. This also allows insulation from VFS changes, by having
> the tests for numerous alternative i_op functions be elegantly
> concealed within the vfs_* wrapper, most modern alternative first.

;-> See my posting in the same thread re the clever bypass.
And look at vfs_{rmdir,rename,unlink} in fs/namei.c. It's a bit different,
since I apply permission checks there.
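As a strawman, a wrapper of this sort just centralizes the NULL checks and the errno choice. A user-space sketch (the struct layouts, names and the -EPERM choice are placeholders for illustration, not the actual 2.1 kernel definitions):

```c
#include <errno.h>
#include <stddef.h>

struct inode;

/* Cut-down method table - the real i_op has many more members. */
struct inode_operations {
        int (*unlink)(struct inode *dir, const char *name);
};

struct inode {
        struct inode_operations *i_op;
};

/* The wrapper hides the "if (i_op && i_op->method)" dance that is
 * otherwise repeated at every call site, and picks one consistent
 * error value for the missing-method case. */
static int vfs_unlink_sketch(struct inode *dir, const char *name)
{
        if (!dir->i_op || !dir->i_op->unlink)
                return -EPERM;
        return dir->i_op->unlink(dir, name);
}

/* Toy filesystem that does implement unlink. */
static int toy_unlink(struct inode *dir, const char *name)
{
        (void)dir; (void)name;
        return 0;
}

static struct inode_operations toy_ops = { .unlink = toy_unlink };
```

Callers then say `err = vfs_unlink_sketch(dir, name);` and never see the method table at all, which is also where per-call permission checks can live.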

> I have created a set of such wrapper functions, each called
> vfs_<method>, one for each i_op, s_op, f_op member. Where locks
I'm not sure that _all_ of them need it.
> need to be taken around the method invocations, I do that inside
No. That way madness lies. Locking bloody well belongs with the namei
manipulations.
> the wrappers. Ditto defaults and errors. BTW I notice that the
> -E<SOMETHING> values are sometimes inconsistent.

Oh, yes. And if you look at the stuff in the filesystems it gets
even worse. Again, look at the changes in -ac (and, I hope, in -pre1 too
- I haven't looked at it yet). That's one of the reasons why I'm saying
that we should fix at least some stuff in the filesystems first.

> And, yes, these are prime candidates for inline functions, for
> obvious performance reasons. While I'm debugging, however, they
> are also very useful places to print out what the hell is going on.

Hmm... Take a look at fs/nfsd/vfs.c. I don't think that inlining
this stuff would be a good idea.

> > is this going
> > to be done in a stackable way?
>
> The work I'm doing could be loosely called "filesystem filters".
> By running file system accesses through a proxy, the proxy can
> "filter" the information. E.g. make a filesystem's filenames look
> uppercase, but make filename lookups case insensitive.

And that's why we need true featherweight layers, not just nullfs.
You'll get a heck of a lot of overhead this way.
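For instance, the case-insensitive filter mentioned above reduces, in the cheapest case, to a lookup hook that folds the name before handing the request to the layer below. A user-space sketch with hypothetical names (a real layer would also have to deal with the dcache):

```c
#include <ctype.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

typedef int (*lookup_fn)(const char *name);

/* Fold a name to lower case into a caller-supplied buffer. */
static void fold_name(const char *in, char *out, size_t outsz)
{
        size_t i;
        for (i = 0; in[i] && i + 1 < outsz; i++)
                out[i] = (char)tolower((unsigned char)in[i]);
        out[i] = '\0';
}

/* The filter's lookup: fold the name, then pass the request through
 * to the underlying filesystem unchanged. */
static int filter_lookup(lookup_fn lower, const char *name)
{
        char folded[256];
        fold_name(name, folded, sizeof(folded));
        return lower(folded);
}

/* Toy lower filesystem: knows only the lower-case name "readme". */
static int toy_lower_lookup(const char *name)
{
        return strcmp(name, "readme") == 0 ? 0 : -ENOENT;
}
```

Everything except the fold is pure pass-through - which is exactly why per-method proxy stubs are overhead and a generic bypass matters.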

> The idea is to mount such a filter "over the top" of an existing
> file system. (The kernel doesn't much care for this at present,
> for now the user sees filtering "sideways" rather than "vertically".)
>
> So, now you see why the "null proxy" is the first step. If I can
> proxy and do nothing, then I can take that and proxy and do something.
>
> (Someone mentioned ROFS... Read-only comes (almost) for free from the mount
> semantics of the kernel, other permissions are handled through the
> vfs_permission wrapper.)
>
> All this was to address your "stackable" query. Rather than do
> the stacking by adding extra kernel machinery, just allow mount to
> keep mounting more and more filters over the same point. (Yes,
> /sbin/mount and /sbin/umount need to co-operate.) Consider:
> a simple "pair-wise" stack proxy: if you don't find it (an inode)
> in the other place, use the one under the mount point. Now, to
> get multiple levels of stacking, mount multiple "pair-wise" stacks
> over the one spot. (There is an "in-front-of" pair and an "in-back-of"
> pair, if you got this far.) Talking to the BSD union author, this
> is how BSD does it.

And at least in 3.0-current (FreeBSD, that is) pieces of unionfs
are still kludged into the vfs_lookup.c code and don't work 100% right.
BTDT. Tracing the deadlocks wasn't fun. It should be done in a cleaner
way. Don't forget, 4.4BSD doesn't implement the full semantics proposed in
the original work - they took a subset (the original considered user-space
and transport layers... ;-<) We can pick a better one.
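The pair-wise fallback Peter describes ("if you don't find it in the other place, use the one under the mount point") is just two-level short-circuiting on -ENOENT; stacking N layers is then N such pairs mounted over the same point. A hypothetical user-space sketch:

```c
#include <errno.h>
#include <string.h>

typedef int (*lookup_fn)(const char *name);

/* Pair-wise stack: try the front layer; fall back to the layer
 * underneath only on -ENOENT.  A hit, or any other error, stops
 * the walk at the front. */
static int pair_lookup(lookup_fn front, lookup_fn back, const char *name)
{
        int err = front(name);
        if (err != -ENOENT)
                return err;
        return back(name);
}

/* Two toy layers for demonstration. */
static int front_fs(const char *name)
{
        return strcmp(name, "new") == 0 ? 0 : -ENOENT;
}

static int back_fs(const char *name)
{
        return strcmp(name, "old") == 0 ? 0 : -ENOENT;
}
```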

> (Yup, you probably spotted a *huge* problem with inode numbers this
> time. Just ignore it - few apps care. The nasty part is how to
> get getdents to handle duplicates elegantly - BSD gives the problem
> to libc - i.e. don't solve it in the kernel.)
>
> (BTW: it would be nice if there was a FS flag to say that the first
> arg needs to be a directory, like there is already a flag saying
> it needs to be a device. Another one for "leave me alone, I'll
> take care of it, just give me the string" would be nice, too.)
>
> Alexander Viro writes...
> > On Wed, 16 Dec 1998, Peter Miller wrote:
> > > I have a mostly-working 2.1 lofs.
> > Where?
>
> I haven't released anything, yet. I'll need a couple of days to
> give it a Makefile. (I'm using Aegis and Cook, not CVS and gmake,
> 'cause I like 'em better.)
>
> Folks interested can drop me a line, and I'll let you know when
> I've uploaded a copy for you to look at.

Consider me interested.

> > > It is a stepping stone to a bunch of other stuff I want to do.
>
> > ;-) Guess that we'd better sync our activities in that area.
>
> See the above description. Any intersection?

Lots of them.

> Other filters include recode on filenames, recode on text data (two
> filters, stack 'em if you want both), dos-izing and un-dos-izing
> text files, ... The list goes on. But that's all window dressing;
> my big interest is hierarchical storage management.
[snip the list]
> * updating to the latest Kernel src, and cope with latest vfs changes.
>
> * tracking down the locking requirement(s) for each of the vfs_<method>
> wrappers, so that the wrapper does it correctly.

It doesn't belong there. Look: we can create a structure a la
nameidata keeping the state of a lookup/modify_namespace request. We can
add a flag to the dentry. Instead of the current positive/negative
separation we'll get:
looking up (just d_alloc'ed)
positive, silent
negative, silent
negative, trying to become positive
positive, trying to become negative
source of rename in progress
Now, each request consists of two parts - lookup(s) from known
dentry(ies) and the actual modification. Let the request leave a trace
behind itself while it's in the lookup phase, and purge it as soon as it
gets to the modification phase. The trace can be kept in a page associated
with the request - put an array of list_head's there and add a cyclic list
to the dentry (running through the pages of the requests that have been
there).
Now the interesting part:
a) lookups stop if they come to an in-processing dentry (and wait
there). If they have the right to return -ENOENT (negative trying to
become positive), they do so.
b) if we are ready to begin the modification phase, we check the
traces of the involved dentries. If there are requests older than ours, we
sleep. Otherwise we roll back all younger lookups (back to the
corresponding dentry), mark the dentries, wipe our own trace and go.
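In toy form, the state split and the two rules might look like this (a hypothetical sketch; the real flag would live in struct dentry and the "age" would come from the request trace):

```c
#include <stdbool.h>

/* Proposed dentry states, replacing the plain positive/negative split. */
enum dentry_state {
        D_LOOKING_UP,           /* just d_alloc'ed */
        D_POSITIVE,             /* positive, silent */
        D_NEGATIVE,             /* negative, silent */
        D_NEG_TO_POS,           /* negative, trying to become positive */
        D_POS_TO_NEG,           /* positive, trying to become negative */
        D_RENAME_SOURCE         /* source of rename in progress */
};

/* Rule (a): a lookup that hits an in-processing dentry waits, unless
 * the dentry is negative-trying-to-become-positive, in which case the
 * lookup has the right to return -ENOENT immediately. */
static bool lookup_may_return_enoent(enum dentry_state s)
{
        return s == D_NEG_TO_POS;
}

/* Rule (b): before entering the modification phase, compare request
 * ages on the involved dentries; a lower sequence number means an
 * older request, and older pending requests win. */
static bool must_sleep(unsigned long my_seq, unsigned long oldest_pending_seq)
{
        return oldest_pending_seq < my_seq;
}
```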

This way we know that at the moment an operation begins, all
lookup paths are clean - none of the operations in progress can affect
them. Moreover, at any moment there is at most one operation in progress
affecting a given dentry. In some filesystems that may mean we can do
several requests on the same directory in parallel, but that's another
story. Now the horrible locking scheme of rename() goes away - we don't
have to keep a per-fs lock on renames. We can easily check whether a
rename is allowed - if there are no sources of other renames on the path
from the destination to the nearest common ancestor with the source, we
are fine (sorry, it sounds horrible, but draw a picture and you'll see
what I mean). Many other races will go away - look through the minix or
ext2 code and you'll see (especially in the case of minix).

I think that the right code path looks like this:
somebody (usually a syscall) creates a request -> the lookup/namei code
does its thing, including locking -> it calls the corresponding method
wrapper -> that passes the request to the right layer on the stack (a good
bypass routine is a MUST here; nullfs_bypass isn't nice) -> the fs method
gets called, possibly perusing the underlying layers in the stack.

Notice that this scheme will work quite nicely, with stackable
layers or without - it doesn't care.
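One way to read "good bypass routine" is a single generic forwarder that walks down the stack until some layer actually implements the method, instead of one stub per method per layer. A user-space sketch with hypothetical names:

```c
#include <errno.h>
#include <stddef.h>

struct layer;

/* One method table per layer; a real VFS would have many more slots. */
struct layer_ops {
        int (*getattr)(struct layer *l);
};

struct layer {
        const struct layer_ops *ops;
        struct layer *lower;            /* next layer down the stack */
};

/* Generic bypass: walk down until some layer implements the method,
 * then call it there.  A layer that doesn't care about getattr simply
 * leaves the slot (or the whole table) NULL - no per-method stubs. */
static int bypass_getattr(struct layer *l)
{
        while (l && (!l->ops || !l->ops->getattr))
                l = l->lower;
        return l ? l->ops->getattr(l) : -ENOSYS;
}

/* Toy bottom layer that does implement getattr. */
static int bottom_getattr(struct layer *l)
{
        (void)l;
        return 42;
}

static const struct layer_ops bottom_ops = { .getattr = bottom_getattr };
```

A transparent filter is then just `{ NULL, &bottom }` - zero code per passed-through method, which is the "featherweight" property.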

> * I seem to have a couple of reference counts wrong, and I need to
> track down each case, so that the wrapper does it correctly.
>
> The last two usually require laboriously wading through fs/*.c
> files to figure out their expectations of what the filesystem vfs
> will do. It's not in Richard's VFS write-up, it's not in the books,
> it's not in Documentation/fs, "use the source Luke".

;-) vi has ctags and you have grep and vgrep. And yes, it's a "Use
the Source" thing.

> I'd also like to be able to produce a document rather like Richard's
> VFS write-up, just by extracting the comments from each of my vfs_*.c
> files. Only with locking and reference counting and errors and
> defaults and a snoot-load of other stuff explained in (even more)
> gory detail.
Would be nice. I can comment on the namei and locking stuff, *except*
the FAT/MSDOS/VFAT/UMSDOS madness. It hurts. It really, really hurts. Bug
of the day: on a VFAT filesystem say

mkdir foo.long
cd foo~1.lon
mkdir bar
mv ../foo.long bar/foo.long

... and watch the effect. *hard* *links* *on* *directories* *suck*, even
if they are just "aliases". Sheesh... Affs is another offender - it has
outright hardlinks on f*cking everything, directories included. Could some
kind soul tell me WTF AmigaOS does with those beasts? Are they really
used?

I'll probably put my code on anon ftp tomorrow - it has grown too
large to be posted on l-k. OTOH it depends on what went into -pre1...

Cheers,
Al

-- 
There are no "civil aviation for dummies" books out there and most of
you would probably be scared and spend a lot of your time looking up
if there was one. :-)			  Jordan Hubbard in c.u.b.f.m

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/