Re: autofs vs. Sun automount -- new fs proposal

Peter Miller (peterm@jna.com.au)
Thu, 17 Dec 1998 12:16:56 +0100


Richard Gooch writes...
> Peter Miller writes:
> > You have to indirect the whole tree, not just the top-level directory,
> > otherwise 'pwd' goes bozo - e.g. loopback a read-only version of
> > /usr into a chrooted environment.
>
> Can you explain this a bit more? Do you mean that the whole mounted
> lofs has to be indirected, or the entire filespace does?

Basically, my implementation is a "proxy" layer. Each inode contains
a reference to the "underlying" inode. Each VFS method is proxied
(in the same sense as a firewall proxy) into the "original"
filesystem. This may be the "brute force" approach alluded
to earlier, but I'm not certain.

This is the way the BSD "null" file system works (having talked to
the author). It does not impose a significant penalty, as there
is almost nothing to do! But you have to do it often, once for
each method, slightly differently each time.
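To make the shape of the proxy concrete, here is a minimal userspace
sketch of the idea. The names (file_ops, proxy, proxy_read) are
invented for illustration; the real thing proxies the kernel's
inode_operations and friends, but the "do almost nothing, just
forward" pattern is the same:

```c
#include <assert.h>
#include <stddef.h>

/* Invented stand-in for a method table like inode_operations. */
struct file_ops {
        int (*read)(void *obj, char *buf, int len);
};

/* The proxy object: its own method table, plus a reference to the
 * "underlying" object and its method table. */
struct proxy {
        struct file_ops *ops;
        struct file_ops *lower;
        void *lower_obj;
};

/* Each proxy method has almost nothing to do: it forwards the call
 * into the "original" filesystem, like a firewall proxy. */
static int proxy_read(void *obj, char *buf, int len)
{
        struct proxy *p = obj;
        return p->lower->read(p->lower_obj, buf, len);
}

static struct file_ops proxy_file_ops = { proxy_read };

/* A trivial "underlying" filesystem for demonstration. */
static int lower_read(void *obj, char *buf, int len)
{
        int i;
        (void)obj;
        for (i = 0; i < len; i++)
                buf[i] = 'x';
        return len;
}

static struct file_ops lower_file_ops = { lower_read };
```

The tedium the text describes is that this forwarding stub has to be
written once per method, each slightly different.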

> > And, yes, I agree that VFS changes would help (a *lot*). I've
> > written a bunch of vfs_* functions, to actually provide a consistent
> > (well, more consistent) and usable VFS API, capturing many of the
> > things that must happen before and after each of the VFS "methods"
> > are invoked. Is this what you had in mind, Richard?

> No, I haven't thought about it to that level. So far I've just been
> having random thoughts and musing on the list.
> Are you really just implementing "after" methods?

By "API" I mean a calling interface. E.g. the clients of the VFS
are littered with

    if (inode->i_op && inode->i_op->method)
            err = inode->i_op->method(inode, ...);
    else
            err = -ESOMETHING;

and it would be tidier, and more maintainable, if they said

err = vfs_method(inode, ...);

instead. This also allows insulation from VFS changes, by having
the tests for numerous alternative i_op functions be elegantly
concealed within the vfs_* wrapper, most modern alternative first.

I have created a set of such wrapper functions, each called
vfs_<method>, one for each i_op, s_op, f_op member. Where locks
need to be taken around the method invocations, I do that inside
the wrappers. Ditto defaults and errors. BTW I notice that the
-E<SOMETHING> values are sometimes inconsistent.

And, yes, these are prime candidates for inline functions, for
obvious performance reasons. While I'm debugging, however, they
are also very useful places to print out what the hell is going on.

> is this going
> to be done in a stackable way?

The work I'm doing could be loosely called "filesystem filters".
By running file system accesses through a proxy, the proxy can
"filter" the information. E.g. make a filesystem's filenames look
uppercase, but make filename lookups case insensitive.
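The uppercase filter can be sketched in a few lines of userspace C.
Both helpers here (filter_show_name, filter_name_matches) are invented
for the example; in the real filter this logic would live in the
proxied lookup and readdir methods:

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Present the underlying filesystem's name in upper case. */
static void filter_show_name(const char *lower_name, char *out, size_t n)
{
        size_t i;
        for (i = 0; i + 1 < n && lower_name[i]; i++)
                out[i] = (char)toupper((unsigned char)lower_name[i]);
        out[i] = '\0';
}

/* Match a lookup against the underlying name case-insensitively. */
static int filter_name_matches(const char *wanted, const char *actual)
{
        for (; *wanted && *actual; wanted++, actual++)
                if (toupper((unsigned char)*wanted) !=
                    toupper((unsigned char)*actual))
                        return 0;
        return *wanted == *actual;  /* both at NUL => same length */
}
```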

The idea is to mount such a filter "over the top" of an existing
file system. (The kernel doesn't much care for this at present;
for now the user sees filtering "sideways" rather than "vertically".)

So, now you see why the "null proxy" is the first step. If I can
proxy and do nothing, then I can take that and proxy and do something.

(Someone mentioned ROFS... Read-only comes (almost) for free from the mount
semantics of the kernel, other permissions are handled through the
vfs_permission wrapper.)

All this was to address your "stackable" query. Rather than do
the stacking by adding extra kernel machinery, just allow mount to
keep mounting more and more filters over the same point. (Yes,
/sbin/mount and /sbin/umount need to co-operate.) Consider
a simple "pair-wise" stack proxy: if you don't find it (an inode)
in the other place, use the one under the mount point. Now, to
get multiple levels of stacking, mount multiple "pair-wise" stacks
over the one spot. (There is an "in-front-of" pair and an "in-back-of"
pair, if you got this far.) Talking to the BSD union author, this
is how BSD does it.
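The pair-wise lookup itself is tiny. A userspace sketch, with
invented types (layer, stack_lookup) standing in for the two mounted
filesystems:

```c
#include <assert.h>
#include <string.h>

/* One layer of the pair: a lookup that returns an inode number,
 * or -1 if the name isn't there.  Invented for illustration. */
struct layer {
        int (*lookup)(const char *name);
};

/* The whole pair-wise stack policy: try the "other" (front) place;
 * if you don't find it there, use the one under the mount point. */
static int stack_lookup(struct layer *front, struct layer *back,
                        const char *name)
{
        int ino = front->lookup(name);
        return ino >= 0 ? ino : back->lookup(name);
}

/* Two toy layers for demonstration. */
static int front_lookup(const char *name)
{
        return strcmp(name, "new.txt") == 0 ? 10 : -1;
}

static int back_lookup(const char *name)
{
        return strcmp(name, "old.txt") == 0 ? 20 : -1;
}
```

Multiple levels of stacking then come from mounting several of these
pairs over the one spot, with no extra kernel machinery.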

(Yup, you probably spotted a *huge* problem with inode numbers this
time. Just ignore it - few apps care. The nasty part is how to
get getdents to handle duplicates elegantly - BSD gives the problem
to libc - i.e. don't solve it in the kernel.)
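The libc-side fix for duplicate directory entries amounts to keeping
the first ("in-front-of") occurrence of each name and dropping later
ones. A sketch, with dedup_names invented for the example (it is not
a real libc interface):

```c
#include <assert.h>
#include <string.h>

/* Given the raw entry names from both layers of a stacked mount,
 * keep the first occurrence of each name, drop duplicates.
 * Returns the number of names kept.  O(n^2), fine for a sketch. */
static int dedup_names(const char **names, int n, const char **out)
{
        int kept = 0;
        int i, j;

        for (i = 0; i < n; i++) {
                int dup = 0;
                for (j = 0; j < kept; j++) {
                        if (strcmp(names[i], out[j]) == 0) {
                                dup = 1;
                                break;
                        }
                }
                if (!dup)
                        out[kept++] = names[i];
        }
        return kept;
}
```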

(BTW: it would be nice if there was a FS flag to say that the first
arg needs to be a directory, like there is already a flag saying
it needs to be a device. Another one for "leave me alone, I'll
take care of it, just give me the string" would be nice, too.)

Alexander Viro writes...
> On Wed, 16 Dec 1998, Peter Miller wrote:
> > I have a mostly-working 2.1 lofs.
> Where?

I haven't released anything, yet. I'll need a couple of days to
give it a Makefile. (I'm using Aegis and Cook, not CVS and gmake,
'cause I like 'em better.)

Folks interested can drop me a line, and I'll let you know when
I've uploaded a copy for you to look at.

> > It is a stepping stone to a bunch of other stuff I want to do.

> ;-) Guess that we'ld better sync our activities in that area.

See the above description. Any intersection?

Other filters include recode on filenames, recode on text data (two
filters, stack 'em if you want both), dos-izing and un-dos-izing
text files, ... The list goes on. But that's all window dressing;
my big interest is hierarchical storage management.

* I want a "monitor" proxy - tell me what is happening in a file
system by squirting inode activity notifications down a pipe/socket/fd
of some sort. This way, I can know active files from inactive
ones, and be up-to-the-second accurate, without ever scanning the
filesystem. (HSM is meant to be BIG and scanning doesn't scale.)
Inactive files get rolled out to slower media.

BUT you need to know what changed *early* so you can roll it out
immediately, so when you need to roll something in, you just
delete "old stuff", no need to roll "old stuff" out first and thrash.
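The notification path of the monitor proxy is simple in outline: a
proxied write method squirts the inode number down a pipe for the
user-mode HSM daemon. A userspace sketch (notify_fd and
monitor_notify are invented names for this example):

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Where the user-mode daemon is listening; -1 means no listener. */
static int notify_fd = -1;

/* Called from the proxied methods whenever an inode is touched:
 * squirt the inode number down the pipe/socket/fd.  A real version
 * would have to cope with short writes and a full pipe. */
static void monitor_notify(uint64_t ino)
{
        if (notify_fd >= 0)
                (void)write(notify_fd, &ino, sizeof ino);
}
```

This is what makes up-to-the-second accuracy possible without ever
scanning the filesystem.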

That's all future stuff. But there is an immediate application:
how would you like to be able to have a running backup in real time?
Monitor proxy feeds inode list to user-mode program, which
pipes a "modified files" list into, say, cpio -o. (OK, it's not
that simple, but you get the idea.)

* I want a generic "cache" proxy. The "other" place can be any
other file system - CD-Rom, NFS, automounted, etc, etc. Use the
monitor proxy underneath to feed a user-mode cache clearer.

Immediate benefit - I can access multiple CD-Roms at the same
time, if only I had a nice volume manager. (Cache over the whole
vold mount tree.) (By nice I mean like Solaris only better.)

If it works for CD-Roms, it can work for Zip or Jaz or Syquest
or SuperDrives or just plain floppies. (Yeah, I know about
blocking and non-blocking cache write issues. Later.)

This is the second level of hierarchy.

* I'd like an append-blocks-only-but-looks-read-write file system for
tapes (this isn't a proxy) like you can get (could get?) for WORM
drives. It'll go like a dog, which is why I need the cache first.

This could be an append-blocks-only-but-looks-read-write proxy
device driver with some sort of file system mounted on
it. (You didn't think I'd let an opportunity to say "proxy" get
away, did you?) (The file system used needs to transparently
cope with elastic underlying media.)

This is the third level of hierarchy.

* A snoot-load of other machinery, as described in the MSSRM, to
make it all happen and be monitorable and configurable and manageable.
(Yeah, I know, the MSS working group did a major back-flip last
year, but I think their new stuff (a) stinks and (b) is a commercial
win for a proprietary vendor.)

Derrick J Brashear writes
> On Wed, 16 Dec 1998, Peter Miller wrote:
> > I have a mostly-working 2.1 lofs.
>
> What's left to do in it? I might have some spare time by the end of the
> week if I get somewhere with sparc audio crap

lofs...

* updating to the latest Kernel src, and cope with latest vfs changes.

* tracking down the locking requirement(s) for each of the vfs_<method>
wrappers, so that each wrapper does it correctly.

* I seem to have a couple of reference counts wrong, and I need to
track down each case, so that the wrapper does it correctly.

The last two usually require laboriously wading through fs/*.c
files to figure out their expectations of what the filesystem vfs
will do. It's not in Richard's VFS write-up, it's not in the books,
it's not in Documentation/fs, "use the source Luke".

I'd also like to be able to produce a document rather like Richard's
VFS write-up, just by extracting the comments from each of my vfs_*.c
files. Only with locking and reference counting and errors and
defaults and a snoot-load of other stuff explained in (even more)
gory detail.

Regards
Peter Miller E-Mail: millerp@canb.auug.org.au
/\/\* WWW: http://www.canb.auug.org.au/~millerp/
Disclaimer: The opinions expressed here are personal and do not necessarily
reflect the opinion of my employer or the opinions of my colleagues.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/