Re: Detecting changed files for JIT/metadata cacheing

Rich Paul (rich-paul@rich-paul.net)
Fri, 25 Jun 1999 19:19:08 -0400


Jamie Lokier wrote:

Sounds rather like some things that I've been interested in. Especially
the fact that changes would flow back up to /, so that you could check
if any file on the system had changed by stating one directory. Not too
useful in and of itself, but imagine a make system that could skip a
whole directory tree because it KNOWS that nothing there has changed.

There are a couple of things to think about:

1) How many bits do you want for this thing. /, /usr, and /home might
fly by faster than the eye can see.

2) If a program has a file open, is this number updated with each write
( which would really be each flush, close enough ) or only when the file
is closed?

3) Does a serial number or a date make more sense? A date would have
limited resolution ( maybe 1 second or 1/100 second, or 1/1000 second,
with a 64 bit field there would be PLEANTY or room ) a serial number
would be more atomic ... but a make system would have to know the change
count of all the files that went into the file that it's trying to
build.

4) What about publish/subscribe: Imagine this: program creates a
named
pipe or unix socket, and says "Any time somebody writes to this file, or
this directory, or this directory and any subdirectory, write a couple
of bytes to this file ... possibly an integer handle that maps back to
my internal map of files. This would be quite useful in something like
a file manager ... somebody creates a new file and it appears
immediately
in the browser, without the browser polling.

For that matter, it might be interesting to watch a file whose purpose
you don't know and see if anybody opens it to read ... I've had things
in my /usr/lib/*/*/* area that I had no way of judging to be useful or
useless.

Perhaps rather than a named pipe or unix socket, it would be better for
the kernel to create and open a file in the /dev or /proc areas, so
there was a way to check who was watching what ...

And one would have to think about the semantics ... are we watching the
inode, or watching the filename ... or both. Watching only the filename
would require watching the directory as well, in case of
links.

6) Imagine that you have the following
/home/rich/file1
/home/rich/file2
/home/rich/link1

link1 is a hard link to file1. Then, somebody makes it a hard link to
file2. They ( by chance ) have the same count ... In order to guarantee
correctness in your program, you'd have to keep track of both inode
number and count.

7) One more random thought ... wouldn't it be interesting if one could
set a property on a file, and the kernel would then create AND MAINTAIN
a md5 hash of that file ... so that you could do a system
call const char*md5stat(const char *), and get back a checksum, and
whenever the file has data written to it, the hacksum would be changed.
It would be slower than an integer counter, but if you copied a file
from one place to another, the checksum would be the same in the new
place. It would also slow down writing to a file, since the md5sum
would have to be updated with each flush ( that's fflush() not sync() ).

I don't read this list that much, just try to keep track of broad
strokes and interesting ideas, so keep me posted!
> What I want to achieve
> ----------------------
>
> I'd like to be able to know if a file has changed since the last time I
> cached information about it, efficiently.
>
> Why do I want this? To transparently cache results of operating on
> files -- and still provide guarantees of correctness. In my case I want
> to cache JIT-compiled code transparently.
>
> There are other uses.
>
> - Stuff like cacheing extracted metadata (to bring in a topical idea :-)
> - Cached checksums -- for security checks.
>
> Problem with existing mechanism
> -------------------------------
>
> I could just look at mtime & size but (a) mtime can be changed back (and
> often is) and (b) it doesn't have infinite resolution. Thus no
> guarantee of a correctness.
>
> I could do a checksum but that's too slow. I'm considering JIT-style
> transformation of big executables like Emacs & Netscape. Demand-page
> checksumming is possible (for my particular app), but still rather
> wasteful in time and memory.
>
> Proposal
> --------
>
> So I'd like something different from mtime: a modification version
> number that could not be reset, and is incremented whenever a modified
> file's version is read (which resets the modification flag). (This is
> to avoid incrementing on every modification which would overflow too
> quickly).
>
> And I'd like that to automatically flag parents modified. In normal
> use, where you don't look at the versions, this has no overhead. This
> is for a totally different application: speeding up large recursive
> directory searches for newly changed files.
>
> Implementing
> ------------
>
> Ext2 already has a version field per inode, and room for a "modified"
> flag. So it'd be a compatible extension and even the API already
> exists. (Unless the version field is used in some way I don't
> understand).
>
> The version field can be set which IMO rather defeats the point; perhaps
> that could be removed? I'd be on for allowing the version to be set
> _once_ for a newly created file (ctime can be checked) -- or just
> keeping it if we decide that you can always modify the filesystem using
> debugfs anyway.
>
> You may guess I don't care about other filesystems -- fall back to
> slow checksumming would be fine ;-)
>
> I'm happy to implement -- any interesting comments first?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/