Re: [PATCH] AFS filesystem for Linux (2/2)

From: David Howells (dhowells@cambridge.redhat.com)
Date: Fri Oct 04 2002 - 10:34:32 EST


Hi Jan,

> We don't. Coda has an all-or-nothing fetch. When we get an open upcall and
> the file isn't cached, we fetch the whole file and return the filehandle we
> just used to fetch it; that way, even when pages haven't been flushed to
> disk yet, the kernel will see the same data. All reads and writes are
> wrapped in such a way that readpage and writepage directly access the cache
> file; when an mmap is active, the Coda inode isn't even touched anymore.
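
If I follow you, that wrapping amounts to something like the sketch below.
The struct and field names (coda_file_info, cfi_container) are my guesses at
how the container file gets carried around, not necessarily the real Coda
code:

	/* Sketch only: forward a read on the Coda file to the local
	 * container file fetched by the open upcall.
	 */
	static ssize_t coda_file_read(struct file *file, char *buf,
				      size_t count, loff_t *ppos)
	{
		struct coda_file_info *cfi = file->private_data;
		struct file *container = cfi->cfi_container;

		/* All I/O hits the cache file's page cache, so data not
		 * yet flushed to disk is still seen consistently. */
		return container->f_op->read(container, buf, count, ppos);
	}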

So, say you've got a really big file in your cache. Someone changes a single
byte in it. You can work out that the file has changed by one of a number of
means. How do you avoid having to throw the entire file back to the server?

> Pretty much, but we need that extra space for the disconnected writes
> anyway; that way we can always roll back to a consistent version.

How do you sync up with the server on reconnection? This is similar to the
problem you pointed out that I have to solve - how to deal with the file on
the server being changed by another client whilst I'm also trying to write to
it.

> Ok, traditional AFS semantics is 'session semantics'. Very strict: whenever
> you read from a file, everything is consistent with respect to the time you
> opened the file. Whatever you write isn't committed on the server until you
> close the file. This model has great advantages, such as minimizing network
> traffic and giving lock-free read/write consistency.

Whatever model you choose, you have to accept some compromises.

The model I'm thinking of is as follows:

 (*) Data that the client doesn't have immediately to hand (i.e. is not yet
     cached) is fetched a chunk at a time upon request of the VM (thus
     allowing data readahead to be driven by the VM).

 (*) All data cached by a client for a particular file is zapped if I get a
     callback from the server and/or the data version number of that file
     appears to have changed.

 (*) In O_SYNC mode, data is written back to the server as promptly as
     possible within the write() call (maybe through the auspices of
     prepare_write and commit_write).

 (*) In non-O_SYNC mode, I would like the data to be written back through the
     page cache's writepage() routine(s). By setting the dirty bits on pages,
     the write will be scheduled by the VM at some point. This would permit
     better write coalescing locally. However, security becomes a problem,
     since I have to tell the server which user I'm doing a store as, and if
     the data is coalesced from writes done as several different users, then
     there could be a problem if the store is rejected.

     How does Coda deal with this security problem? A rough sketch of what I
     have in mind for the non-O_SYNC path follows.
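
In the sketch below, the afs_key type and the afs_file_key() and
afs_store_page() helpers are assumptions for illustration, not code from my
patch:

	/* Sketch: record which user's key dirtied a page, and only let
	 * pages carrying the same key be coalesced into one StoreData
	 * RPC.  If a second user writes to an already-dirty page, flush
	 * it under the first user's key before accepting the new data.
	 */
	static int afs_commit_write(struct file *file, struct page *page,
				    unsigned from, unsigned to)
	{
		struct afs_key *key = afs_file_key(file);	/* assumed */

		if (!page->private) {
			page->private = (unsigned long) key;	/* 1st writer */
		} else if ((struct afs_key *) page->private != key) {
			afs_store_page(page);			/* assumed */
			page->private = (unsigned long) key;
		}
		set_page_dirty(page);
		return 0;
	}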

I admit there are a number of problems with this model that might be
alleviated by better operations being available in the AFS spec (such as data
insertion without having to nominate a new EOF position, and data appending
without needing to know the old EOF position or specify a new one).
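
For concreteness, the kind of operations I mean might look like this; these
do NOT exist in the AFS protocol, they're purely hypothetical:

	/* Hypothetical RPCs, not part of any AFS spec:
	 *
	 *   InsertData(Fid, Pos, Length, Data)
	 *	- splice Data in at Pos, shifting the tail of the file;
	 *	  EOF moves by Length as a side effect
	 *
	 *   AppendData(Fid, Length, Data)
	 *	- append Data at whatever the current EOF happens to be,
	 *	  without the client knowing or setting it
	 */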

> Now people started throwing big databases in the filesystem, and the cache
> issues became important. So they introduced 'chunked access', dirty chunks
> are still written when the file is closed, but also when the cache is full
> (oops, lost write consistency).

Anyone using a network filesystem of any type to store a big database is
probably just asking for trouble. IMHO they're far better off talking to a
distributed DB through its own network access protocol. But that's beside the
point.

> > (4) "Diff" the page in the pagecache against a copy stored in the cache
> > and try to send the changes to the server.
>
> As far as I know this is impossible without changing the existing AFS
> servers.

It can be done entirely in the client, provided it has a copy of the
unmodified page still in its cache.
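
Something along these lines, say, where afs_store_range() stands in for
whatever actually issues the StoreData RPC (an assumed helper, not real
code):

	/* Sketch: compare the dirty page against the clean copy held in
	 * the cache and send only the changed byte range to the server.
	 */
	static void afs_store_diff(const char *clean, const char *dirty,
				   size_t size, loff_t pos)
	{
		size_t first = 0, last = size;

		while (first < size && clean[first] == dirty[first])
			first++;
		if (first == size)
			return;		/* nothing changed */
		while (last > first && clean[last - 1] == dirty[last - 1])
			last--;

		afs_store_range(dirty + first, last - first, pos + first);
	}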

> Doesn't AFS3 give callbacks when it updates a chunk of the file? I guess it
> still has retained at least that part of the original semantics, send
> callbacks when the file is closed (and the data is 'officially'
> committed). It is still up in the air what clients see that read the file
> between the chunked writes and the actual file close.

All it says is that a given file has changed. It doesn't provide a clue as to
where. Hence the entire file has to be flushed.

> Disconnected operation has never been 'AFS semantics'. That's a Coda thing.

I didn't say it was. It's explicitly denied in the AFS docs, though it's
discussed under the future developments section.

> A Coda cell is simply a FQDN; whenever the userspace cache manager accesses
> a new cell, it is given a locally unique ID, which will exist as long as
> there are objects from that cell in the cache.

The 32-bit ID still has to be mapped through a catalogue somewhere (perhaps a
text file that the cache manager reads at startup), and if the catalogue is
external to the cache, the catalogue may change what the ID corresponds to
without the cache being invalidated.
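
To make the failure mode concrete, suppose the catalogue is a text file of
"<id> <cellname>" lines (a made-up format, just for illustration):

	#include <stdio.h>
	#include <string.h>

	/* Look up the locally unique ID for a cell name in the catalogue.
	 * If someone edits this file while the cache itself survives, an
	 * existing ID can silently come to mean a different cell.
	 */
	static int cell_id_lookup(const char *catalogue, const char *cell)
	{
		char line[256], name[256];
		int id;
		FILE *f = fopen(catalogue, "r");

		if (!f)
			return -1;
		while (fgets(line, sizeof(line), f)) {
			if (sscanf(line, "%d %255s", &id, name) == 2 &&
			    strcmp(name, cell) == 0) {
				fclose(f);
				return id;
			}
		}
		fclose(f);
		return -1;	/* unknown cell */
	}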

> Why would it probably be swapped out to disc? If you're really worried about
> that you could mlock the memory. And if you think that is too expensive, it
> is still better to mlock memory in userspace than to allocate that same
> memory in kernel space.

I'm storing mine on disc, as normal disc-based filesystems do. That means it
can be a lot bigger. Besides, you don't really want to mlock memory or store
it in the kernel - that would be a big chunk of memory permanently committed
and unavailable for other uses.

> Yeah, that's why Coda is using a recoverable VM, basically an mmapped
> file with a log where modifications are recorded so that we can
> replay/rollback uncommitted operations when we're restarting.

What's "VM" in this context?

David