Re: Versioning file system

From: Jack Stone
Date: Tue Jun 19 2007 - 03:55:29 EST

Next message: Artem Bityutskiy: "Re: [PATCH/RFC] oops and panic message logging to MTD"
Previous message: Jiri Slaby: "Re: Question about a strange behavior of copy_to_user() in ioctlcall"
In reply to: Kyle Moffett: "Re: Versioning file system"
Next in thread: Bron Gondwana: "Re: Versioning file system"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Kyle Moffett wrote:
> On Jun 18, 2007, at 13:56:05, Bryan Henderson wrote:
>>> The question remains is where to implement versioning: directly in
>>> individual filesystems or in the vfs code so all filesystems can use it?
>>
>> Or not in the kernel at all. I've been doing versioning of the types
>> I described for years with user space code and I don't remember
>> feeling that I compromised in order not to involve the kernel.
>>
>> Of course, if you want to do it with snapshots and COW, you'll have to
>> ask where in the kernel to put that, but that's not a file versioning
>> question; it's the larger snapshot question.
>
> What I think would be particularly interesting in this domain is
> something similar in concept to GIT, except in a file-system:
> 1) Redundancy is easy, you just ensure that you have at least "N"
> distributed copies of each object, where "N" is some function of the
> object itself.
> 2) Network replication is easy, you look up objects based on the SHA-1
> stored in the parent directory entry and cache them where needed (IE:
> make the "N" function above dynamic based on frequency of access on a
> given computer).
> 3) Snapshots are easy and cheap; an RO snapshot is a tag and an RW
> snapshot is a branch. These can be easily converted between.
> 4) Compression is easy; you can compress objects based on any
> arbitrary configurable criteria and the filesystem will record whether
> or not an object is compressed. You can also compress differently when
> archiving objects to secondary storage.
> 5) Streaming fsck-like verification is easy; ensure the hash name
> field matches the actual hash of the object.
> 6) Fsck is easy since rollback is trivial, you can always revert to an
> older tree to boot and start up services before attempting resurrection
> of lost objects and trees in the background.
> 7) Multiple-drive or multiple-host storage pools are easy: Think the
> git "alternates" file.
> 8) Network filesystem load-balancing is easy; SHA-1s are essentially
> random so you can just assign SHA-1 prefixes to different systems for
> data storage and your data is automatically split up.
>
>
> Other issues:
>
> Q. How do you deal with block allocation?
> A. Same way other filesystems deal with block allocation.
> Reference-counting gets tricky, especially across a network, but it's
> easy to play it safe with simple cross-network refcount-journalling.
> Since the _only_ thing that needs journalling is the refcounts and
> block-free data, you need at most a megabyte or two of journal. If in
> doubt, it's easy to play it safe and keep an extra refcount around for
> an in-the-background consistency check later on. When networked-gitfs
> systems crash, you just assume they still have all the refcounts they
> had at the moment they died, and compare notes when they start back up
> again. If a node has a cached copy of data on its local disk then it
> can just nonatomically increment the refcount for that object in its own
> RAM (ordered with respect to disk-flushes, of course) and tell its peers
> at some point. A node should probably cache most of its working set on
> local disk for efficiency; it's trivially verified against updates from
> other nodes and provides an easy way to keep refcounts for such data.
> If a node increments the refcount on such data and dies before getting
> that info out to its peers, then when it starts up again its peers will
> just be told that it has a "new" node with insufficient replication and
> they will clone it out again properly. For networked
> refcount-increments you can do one of 2 things: (1) Tell at least X many
> peers and wait for them to sync the update out to disk, or (2) Get the
> object from any peer (at least one of whom hopefully has it in RAM) and
> save it to local disk with an increased refcount.
>
> Q. How do you actually delete things?
> A. Just replace all the to-be-erased tree and commit objects before a
> specified point with "History erased" objects with their SHA-1's
> magically set to that of the erased objects. If you want you may delete
> only the "tree" objects and leave the commits intact. If you delete a
> whole linear segment of history then you can just use a single "History
> erased" commit object with its parent pointed to the object before the
> erased segment. Probably needs some form of back-reference storage to
> make it efficient; not sure how expensive that would be. This would
> allow making a bunch of snapshots and purging them logarithmically based
> on passage of time. For instance, you might have snapshots of every 5
> minutes for the last hour, every 30 minutes for the last day, every 4
> hours for the last week, every day for the last month, once per week for
> the last year, once per month for the last 5 years, and once per year
> beyond that.
>
> That's pretty impressive data-recovery resolution, and it accounts for
> only 200 unique commits after it's been running for 10 years.
>
> Q. How do you archive data?
> A. Same as deleting, except instead of a "History erased" object you
> would use a "History archived" object with a little bit of string data
> to indicate which volume it's stored on (and where on the volume). When
> you stick that volume into the system you could easily tell the kernel
> to use it as an alternate for the given storage group.
>
> Q. What enforces data integrity?
> A. Ensure that a new tree object and its associated sub objects are on
> disk before you delete the old one. Doesn't need any actual full syncs
> at all, just barriers. If you replace the tree object before write-out
> is complete then just skip writing the old one and write the new one in
> its place.
>
> Q. What consists of a "commit"?
> A. Anything the administrator wants to define it as. Useful algorithms
> include: "Once per x Mbyte of page dirtying", "Once per 5 min", "Only
> when sync() or fsync() are called", "Only when gitfs-commit is called".
> You could even combine them: "Every x Mbyte of page dirtying or every 5
> minutes, whichever is shorter (or longer, depending on admin
> requirements)". There would also be appropriate syscalls to trigger
> appropriate git-like behavior. Network-accessible gitfs would want to
> have mechanisms to trigger commits based on activity on other systems
> (needs more thought).
>
> Q. How do you access old versions?
> A. Mount another instance of the filesystem with an SHA-1 ID, a
> tag-name, or a branch-name in a special mount option. Should be user
> accessible with some restrictions (needs more thought).
>
> Q. How do you deal with conflicts on networked filesystems.
> A. Once again, however the administrator wants to deal with them. Options:
> 1) Forcibly create a new branch for the conflicted tree.
> 2) Attempt to merge changes using the standard git-merge semantics
> 3) Merge independent changes to different files and pick one for
> changes to the same file
> 4) Your Algorithm Here(TM). GIT makes it easy to extend
> conflict-resolution.
>
> Q. How do you deal with little scattered changes in big (or sparse) files?
> A. Two questions, two answers: For sparse files, git would need
> extending to understand (and hash) the nature of the sparse-ness. For
> big files, you should be able to introduce a "compound-file" datatype
> and configure git to deal with specific X-Mbyte chunks of it
> independently. This might not be a bad idea for native git as well.
> Would need system-specific configuration.
>
> Q. How do you prevent massive data consumption by spurious tiny changes
> A. You have a few options:
> 1) Configure your commit algorithm as above to not commit so often
> 2) Configure a stepped commit-discard algorithm as described above
> in the "How do you delete things" question
> 3) Archive unused data to secondary storage more often
>
> Q. What about all the unanswered questions?
> A. These are all the ones I could think of off the top of my head but
> there are at least a hundred more. I'm pretty sure these are some of
> the most significant ones.
>
> Q. That's a great idea and I'll implement it right away!
> A. Yay! (but that's not a question :-D) Good luck and happy hacking.
>
> Q. That's a stupid idea and would never ever work!
> A. Thanks for your useful input! (but that's not a question either) I'm
> sure anybody who takes up a project like this will consider such opinions.
>
> Q. *flamage*
> A. I'm glad you have such strong opinions, feel free to to continue to
> spam my /dev/null device (and that's also not a question).
>
> All opinions and comments welcomed.
>
> Cheers,
> Kyle Moffett
>
>

It sounds brilliant and I'd love to have a got at implementing it but I
don't know enough (yet :-D) about how git works, a little research is
called for I think.

Jack
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Artem Bityutskiy: "Re: [PATCH/RFC] oops and panic message logging to MTD"
Previous message: Jiri Slaby: "Re: Question about a strange behavior of copy_to_user() in ioctlcall"
In reply to: Kyle Moffett: "Re: Versioning file system"
Next in thread: Bron Gondwana: "Re: Versioning file system"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]