Re: Distributed storage. Move away from char device ioctls.

From: Kyle Moffett
Date: Sun Sep 16 2007 - 03:09:03 EST


On Sep 15, 2007, at 13:24:46, Andreas Dilger wrote:
On Sep 15, 2007 16:29 +0400, Evgeniy Polyakov wrote:
Yes, block device itself is not able to scale well, but it is the place for redundancy, since filesystem will just fail if underlying device does not work correctly and FS actually does not know about where it should place redundancy bits - it might happen to be the same broken disk, so I created a low-level device which distribute requests itself.

I actually think there is a place for this - and improvements are definitely welcome. Even Lustre needs block-device level redundancy currently, though we will be working to make Lustre- level redundancy available in the future (the problem is WAY harder than it seems at first glance, if you allow writeback caches at the clients and servers).

I really think that to get proper non-block-device-level filesystem redundancy you need to base it on something similar to the GIT model. Data replication is done in specific-sized chunks indexed by SHA-1 sum and you actually have a sort of "merge algorithm" for when local and remote changes differ. The OS would only implement a very limited list of merge algorithms, IE one of:

(A) Don't merge, each client gets its own branch and merges are manual
(B) Most recent changed version is made the master every X-seconds/ open/close/write/other-event.
(C) The tree at X (usually a particular client/server) is always used as the master when there are conflicts.

This lets you implement whatever replication policy you want: You can require that some files are replicated (cached) on *EVERY* system, you can require that other files are cached on at least X systems. You can say "this needs to be replicated on at least X% of the online systems, or at most Y". Moreover, the replication could be done pretty easily from userspace via a couple syscalls. You also automatically keep track of history with some default purge policy.

The main point is that for efficiency and speed things are *not* always replicated; this also allows for offline operation. You would of course have "userspace" merge drivers which notice that the tree on your laptop is not a subset/superset of the tree on your desktop and do various merges based on per-file metadata. My address-book, for example, would have a custom little merge program which knows about how to merge changes between two address book files, asking me useful questions along the way. Since a lot of this merging is mechanical, some of the code from GIT could easily be made into a "merge library" which knows how to do such things.

Moreover, this would allow me to have a "shared" root filesystem on my laptop and desktop. It would have 'sub-project'-type trees, so that "/" would be an independent branch on each system. "/etc" would be separate branches but manually merged git-style as I make changes. "/home/*" folders would be auto-created as separate subtrees so each user can version their own individually. Specific subfolders (like address-book, email, etc) would be adjusted by the GUI programs that manage them to be separate subtrees with manual- merging controlled by that GUI program.

Backups/dumps/archival of such a system would be easy. You would just need to clone the significant commits/trees/etc to a DVD and replace the old SHA-1-indexed objects to tiny "object-deleted" stubs; to rollback to an archived version you insert the DVD, "mount" it into the existing kernel SHA-1 index, and then mount the appropriate commit as a read-only volume somewhere to access. The same procedure would also work for wide-area-network backups and such.

The effective result would be the ability to do things like the following:
(A) Have my homedir synced between both systems mostly- automatically as I make changes to different files on both systems
(B) Easily have 2 copies of all my files, so if one system's disk goes kaput I can just re-clone from the other.
(C) Keep archived copies of the last 5 years worth of work, including change history, on a stack of DVDs.
(D) Synchronize work between locations over a relatively slow link without much work.

As long as files were indirectly indexed by sub-block SHA1 (with the index depth based on the size of the file), and each individually- SHA1-ed object could have references, you could trivially have a 4TB- sized file where you modify 4 bytes at a thousand random locations throughout the file and only have to update about 5MB worth of on- disk data. The actual overhead for that kind of operation under any existing filesystem would be 100% seek-dominated regardless whereas with this mechanism you would not directly be overwriting data and so you could append all the updates as a single 5MB chunk. Data reads would be much more seek-y, but you could trivially have an on-line defragmenter tool which notices fragmented commonly-accessed inode objects and creates non-fragmented copies before deleting the old ones.

There's a lot of other technical details which would need resolution in an actual implementation, but this is enough of a summary to give you the gist of the concept. Most likely there will be some major flaw which makes it impossible to produce reliably, but the concept contains the things I would be interested in for a real "networked filesystem".

Cheers,
Kyle Moffett
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/