Re: Distributed storage. Move away from char device ioctls.
From: Kyle Moffett
Date: Sun Sep 16 2007 - 03:09:03 EST
On Sep 15, 2007, at 13:24:46, Andreas Dilger wrote:
On Sep 15, 2007 16:29 +0400, Evgeniy Polyakov wrote:
Yes, block device itself is not able to scale well, but it is the
place for redundancy, since filesystem will just fail if
underlying device does not work correctly and FS actually does not
know about where it should place redundancy bits - it might happen
to be the same broken disk, so I created a low-level device which
distribute requests itself.
I actually think there is a place for this - and improvements are
definitely welcome. Even Lustre needs block-device level
redundancy currently, though we will be working to make Lustre-
level redundancy available in the future (the problem is WAY harder
than it seems at first glance, if you allow writeback caches at the
clients and servers).
I really think that to get proper non-block-device-level filesystem
redundancy you need to base it on something similar to the GIT
model. Data replication is done in specific-sized chunks indexed by
SHA-1 sum and you actually have a sort of "merge algorithm" for when
local and remote changes differ. The OS would only implement a very
limited list of merge algorithms, IE one of:
(A) Don't merge, each client gets its own branch and merges are manual
(B) Most recent changed version is made the master every X-seconds/
open/close/write/other-event.
(C) The tree at X (usually a particular client/server) is always
used as the master when there are conflicts.
This lets you implement whatever replication policy you want: You
can require that some files are replicated (cached) on *EVERY*
system, you can require that other files are cached on at least X
systems. You can say "this needs to be replicated on at least X% of
the online systems, or at most Y". Moreover, the replication could
be done pretty easily from userspace via a couple syscalls. You also
automatically keep track of history with some default purge policy.
The main point is that for efficiency and speed things are *not*
always replicated; this also allows for offline operation. You would
of course have "userspace" merge drivers which notice that the tree
on your laptop is not a subset/superset of the tree on your desktop
and do various merges based on per-file metadata. My address-book,
for example, would have a custom little merge program which knows
about how to merge changes between two address book files, asking me
useful questions along the way. Since a lot of this merging is
mechanical, some of the code from GIT could easily be made into a
"merge library" which knows how to do such things.
Moreover, this would allow me to have a "shared" root filesystem on
my laptop and desktop. It would have 'sub-project'-type trees, so
that "/" would be an independent branch on each system. "/etc" would
be separate branches but manually merged git-style as I make
changes. "/home/*" folders would be auto-created as separate
subtrees so each user can version their own individually. Specific
subfolders (like address-book, email, etc) would be adjusted by the
GUI programs that manage them to be separate subtrees with manual-
merging controlled by that GUI program.
Backups/dumps/archival of such a system would be easy. You would
just need to clone the significant commits/trees/etc to a DVD and
replace the old SHA-1-indexed objects to tiny "object-deleted" stubs;
to rollback to an archived version you insert the DVD, "mount" it
into the existing kernel SHA-1 index, and then mount the appropriate
commit as a read-only volume somewhere to access. The same procedure
would also work for wide-area-network backups and such.
The effective result would be the ability to do things like the
following:
(A) Have my homedir synced between both systems mostly-
automatically as I make changes to different files on both systems
(B) Easily have 2 copies of all my files, so if one system's disk
goes kaput I can just re-clone from the other.
(C) Keep archived copies of the last 5 years worth of work,
including change history, on a stack of DVDs.
(D) Synchronize work between locations over a relatively slow
link without much work.
As long as files were indirectly indexed by sub-block SHA1 (with the
index depth based on the size of the file), and each individually-
SHA1-ed object could have references, you could trivially have a 4TB-
sized file where you modify 4 bytes at a thousand random locations
throughout the file and only have to update about 5MB worth of on-
disk data. The actual overhead for that kind of operation under any
existing filesystem would be 100% seek-dominated regardless whereas
with this mechanism you would not directly be overwriting data and so
you could append all the updates as a single 5MB chunk. Data reads
would be much more seek-y, but you could trivially have an on-line
defragmenter tool which notices fragmented commonly-accessed inode
objects and creates non-fragmented copies before deleting the old ones.
There's a lot of other technical details which would need resolution
in an actual implementation, but this is enough of a summary to give
you the gist of the concept. Most likely there will be some major
flaw which makes it impossible to produce reliably, but the concept
contains the things I would be interested in for a real "networked
filesystem".
Cheers,
Kyle Moffett
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/