Hi Jeff,
On Wed, Oct 15, 2003 at 11:13:27AM -0400, Jeff Garzik wrote:
> Josh and others should take a look at Plan9's venti file storage
> method -- archival storage is a series of unordered blocks, all of
> which are indexed by the sha1 hash of their contents. This magically
> coalesces all duplicate blocks by its very nature, including the
> loooooong runs of zeroes that you'll find in many filesystems. I bet
> savings on "all bytes in this block are zero" are worth a bunch right
> there.
I had a few ideas on the above.
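For concreteness, the venti scheme boils down to something like the
sketch below. This is my own toy reconstruction, not venti's code: it
assumes a fixed 8K block size, uses OpenSSL's SHA1() for the score, and
stands a linear in-memory table in for venti's real on-disk index:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/sha.h>

#define BLKSZ 8192

struct entry {
	unsigned char score[SHA_DIGEST_LENGTH];	/* sha1 of the contents */
	unsigned char data[BLKSZ];
};

static struct entry *store;
static size_t nentries;

/* Store one block; identical contents hash to the same score, so
 * duplicates coalesce for free. Returns the block's index. */
static size_t venti_write(const unsigned char *blk)
{
	unsigned char score[SHA_DIGEST_LENGTH];
	size_t i;

	SHA1(blk, BLKSZ, score);
	for (i = 0; i < nentries; i++)	/* toy linear scan, not a real index */
		if (!memcmp(store[i].score, score, SHA_DIGEST_LENGTH))
			return i;
	store = realloc(store, (nentries + 1) * sizeof(*store));	/* no error checking, it's a sketch */
	memcpy(store[nentries].score, score, SHA_DIGEST_LENGTH);
	memcpy(store[nentries].data, blk, BLKSZ);
	return nentries++;
}

int main(void)
{
	unsigned char zero[BLKSZ] = { 0 }, one[BLKSZ] = { 1 };
	size_t a = venti_write(zero);
	size_t b = venti_write(one);
	size_t c = venti_write(zero);

	printf("%zu %zu %zu (%zu blocks stored)\n", a, b, c, nentries);
	/* prints: 0 1 0 (2 blocks stored) */
	return 0;
}

The point is that venti_write() never stores the same contents twice,
so the 100M of zeroes from the dd example below would cost a single 8K
block plus the index entries.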
If the zero blocks are the problem, there's a tool called zum that
nukes them and replaces them with holes. I use it sometimes, for
example:
andrea@velociraptor:~> dd if=/dev/zero of=zero bs=1M count=100
100+0 records in
100+0 records out
andrea@velociraptor:~> ls -ls zero
102504 -rw-r--r-- 1 andrea andrea 104857600 2003-10-16 18:24 zero
andrea@velociraptor:~> ~/bin/i686/zum zero
zero [820032K] [1 link]
andrea@velociraptor:~> ls -ls zero
0 -rw-r--r-- 1 andrea andrea 104857600 2003-10-16 18:24 zero
andrea@velociraptor:~>
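I don't have zum's source at hand, so the following is only my guess at
the mechanism: rewrite the file into a temporary one, lseek() over
every all-zero block instead of writing it (which leaves a hole),
ftruncate() to pin the size in case the file ends in a hole, then
rename() over the original. Something like:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLKSZ 4096

int main(int argc, char **argv)
{
	static const char zero[BLKSZ];
	char buf[BLKSZ], tmp[4096];
	off_t size = 0;
	ssize_t n;
	int in, out;

	if (argc != 2) {
		fprintf(stderr, "usage: %s file\n", argv[0]);
		return 1;
	}
	snprintf(tmp, sizeof(tmp), "%s.zum", argv[1]);
	in = open(argv[1], O_RDONLY);
	out = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (in < 0 || out < 0) {
		perror("open");
		return 1;
	}
	while ((n = read(in, buf, sizeof buf)) > 0) {
		if (n == BLKSZ && !memcmp(buf, zero, BLKSZ))
			lseek(out, n, SEEK_CUR);	/* leave a hole */
		else if (write(out, buf, n) != n) {
			perror("write");
			return 1;
		}
		size += n;
	}
	ftruncate(out, size);	/* pin the size if the file ends in a hole */
	close(in);
	close(out);
	if (rename(tmp, argv[1])) {
		perror("rename");
		return 1;
	}
	return 0;
}

Note that rename() silently detaches any other hard links (and this
sketch doesn't preserve mode/owner either), which is presumably why zum
prints the link count in its output above.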
The hash of the data is interesting, but 1) you lose the zerocopy
behaviour for the I/O: it's like computing a checksum on all the data
going to disk, something you'd normally never do (except for the tiny
files in reiserfs with tail packing enabled, but that's not bulk I/O),
and 2) I wonder how much data is really duplicated besides the "zero"
holes that are trivially fixable in userspace (modulo bzImage or
similar, where I'm unsure whether the fs code in the bootloader can
handle holes ;).
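2) is easy to measure rather than guess: sha1 every block of a file or
a whole block device, sort the digests and count the repeats, keeping
the all-zero block separate. A quick hack along these lines (again
assuming OpenSSL's SHA1() and a 4K block size; a trailing partial block
is simply ignored):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/sha.h>

#define BLKSZ 4096

static int cmp(const void *a, const void *b)
{
	return memcmp(a, b, SHA_DIGEST_LENGTH);
}

int main(int argc, char **argv)
{
	static const unsigned char zero[BLKSZ];
	unsigned char buf[BLKSZ], zero_score[SHA_DIGEST_LENGTH];
	unsigned char *scores = NULL;
	size_t n = 0, dup = 0, zerodup = 0, i;
	FILE *f;

	if (argc != 2 || !(f = fopen(argv[1], "r"))) {
		fprintf(stderr, "usage: %s file-or-device\n", argv[0]);
		return 1;
	}
	SHA1(zero, BLKSZ, zero_score);
	while (fread(buf, BLKSZ, 1, f) == 1) {
		scores = realloc(scores, (n + 1) * SHA_DIGEST_LENGTH);	/* grows one digest at a time, fine for a hack */
		SHA1(buf, BLKSZ, scores + n * SHA_DIGEST_LENGTH);
		n++;
	}
	if (n)
		qsort(scores, n, SHA_DIGEST_LENGTH, cmp);
	for (i = 1; i < n; i++) {	/* count each repeat beyond the first copy */
		unsigned char *s = scores + i * SHA_DIGEST_LENGTH;
		if (cmp(s, s - SHA_DIGEST_LENGTH))
			continue;
		if (!memcmp(s, zero_score, SHA_DIGEST_LENGTH))
			zerodup++;
		else
			dup++;
	}
	printf("%zu blocks: %zu duplicate zero, %zu duplicate non-zero\n",
	       n, zerodup, dup);
	return 0;
}

Running it on a real partition would tell how much the content hashing
buys beyond what zum-style hole punching already gets you for free.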