Re: Kernel SCM saga..
From: Ingo Molnar
Date: Sun Apr 10 2005 - 06:36:45 EST
* David S. Miller <davem@xxxxxxxxxxxxx> wrote:
> On Fri, 8 Apr 2005 22:45:18 -0700 (PDT)
> Linus Torvalds <torvalds@xxxxxxxx> wrote:
> > Also, I don't want people editing repository files by hand. Sure, the
> > sha1 catches it, but still... I'd rather force the low-level ops to use
> > the proper helper routines. Which is why it's a raw zlib compressed blob,
> > not a gzipped file.
> I understand the arguments for compression, but I hate it for one
> simple reason: recovery is more difficult when you corrupt some
> file in your repository.
> It's happened to me more than once and I did lose data.
> Without compression, I might be able to recover if something
> causes a block of zeros to be written to the middle of some
> repository file. With compression, you pretty much just lose.
That depends on how you compress. You are perfectly right that with
default zlib compression, where you start the compression stream and
stop it at the end of the file, recovery in case of damage is very hard
for the portion that comes _after_ the damaged section. You'd have to
reconstruct the compression state which is akin to breaking a key.
But with zlib you can 'flush' the compression state every couple of
blocks and basically get the same recovery properties, at some very
minimal extra space cost (because when you flush out compression state
you get some extra padding bytes).
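A minimal sketch of the periodic flush, assuming Python's zlib module as a stand-in for the C library. The helper name compress_with_flush and the 4096-byte block size are illustrative, not from the original mail. It uses Z_FULL_FLUSH, the mode the zlib manual documents as resetting the compression state so that decompression can restart from that point:

```python
import zlib

def compress_with_flush(data: bytes, block_size: int = 4096) -> bytes:
    """Deflate `data`, resetting the compressor state after every block."""
    comp = zlib.compressobj()
    out = bytearray()
    for i in range(0, len(data), block_size):
        out += comp.compress(data[i:i + block_size])
        # Z_FULL_FLUSH byte-aligns the output and drops the history window,
        # so nothing emitted after this point refers back to earlier data.
        out += comp.flush(zlib.Z_FULL_FLUSH)
    out += comp.flush(zlib.Z_FINISH)
    return bytes(out)

# Hypothetical sample input standing in for a repository file.
data = b"".join(b"line %06d of some repository file\n" % i for i in range(2000))
packed = compress_with_flush(data)

# The result is still one ordinary zlib stream.
assert zlib.decompress(packed) == data
```

An undamaged file is read back with a plain zlib.decompress call; the flush points only matter once something has gone wrong.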
Flushing has another advantage as well: a small delta (even if it
increases/decreases the file size!) in the middle of a larger file will
still be compressed to the same output both before and after the change
area (modulo flush block size), which rsync can pick up just fine. (IIRC
that is one of the reasons why Debian, when compressing .deb's, does
zlib-flushes every couple of blocks, so that rsync/apt-get can pick up
partial .deb's as well.)
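The effect on unchanged regions can be checked with a small sketch, again assuming Python's zlib module and Z_FULL_FLUSH at fixed 4096-byte boundaries (compress_blocks is a hypothetical helper). It only covers a same-size edit: with fixed boundaries, an edit that changes the file size shifts every later block, so handling that case would need content-defined block boundaries, which this sketch does not attempt.

```python
import zlib

def compress_blocks(data: bytes, block_size: int = 4096) -> list:
    """Return one compressed chunk per input block (stream trailer omitted)."""
    comp = zlib.compressobj()
    chunks = []
    for i in range(0, len(data), block_size):
        chunks.append(comp.compress(data[i:i + block_size])
                      + comp.flush(zlib.Z_FULL_FLUSH))
    return chunks

# Hypothetical sample input standing in for a repository file.
old = b"".join(b"line %06d of some repository file\n" % i for i in range(2000))
new = bytearray(old)
new[20000:20004] = b"EDIT"   # same-size change inside block 4 (bytes 16384..20479)
new = bytes(new)

a, b = compress_blocks(old), compress_blocks(new)
# Indices of the blocks whose compressed bytes differ between the two versions.
changed = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
print(changed)
```

Only the block containing the edit should be reported; the chunks before and after it are byte-identical, which is what lets rsync skip them.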
The zlib option is, I think, Z_PARTIAL_FLUSH; I'm using it in Tux to do
chunks of compression. The flushing cost is max 12 bytes or so, so if
it's done every 4K the cost is capped at about 0.3%.
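The arithmetic can be checked, and compared against a measurement, with a short sketch. It assumes Python's zlib module and uses Z_FULL_FLUSH rather than Z_PARTIAL_FLUSH, because the full flush is the mode zlib documents as resetting the state. flush_overhead is a hypothetical helper and the sample data is made up. The measured figure is typically higher than the padding bound, since a full flush also discards the history window.

```python
import zlib

def flush_overhead(data: bytes, block_size: int = 4096):
    """Return (plain_size, flushed_size, number_of_flushes)."""
    plain = len(zlib.compress(data))
    comp = zlib.compressobj()
    flushed = 0
    flushes = 0
    for i in range(0, len(data), block_size):
        flushed += len(comp.compress(data[i:i + block_size]))
        flushed += len(comp.flush(zlib.Z_FULL_FLUSH))
        flushes += 1
    flushed += len(comp.flush(zlib.Z_FINISH))
    return plain, flushed, flushes

# Padding bound from the text: about 12 bytes per 4096-byte block.
print(f"padding bound: {12 / 4096:.2%}")   # prints "padding bound: 0.29%"

# Hypothetical sample input; real repository files will behave differently.
data = b"".join(b"line %06d of some repository file\n" % i for i in range(2000))
plain, flushed, flushes = flush_overhead(data)
print(f"plain={plain} flushed={flushed} extra per flush={(flushed - plain) / flushes:.1f}")
```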
So flushing is both rsync-friendly and recovery-friendly.
(Recovery isn't as simple as with plaintext, as you have to find the next
'block', and the block length will inevitably be variable. But it should
be pretty predictable, and tools might even exist.)
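Finding the next block could look roughly like the sketch below. It assumes the file was written with Z_FULL_FLUSH, that the end of the damaged region is known, and it relies on the empty stored block (bytes 00 00 FF FF) that zlib emits at a sync or full flush. recover_tail and compress_with_flush are hypothetical helpers, not an existing tool.

```python
import zlib

SYNC_MARKER = b"\x00\x00\xff\xff"   # tail of the empty stored block a flush emits

def compress_with_flush(data: bytes, block_size: int = 4096) -> bytes:
    comp = zlib.compressobj()
    out = bytearray()
    for i in range(0, len(data), block_size):
        out += comp.compress(data[i:i + block_size])
        out += comp.flush(zlib.Z_FULL_FLUSH)
    out += comp.flush(zlib.Z_FINISH)
    return bytes(out)

def recover_tail(damaged: bytes, damage_end: int) -> bytes:
    """Inflate everything after the first usable flush point past the damage."""
    pos = damaged.find(SYNC_MARKER, damage_end)
    while pos != -1:
        start = pos + len(SYNC_MARKER)
        try:
            # Raw deflate (negative wbits): there is no zlib header mid-stream.
            d = zlib.decompressobj(-zlib.MAX_WBITS)
            tail = d.decompress(damaged[start:])
            if d.eof:                  # reached the final deflate block cleanly
                return tail
        except zlib.error:
            pass                       # marker-like bytes by chance; keep looking
        pos = damaged.find(SYNC_MARKER, pos + 1)
    return b""

# Hypothetical sample input, then a block of zeros written over the middle.
data = b"".join(b"line %06d of some repository file\n" % i for i in range(2000))
packed = bytearray(compress_with_flush(data))
mid = len(packed) // 2
packed[mid:mid + 64] = bytes(64)

tail = recover_tail(bytes(packed), mid + 64)
print(len(tail), data.endswith(tail))
```

The Adler-32 trailer is left unchecked here, so a real tool would want some other way to validate what it recovered (for this repository format, the sha1 of the object).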