Re: Transparent compression in the FS

From: John Bradford
Date: Fri Oct 17 2003 - 12:19:19 EST


Quote from Eli Carter <eli.carter@xxxxxxxx>:
> John Bradford wrote:
> >>>The upshot of all that would be that if you needed space, it would be
> >>>there, (just overwrite the uncompressed versions of files), but until
> >>>you do, you can access the uncompressed data quickly.
> >>>
> >>>You could even take it one step further, and compress files with gzip
> >>>by default, and re-compress them with bzip2 after long periods of
> >>>inactivity.
> >>
> >>Note that a file compressed with bzip2 is not necessarily smaller than
> >>the same file compressed with gzip. (It can be quite a bit larger in fact.)
> >
> >
> > Have you noticed that with real-life data, or only test cases?
>
> Real-life data. I don't remember the exact details for certain, but as
> best as I can recall: I was dealing with copies of output from build
> logs, telnet sessions, messages files, or the like (i.e. text) that were
> (many,) many MB in size (and probably highly repetitititititive). I
> wound up with a loop that compressed each file into a gzip and a bzip2,
> compared the sizes, and killed the larger. There were a number of .gz's
> that won. (I have also read that gzip is better at text compression
> whereas bzip2 is better at binary compression. No, I don't remember the
> source.)

Wow, I'm really surprised; I've always had good results with text,
although quite possibly not as repetitive as yours. I have noticed
that incompressible data, such as output from /dev/random, is almost
always expanded to a greater extent by bzip2, which is why I asked.
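
For what it's worth, the loop you describe is only a few lines. A
rough sketch in Python (purely illustrative - the file names are
made up, and I've no idea what you actually ran):

    import bz2, gzip

    # Compress each file both ways and keep whichever result is smaller.
    for name in ["build.log", "session.txt"]:    # hypothetical inputs
        data = open(name, "rb").read()
        gz = gzip.compress(data)
        bz = bz2.compress(data)
        if len(gz) <= len(bz):
            open(name + ".gz", "wb").write(gz)
        else:
            open(name + ".bz2", "wb").write(bz)
        print(name, len(data), "gzip:", len(gz), "bzip2:", len(bz))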

> But that is immaterial... You have to deal with the case where the
> 'better' algorithm gives 'worse' results (by size). Keep in mind that
> some data won't compress at all (for a given algorithm), and winds up
> needing more space in the compressed form. (In which case we add a byte
> to say "this is not compressed" and keep the original form.)
>
> uncompressed -> gzip; gzip -> bzip2 would be by far the normal case
> But, sometimes gzip can't get it any smaller, or would increase the
> size. (Keep in mind we may be storing a file that is already compressed...)
>
> So your scheme needs to note when compression fails so it doesn't try
> again, so we see:
>
> uncompressed -> gzip or uncompressed(gzip failed)
> gzip -> bzip2 or gzip(bzip2 failed)
> uncompressed(gzip failed) -> bzip2 or uncompressed(bzip2 failed)
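
Right - each file effectively carries a little state machine. A rough
sketch of those transitions (Python, purely illustrative; the state
names, the 'failed' set and passing in both the original and the
stored form are my inventions, and a real implementation would keep
this in the FS metadata rather than in userspace):

    import bz2, gzip

    def next_state(original, stored, state, failed):
        """One transition of the scheme above: try the next (slower)
        compressor, remembering failures so we never retry them."""
        if state == "uncompressed" and "gzip" not in failed:
            out = gzip.compress(original)          # uncompressed -> gzip?
            if len(out) < len(stored):
                return out, "gzip", failed
            failed.add("gzip")                     # uncompressed (gzip failed)
        elif state in ("uncompressed", "gzip") and "bzip2" not in failed:
            out = bz2.compress(original)           # -> bzip2?
            if len(out) < len(stored):
                return out, "bzip2", failed
            failed.add("bzip2")                    # (bzip2 failed)
        return stored, state, failed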

It might also be worth only using a much slower compressor if we get
at least N% better results. Saving literally 0.1% of the size at the
expense of a 5x worse decompression time is unlikely to be worthwhile
in most cases.
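
I.e. something along these lines when deciding whether to keep a
bzip2 result at all (the function name and the 5% default are just
for illustration):

    def worth_switching(gzip_size, bzip2_size, min_gain=0.05):
        # Only pay bzip2's slower decompression if it beats the gzip'd
        # size by at least min_gain; the 5% figure is arbitrary.
        return bzip2_size <= gzip_size * (1.0 - min_gain)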

> If it were me, I'd do it with one compression algorithm as a
> proof-of-concept, then add a second, and then generalize it to N cases
> (which would not be hard once the two-case version was done).
>
> But I must say, I like your idea of keeping the uncompressed form around
> until we need the space. (I'd also want to track reads separately from
> writes.)

Yes - one write would be worth a lot of reads in terms of keeping it
uncompressed.
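
Something vaguely like this for scoring activity, with completely
made-up weights, just to show the shape of the idea:

    def activity_score(reads, writes, read_weight=1, write_weight=50):
        # Weight a single write as heavily as many reads when deciding
        # whether the uncompressed copy is still worth keeping.
        return reads * read_weight + writes * write_weight

and then, when space is needed, reclaim the uncompressed copies with
the lowest scores first.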

Heh, it might also help if you get a bad sector on one of the copies
:-).

John.