Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem

From: Gao Xiang
Date: Mon Jan 23 2023 - 18:59:18 EST




On 2023/1/24 01:56, Alexander Larsson wrote:
On Fri, 2023-01-20 at 21:44 +0200, Amir Goldstein wrote:
On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <alexl@xxxxxxxxxx>
wrote:

Giuseppe Scrivano and I have recently been working on a new project
we call composefs. This is the first time we propose this publicly
and we would like some feedback on it.


Hi Alexander,

I must say that I am a little bit puzzled by this v3.
Gao, Christian and myself asked you questions on v2
that are not mentioned in v3 at all.

I got lots of good feedback from Dave Chinner on V2 that caused rather
large changes to simplify the format. So I wanted the new version with
those changes out to continue that review. I think also having that
simplified version will be helpful for the general discussion.

To sum it up, please do not propose composefs without explaining
what the barriers are to achieving the exact same outcome with
the use of a read-only overlayfs with two lower layers -
uppermost with erofs containing the metadata files, which include
trusted.overlay.metacopy and trusted.overlay.redirect xattrs that
refer to the lowermost layer containing the content files.
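
For concreteness, the two-layer setup described above could be mounted
roughly as sketched below. This is only an illustration: the three
paths are hypothetical, and the helper just builds the argument list
without running it. The lowerdir, metacopy and redirect_dir mount
options are the ones described in the overlayfs documentation; whether
this exact combination is sufficient for the use case is an assumption.

```python
def overlay_mount_argv(metadata_dir, objects_dir, mountpoint):
    """Build (but do not run) a mount command for a two-lower-layer overlay.

    All three paths are hypothetical placeholders chosen for illustration.
    """
    # lowerdir lists layers from uppermost to lowermost, separated by ':'.
    # Without an upperdir the overlay is read-only.
    options = ",".join([
        "ro",
        f"lowerdir={metadata_dir}:{objects_dir}",
        "metacopy=on",          # honour trusted.overlay.metacopy
        "redirect_dir=follow",  # follow trusted.overlay.redirect
    ])
    return ["mount", "-t", "overlay", "overlay", "-o", options, mountpoint]


argv = overlay_mount_argv("/mnt/erofs-meta", "/srv/objects", "/mnt/image")
print(" ".join(argv))
```

The metadata layer (for example a loop-mounted erofs image) has to be
mounted first; that step is left out here.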


...


I would say both versions of this can work. There are some minor
technical issues with the overlay option:

* To get actual verification of the backing files you would need to
add support to overlayfs for a "trusted.overlay.digest" xattr, with
behaviour similar to composefs.

* mkfs.erofs doesn't support sparse files (not sure if the kernel code
does), which means it is not a good option for backing all these
sparse files. Squashfs seems to support this though, so that is an
option.

EROFS supports chunk-based files; you can actually use this feature to
represent sparse files if really needed.
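
To illustrate the idea (this is a toy model, not the EROFS on-disk
format): a chunk-based file is a table of fixed-size chunks, and a
sparse region can be expressed as a table entry with no backing chunk.
The chunk size and all names below are assumptions made up for the
sketch.

```python
CHUNK_SIZE = 4096  # hypothetical chunk size, for illustration only


def read_chunked(chunk_table, backing, offset, length):
    """Read from a file described by a chunk table.

    chunk_table[i] is an index into `backing`, or None for a hole.
    Holes read back as zero bytes and occupy no backing storage.
    """
    out = bytearray()
    pos, end = offset, offset + length
    while pos < end:
        idx, within = divmod(pos, CHUNK_SIZE)
        n = min(CHUNK_SIZE - within, end - pos)
        slot = chunk_table[idx]
        if slot is None:
            out += bytes(n)  # hole: synthesize zeros
        else:
            out += backing[slot][within:within + n]
        pos += n
    return bytes(out)


backing = [b"A" * CHUNK_SIZE, b"B" * CHUNK_SIZE]
table = [0, None, None, 1]  # a 16 KiB file that stores only 8 KiB
data = read_chunked(table, backing, CHUNK_SIZE - 2, 4)
print(data)  # b'AA\x00\x00'
```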

Currently neither the Android use cases nor OCI v1 needs this feature,
but you can simply use ext4. I don't think squashfs is a good option
here, since it doesn't optimize directory lookup at all.


However, the main issue I have with the overlayfs approach is that it
is sort of clumsy and over-complex. Basically, the composefs approach
is laser focused on read-only images, whereas the overlayfs approach
just chains together technologies that happen to work, but also do a
lot of other stuff. The result is that it is more work to use it, it
uses more kernel objects (mounts, dm devices, loopbacks) and it has
worse performance.

To measure performance I created a largish image (2.6 GB centos9
rootfs) and mounted it via composefs, as well as overlay-over-squashfs,
both backed by the same objects directory (on xfs).

If I clear all caches between each run, a `ls -lR` run on composefs
runs in around 700 msec:

# hyperfine -i -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR cfs-mount"
Benchmark 1: ls -lR cfs-mount
Time (mean ± σ): 701.0 ms ± 21.9 ms [User: 153.6 ms, System: 373.3 ms]
Range (min … max): 662.3 ms … 725.3 ms 10 runs

Whereas the same run with overlayfs takes almost four times as long:

# hyperfine -i -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR ovl-mount"
Benchmark 1: ls -lR ovl-mount
Time (mean ± σ): 2.738 s ± 0.029 s [User: 0.176 s, System: 1.688 s]
Range (min … max): 2.699 s … 2.787 s 10 runs

With the page cache kept warm between runs the difference is smaller,
but still there:

# hyperfine "ls -lR cfs-mnt"
Benchmark 1: ls -lR cfs-mnt
Time (mean ± σ): 390.1 ms ± 3.7 ms [User: 140.9 ms, System: 247.1 ms]
Range (min … max): 381.5 ms … 393.9 ms 10 runs

vs

# hyperfine -i "ls -lR ovl-mount"
Benchmark 1: ls -lR ovl-mount
Time (mean ± σ): 431.5 ms ± 1.2 ms [User: 124.3 ms, System: 296.9 ms]
Range (min … max): 429.4 ms … 433.3 ms 10 runs
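
As a quick check of the figures quoted above, the ratios can be worked
out from the mean times in the hyperfine output. The variable names
are just for illustration; the numbers are copied from the four runs.

```python
# Mean wall-clock times in milliseconds, from the hyperfine output above.
cold_cfs_ms = 701.0    # composefs, caches dropped before each run
cold_ovl_ms = 2738.0   # overlay-over-squashfs, caches dropped (2.738 s)
warm_cfs_ms = 390.1    # composefs, warm page cache
warm_ovl_ms = 431.5    # overlay-over-squashfs, warm page cache

cold_ratio = cold_ovl_ms / cold_cfs_ms
warm_ratio = warm_ovl_ms / warm_cfs_ms

print(f"cold cache: overlayfs takes {cold_ratio:.2f}x as long")  # 3.91x
print(f"warm cache: overlayfs takes {warm_ratio:.2f}x as long")  # 1.11x
```

So the cold-cache gap is a bit under 4x, and the warm-cache gap is
roughly 10%.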

This isn't all that strange, as overlayfs does a lot more work for
each lookup, including multiple name lookups as well as several xattr
lookups, whereas composefs just does a single lookup in a pre-computed
table. But, given that we don't need any of the other features of
overlayfs here, this performance loss seems rather unnecessary.
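
A toy model of that difference, counting operations per resolved file.
This is only a sketch: the classes, the object path and the exact
operation counts are assumptions for illustration and do not reflect
the real kernel code paths of either filesystem.

```python
class CountingLayer:
    """A fake filesystem layer that counts name and xattr lookups."""

    def __init__(self, entries):
        self.entries = entries  # name -> dict of xattrs
        self.ops = 0

    def lookup(self, name):
        self.ops += 1
        return self.entries.get(name)

    def getxattr(self, name, key):
        self.ops += 1
        return self.entries[name].get(key)


def overlay_style_lookup(meta, data, name):
    """Resolve name via a metadata layer plus redirect into a data layer."""
    if meta.lookup(name) is None:
        return None
    if meta.getxattr(name, "trusted.overlay.metacopy") is None:
        return name  # not a metacopy file, content lives in this layer
    target = meta.getxattr(name, "trusted.overlay.redirect")
    data.lookup(target)  # second name lookup, in the lower layer
    return target


def table_style_lookup(table, name):
    """Resolve name with a single lookup in a pre-computed table."""
    return table.get(name)


obj = "objects/ab/cdef"  # hypothetical content-addressed path
meta = CountingLayer({"usr/bin/ls": {"trusted.overlay.metacopy": b"",
                                     "trusted.overlay.redirect": obj}})
data = CountingLayer({obj: {}})

target = overlay_style_lookup(meta, data, "usr/bin/ls")
overlay_ops = meta.ops + data.ops
print(target, overlay_ops)  # objects/ab/cdef 4
print(table_style_lookup({"usr/bin/ls": obj}, "usr/bin/ls"))
```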

You should give ext4 a try first.

Thanks,
Gao Xiang