Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem

From: Amir Goldstein
Date: Tue Jan 24 2023 - 14:07:04 EST


On Tue, Jan 24, 2023 at 3:13 PM Alexander Larsson <alexl@xxxxxxxxxx> wrote:
>
> On Tue, 2023-01-24 at 05:24 +0200, Amir Goldstein wrote:
> > On Mon, Jan 23, 2023 at 7:56 PM Alexander Larsson <alexl@xxxxxxxxxx>
> > wrote:
> > >
> > > On Fri, 2023-01-20 at 21:44 +0200, Amir Goldstein wrote:
> > > > On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson
> > > > <alexl@xxxxxxxxxx>
> > > > wrote:
> > > > >
> > > > > Giuseppe Scrivano and I have recently been working on a new
> > > > > project
> > > > > we
> > > > > call composefs. This is the first time we propose this
> > > > > publicly
> > > > > and
> > > > > we would like some feedback on it.
> > > > >
> > > >
> > > > Hi Alexander,
> > > >
> > > > I must say that I am a little bit puzzled by this v3.
> > > > Gao, Christian and myself asked you questions on v2
> > > > that are not mentioned in v3 at all.
> > >
> > > I got lots of good feedback from Dave Chinner on V2 that caused
> > > rather
> > > large changes to simplify the format. So I wanted the new version
> > > with
> > > those changes out to continue that review. I think also having that
> > > simplified version will be helpful for the general discussion.
> > >
> >
> > That's ok.
> > I was not puzzled about why you posted v3.
> > I was puzzled by why you did not mention anything about the
> > alternatives to adding a new filesystem that were discussed on
> > v2 and argue in favor of the new filesystem option.
> > If you post another version, please make sure to include a good
> > explanation for that.
>
> Sure, I will add something to the next version. But there was already
> a discussion about this; duplicating it in the v3 announcement, when
> the v2->v3 changes are unrelated to it, doesn't seem like it makes
> much of a difference.
>
> > > > To sum it up, please do not propose composefs without explaining
> > > > what the barriers are to achieving the exact same outcome with
> > > > the use of a read-only overlayfs with two lower layers - the
> > > > uppermost with erofs containing the metadata files, which include
> > > > trusted.overlay.metacopy and trusted.overlay.redirect xattrs that
> > > > refer to the lowermost layer containing the content files.
> > >
> > > So, to be more precise, and so that everyone is on the same page,
> > > lemme
> > > state the two options in full.
> > >
> > > For both options, we have a directory "objects" with content-
> > > addressed
> > > backing files (i.e. files named by sha256). In this directory all
> > > files have fs-verity enabled. Additionally there is an image file
> > > which you downloaded to the system that somehow references the
> > > objects
> > > directory by relative filenames.
> > >
> > > Composefs option:
> > >
> > > The image file has fs-verity enabled. To use the image, you mount
> > > it
> > > with options "basedir=objects,digest=$imagedigest".
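> > >
> > > As a rough sketch (exact mount syntax aside, paths are only
> > > illustrative), that would look something like:
> > >
> > >   mount -t composefs image.cfs \
> > >      -o basedir=objects,digest=$imagedigest cfs-mount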
> > >
> > > Overlayfs option:
> > >
> > > The image file is a loopback image of a gpt disk with two
> > > partitions,
> > > one partition contains the dm-verity hashes, and the other
> > > contains
> > > some read-only filesystem.
> > >
> > > The read-only filesystem has regular versions of directories and
> > > symlinks, but for regular files it has sparse files with the
> > > xattrs
> > > "trusted.overlay.metacopy" and "trusted.overlay.redirect" set, the
> > > latter containing a string like "/de/adbeef..." referencing a
> > > backing file in the "objects" directory. In addition, the image
> > > also
> > > contains overlayfs whiteouts to cover any toplevel filenames from
> > > the
> > > objects directory that would otherwise appear if objects is used
> > > as
> > > a lower dir.
> > >
> > > To use this you loopback mount the file, and use dm-verity to set
> > > up
> > > the combined partitions, which you then mount somewhere. Then you
> > > mount an overlayfs with options:
> > > "metacopy=on,redirect_dir=follow,lowerdir=veritydev:objects"
> > >
> > > I would say both versions of this can work. There are some minor
> > > technical issues with the overlay option:
> > >
> > > * To get actual verification of the backing files you would need to
> > > add support to overlayfs for a "trusted.overlay.digest" xattr,
> > > with
> > > behaviour similar to composefs.
> > >
> > > * mkfs.erofs doesn't support sparse files (not sure if the kernel
> > > code
> > > does), which means it is not a good option for backing all
> > > these
> > > sparse files. Squashfs seems to support this though, so that is an
> > > option.
> > >
> >
> > Fair enough.
> > I wasn't expecting things to work without any changes.
> > Let's first agree that these alone are not a good enough reason to
> > introduce a new filesystem.
> > Let's move on..
>
> Yeah.
>
> > > However, the main issue I have with the overlayfs approach is that
> > > it
> > > is sort of clumsy and over-complex. Basically, the composefs
> > > approach
> > > is laser focused on read-only images, whereas the overlayfs
> > > approach
> > > just chains together technologies that happen to work, but also do
> > > a
> > > lot of other stuff. The result is that it is more work to use, it
> > > uses more kernel objects (mounts, dm devices, loopbacks), and it has
> >
> > Up to this point, it's just hand waving, and a bit annoying if I am
> > being honest.
> > Overlayfs and the metacopy feature were created for the containers
> > use case with a very similar set of requirements - they do not just
> > "happen to work" for the same use case.
> > Please stick to technical arguments when arguing in favor of the new
> > "laser focused" filesystem option.
> >
> > > worse performance.
> > >
> > > To measure performance I created a largish image (2.6 GB centos9
> > > rootfs) and mounted it via composefs, as well as overlay-over-
> > > squashfs,
> > > both backed by the same objects directory (on xfs).
> > >
> > > If I clear all caches between each run, a `ls -lR` run on composefs
> > > runs in around 700 msec:
> > >
> > > # hyperfine -i -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR cfs-
> > > mount"
> > > Benchmark 1: ls -lR cfs-mount
> > > Time (mean ± σ): 701.0 ms ± 21.9 ms [User: 153.6 ms,
> > > System: 373.3 ms]
> > > Range (min … max): 662.3 ms … 725.3 ms 10 runs
> > >
> > > Whereas the same with overlayfs takes almost four times as long:
> >
> > No, it is not overlayfs, it is overlayfs+squashfs; please stick to
> > the facts.
> > As Gao wrote, squashfs does not optimize directory lookup.
> > You can run a test with ext4 as a POC, as Gao suggested.
> > I am sure that mkfs.erofs sparse file support can be added if needed.
>
> New measurements follow; they now also include erofs over loopback,
> although that isn't strictly fair, because that image is much larger
> since it doesn't store the files sparsely. They also include a version
> where the topmost lower layer is directly on the backing xfs (i.e. not
> via loopback). I attached the scripts used to create the images and do
> the profiling in case anyone wants to reproduce.
>
> Here are the results (on x86-64, xfs base fs):
>
> overlayfs + loopback squashfs - uncached
> Benchmark 1: ls -lR mnt-ovl
> Time (mean ± σ): 2.483 s ± 0.029 s [User: 0.167 s, System: 1.656 s]
> Range (min … max): 2.427 s … 2.530 s 10 runs
>
> overlayfs + loopback squashfs - cached
> Benchmark 1: ls -lR mnt-ovl
> Time (mean ± σ): 429.2 ms ± 4.6 ms [User: 123.6 ms, System: 295.0 ms]
> Range (min … max): 421.2 ms … 435.3 ms 10 runs
>
> overlayfs + loopback ext4 - uncached
> Benchmark 1: ls -lR mnt-ovl
> Time (mean ± σ): 4.332 s ± 0.060 s [User: 0.204 s, System: 3.150 s]
> Range (min … max): 4.261 s … 4.442 s 10 runs
>
> overlayfs + loopback ext4 - cached
> Benchmark 1: ls -lR mnt-ovl
> Time (mean ± σ): 528.3 ms ± 4.0 ms [User: 143.4 ms, System: 381.2 ms]
> Range (min … max): 521.1 ms … 536.4 ms 10 runs
>
> overlayfs + loopback erofs - uncached
> Benchmark 1: ls -lR mnt-ovl
> Time (mean ± σ): 3.045 s ± 0.127 s [User: 0.198 s, System: 1.129 s]
> Range (min … max): 2.926 s … 3.338 s 10 runs
>
> overlayfs + loopback erofs - cached
> Benchmark 1: ls -lR mnt-ovl
> Time (mean ± σ): 516.9 ms ± 5.7 ms [User: 139.4 ms, System: 374.0 ms]
> Range (min … max): 503.6 ms … 521.9 ms 10 runs
>
> overlayfs + direct - uncached
> Benchmark 1: ls -lR mnt-ovl
> Time (mean ± σ): 2.562 s ± 0.028 s [User: 0.199 s, System: 1.129 s]
> Range (min … max): 2.497 s … 2.585 s 10 runs
>
> overlayfs + direct - cached
> Benchmark 1: ls -lR mnt-ovl
> Time (mean ± σ): 524.5 ms ± 1.6 ms [User: 148.7 ms, System: 372.2 ms]
> Range (min … max): 522.8 ms … 527.8 ms 10 runs
>
> composefs - uncached
> Benchmark 1: ls -lR mnt-fs
> Time (mean ± σ): 681.4 ms ± 14.1 ms [User: 154.4 ms, System: 369.9 ms]
> Range (min … max): 652.5 ms … 703.2 ms 10 runs
>
> composefs - cached
> Benchmark 1: ls -lR mnt-fs
> Time (mean ± σ): 390.8 ms ± 4.7 ms [User: 144.7 ms, System: 243.7 ms]
> Range (min … max): 382.8 ms … 399.1 ms 10 runs
>
> For the uncached case, composefs is still almost four times faster than
> the fastest overlay combo (squashfs), and the non-squashfs versions are
> strictly slower. For the cached case the difference is smaller (10%),
> but the relative ordering is the same.
>
> For size comparison, here are the resulting images:
>
> 8.6M large.composefs
> 2.5G large.erofs
> 200M large.ext4
> 2.6M large.squashfs
>

Nice.
Clearly, mkfs.ext4 and mkfs.erofs are not optimized for space.
Note that Android has make_ext4fs which can create a compact
ro ext4 image without a journal.
I found this project that builds it outside of Android, but did not test it:
https://github.com/iglunix/make_ext4fs
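
Even with stock mke2fs you can get part of the way there - a journal-less,
directory-populated image with no reserved blocks. A rough sketch (the size
and paths are only an illustration, and this is of course not the same tool
as make_ext4fs):

  truncate -s 200M large.ext4
  mkfs.ext4 -O ^has_journal -m 0 -d rootfs/ large.ext4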

> > > # hyperfine -i -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR ovl-
> > > mount"
> > > Benchmark 1: ls -lR ovl-mount
> > > Time (mean ± σ): 2.738 s ± 0.029 s [User: 0.176 s,
> > > System: 1.688 s]
> > > Range (min … max): 2.699 s … 2.787 s 10 runs
> > >
> > > With page cache between runs the difference is smaller, but still
> > > there:
> >
> > It is the dentry cache that mostly matters for this test, and please
> > use hyperfine -w 1 to warm up the dentry cache for a correct
> > measurement of warm-cache lookups.
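> >
> > E.g.:
> >
> >   # hyperfine -w 1 "ls -lR ovl-mount"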
>
> I'm not sure why the dentry cache case would be more important?
> Starting a new container will very often not have cached the image.
>
> To me the interesting case is for a new image, but with some existing
> page cache for the backing files directory. That seems to model starting
> a new image on an active container host, but it's somewhat hard to test
> that case.
>

Ok, you can argue that a faster cold-cache ls -lR is important
for starting new images.
I think you will be asked to show a real-life container use case where
that benchmark really matters.

> > I guess these test runs started with a warm cache? It wasn't
> > mentioned explicitly.
>
> Yes, they were warm (because I ran the previous test before it). But
> the new profile script explicitly adds -w 1.
>
> > > # hyperfine "ls -lR cfs-mnt"
> > > Benchmark 1: ls -lR cfs-mnt
> > > Time (mean ± σ): 390.1 ms ± 3.7 ms [User: 140.9 ms,
> > > System: 247.1 ms]
> > > Range (min … max): 381.5 ms … 393.9 ms 10 runs
> > >
> > > vs
> > >
> > > # hyperfine -i "ls -lR ovl-mount"
> > > Benchmark 1: ls -lR ovl-mount
> > > Time (mean ± σ): 431.5 ms ± 1.2 ms [User: 124.3 ms,
> > > System: 296.9 ms]
> > > Range (min … max): 429.4 ms … 433.3 ms 10 runs
> > >
> > > This isn't all that strange, as overlayfs does a lot more work for
> > > each lookup, including multiple name lookups as well as several
> > > xattr
> > > lookups, whereas composefs just does a single lookup in a pre-
> > > computed
> >
> > Seriously, "multiple name lookups"?
> > Overlayfs does exactly one lookup for anything but first-level
> > subdirs, and for sparse files it does the exact same lookup in
> > /objects as composefs.
> > Enough with the hand-waving, please. Stick to hard facts.
>
> With the discussed layout, in a stat() call on a regular file,
> ovl_lookup() will do lookups on both the sparse file and the backing
> file, whereas cfs_dir_lookup() will just map some page cache pages and
> do a binary search.
>
> Of course if you actually open the file, then cfs_open_file() would do
> the equivalent lookups in /objects. But that is often not what happens,
> for example in "ls -l".
>
> Additionally, these extra lookups will cause extra memory use, as you
> need dentries and inodes for the erofs/squashfs layer in addition to
> the overlay inodes.
>

I see. Composefs is really very optimized for ls -lR.
Now we only need to figure out whether real users starting a container and
doing ls -lR without reading many files is a real-life use case.

> > > table. But, given that we don't need any of the other features of
> > > overlayfs here, this performance loss seems rather unnecessary.
> > >
> > > I understand that there is a cost to adding more code, but
> > > efficiently
> > > supporting containers and other forms of read-only images is a
> > > pretty
> > > important use case for Linux these days, and having something
> > > tailored
> > > for that seems pretty useful to me, even considering the code
> > > duplication.
> > >
> > >
> > >
> > > I also understand Christian's worry about stacking filesystems,
> > > having looked a bit more at the overlayfs code. But, since composefs
> > > doesn't really expose the metadata or vfs structure of the lower
> > > directories, it is much simpler in a fundamental way.
> > >
> >
> > I agree that composefs is simpler than overlayfs and that its
> > security
> > model is simpler, but this is not the relevant question.
> > The question is what benefits composefs offers its prospective users
> > that justify this new filesystem driver if overlayfs already implements
> > the needed functionality.
> >
> > The only valid technical argument I could gather from your email is a
> > 10% performance improvement in warm-cache ls -lR on a 2.6 GB
> > centos9 rootfs image compared to overlayfs+squashfs.
> >
> > I am not counting the cold-cache results until we see results from
> > a modern ro-image fs.
>
> They are all strictly worse than squashfs in the above testing.
>

It would be interesting to know why, and whether an optimized mkfs.erofs
or mkfs.ext4 would have made any difference.

> > Considering that most real-life workloads include reading the data
> > and that most of the time inodes and dentries are cached, IMO,
> > the 10% ls -lR improvement is not a good enough reason
> > for a new "laser focused" filesystem driver.
> >
> > Correct me if I am wrong, but doesn't the use case of ephemeral
> > containers require that composefs be layered under a writable tmpfs
> > using overlayfs?
> >
> > If that is the case, then the warm-cache comparison is incorrect
> > as well. To argue for the new filesystem you will need to compare
> > ls -lR of overlay{tmpfs,composefs,xfs} vs. overlay{tmpfs,erofs,xfs}.
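> >
> > I.e., something like this stacking (only a sketch - exact composefs
> > mount syntax aside, and the paths are just examples):
> >
> >   mount -t composefs image.cfs -o basedir=objects,digest=$digest cfs
> >   mount -t tmpfs tmpfs rw
> >   mkdir -p rw/upper rw/work
> >   mount -t overlay overlay -o \
> >      lowerdir=cfs,upperdir=rw/upper,workdir=rw/work merged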
>
> That very much depends. For the ostree rootfs use case there would be no
> writable layer, and for containers I'm personally primarily interested
> in "--readonly" containers (i.e. without a writable layer) in my
> current automobile/embedded work. For many container cases, however,
> that is true, and no doubt that would make the overhead of overlayfs
> less of an issue.
>
> > Alexander,
> >
> > On a more personal note, I know this discussion has been a bit
> > stormy, but I am not trying to fight you.
>
> Overall, I'm not getting a warm fuzzy feeling from this discussion.
> I'm getting weird complaints that I'm somehow "stealing" functions, or
> weird "who did $foo first" arguments, for instance. You haven't personally
> attacked me like that, but some of your comments can feel rather
> pointy, especially in the context of a stormy thread like this. I'm
> just not used to kernel development workflows, so have patience with me
> if I do things wrong.
>

Fair enough.
As long as the things that we discussed are duly
mentioned in future posts, I'll do my best to be less pointy.

> > I think that {mk,}composefs is a wonderful thing that will improve
> > the life of many users.
> > But mount -t composefs vs. mount -t overlayfs is insignificant
> > to those users, so we just need to figure out, based on facts
> > and numbers, which is the best technical alternative.
>
> In reality things are never as easy as one thing strictly being
> technically best. There is always a multitude of considerations. Is
> composefs technically better if it uses less memory and performs better
> for a particular use case? Or is overlayfs technically better because it
> is useful for more use cases and already exists? A judgement needs to be
> made depending on things like the complexity/maintainability of the new
> fs, ease of use, measured performance differences, the relative
> importance of particular performance measurements, and the importance of
> the specific use case.
>
> It is my belief that the advantages of composefs outweigh the cost of
> the code duplication, but I understand the point of view of a
> maintainer of an existing codebase and that saying "no" is often the
> right thing. I will continue to try to argue for my point of view, but
> will try to make it as factual as possible.
>

Improving overlayfs and erofs has additional advantages -
improving the performance and size of erofs images may benefit
many other users regardless of the ephemeral-container
use case, so indeed, there are many aspects to consider.

Thanks,
Amir.