Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem

From: Alexander Larsson
Date: Tue Jan 24 2023 - 08:14:56 EST


On Tue, 2023-01-24 at 05:24 +0200, Amir Goldstein wrote:
> On Mon, Jan 23, 2023 at 7:56 PM Alexander Larsson <alexl@xxxxxxxxxx>
> wrote:
> >
> > On Fri, 2023-01-20 at 21:44 +0200, Amir Goldstein wrote:
> > > On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson
> > > <alexl@xxxxxxxxxx>
> > > wrote:
> > > >
> > > > Giuseppe Scrivano and I have recently been working on a new
> > > > project
> > > > we
> > > > call composefs. This is the first time we propose this
> > > > publicly
> > > > and
> > > > we would like some feedback on it.
> > > >
> > >
> > > Hi Alexander,
> > >
> > > I must say that I am a little bit puzzled by this v3.
> > > Gao, Christian and myself asked you questions on v2
> > > that are not mentioned in v3 at all.
> >
> > I got lots of good feedback from Dave Chinner on V2 that caused
> > rather
> > large changes to simplify the format. So I wanted the new version
> > with
> > those changes out to continue that review. I think also having that
> > simplified version will be helpful for the general discussion.
> >
>
> That's ok.
> I was not puzzled about why you posted v3.
> I was puzzled by why you did not mention anything about the
> alternatives to adding a new filesystem that were discussed on
> v2 and argue in favor of the new filesystem option.
> If you post another version, please make sure to include a good
> explanation for that.

Sure, I will add something to the next version. That said, this was
already discussed at length, and the v2->v3 changes are unrelated to
it, so I don't think repeating that discussion in the v3 announcement
would have made much of a difference.

> > > To sum it up, please do not propose composefs without explaining
> > > what are the barriers for achieving the exact same outcome with
> > > the use of a read-only overlayfs with two lower layer -
> > > uppermost with erofs containing the metadata files, which include
> > > trusted.overlay.metacopy and trusted.overlay.redirect xattrs that
> > > refer to the lowermost layer containing the content files.
> >
> > So, to be more precise, and so that everyone is on the same page,
> > lemme
> > state the two options in full.
> >
> > For both options, we have a directory "objects" with content-
> > addressed
> > backing files (i.e. files named by sha256). In this directory all
> > files have fs-verity enabled. Additionally there is an image file
> > which you downloaded to the system that somehow references the
> > objects
> > directory by relative filenames.
> >
> > Composefs option:
> >
> >  The image file has fs-verity enabled. To use the image, you mount
> > it
> >  with options "basedir=objects,digest=$imagedigest".
> >
> > Overlayfs option:
> >
> >  The image file is a loopback image of a gpt disk with two
> > partitions,
> >  one partition contains the dm-verity hashes, and the other
> > contains
> >  some read-only filesystem.
> >
> >  The read-only filesystem has regular versions of directories and
> >  symlinks, but for regular files it has sparse files with the
> > xattrs
> >  "trusted.overlay.metacopy" and "trusted.overlay.redirect" set, the
> >  latter containing a string like "/de/adbeef..." referencing a
> >  backing file in the "objects" directory. In addition, the image
> > also
> >  contains overlayfs whiteouts to cover any toplevel filenames from
> > the
> >  objects directory that would otherwise appear if objects is used
> > as
> >  a lower dir.
> >
> >  To use this you loopback mount the file, and use dm-verity to set
> > up
> >  the combined partitions, which you then mount somewhere. Then you
> >  mount an overlayfs with options:
> >   "metacopy=on,redirect_dir=follow,lowerdir=veritydev:objects"
> >
> > I would say both versions of this can work. There are some minor
> > technical issues with the overlay option:
> >
> > * To get actual verification of the backing files you would need to
> > add support to overlayfs for an "trusted.overlay.digest" xattrs,
> > with
> > behaviour similar to composefs.
> >
> > * mkfs.erofs doesn't support sparse files (not sure if the kernel
> > code
> > does), which means it is not a good option for backing all
> > these
> > sparse files. Squashfs seems to support this though, so that is an
> > option.
> >
>
> Fair enough.
> Wasn't expecting for things to work without any changes.
> Let's first agree that these alone are not a good enough reason to
> introduce a new filesystem.
> Let's move on..

Yeah.

> > However, the main issue I have with the overlayfs approach is that
> > it
> > is sort of clumsy and over-complex. Basically, the composefs
> > approach
> > is laser focused on read-only images, whereas the overlayfs
> > approach
> > just chains together technologies that happen to work, but also do
> > a
> > lot of other stuff. The result is that it is more work to use it,
> > it
> > uses more kernel objects (mounts, dm devices, loopbacks) and it has
>
> Up to this point, it's just hand waving, and a bit annoying if I am
> being honest.
> overlayfs+metacopy feature were created for the containers use case
> for very similar set of requirements - they do not just "happen to
> work"
> for the same use case.
> Please stick to technical arguments when arguing in favor of the new
> "laser focused" filesystem option.
>
> > worse performance.
> >
> > To measure performance I created a largish image (2.6 GB centos9
> > rootfs) and mounted it via composefs, as well as overlay-over-
> > squashfs,
> > both backed by the same objects directory (on xfs).
> >
> > If I clear all caches between each run, a `ls -lR` run on composefs
> > runs in around 700 msec:
> >
> > # hyperfine -i -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR cfs-
> > mount"
> > Benchmark 1: ls -lR cfs-mount
> >   Time (mean ± σ):     701.0 ms ±  21.9 ms    [User: 153.6 ms,
> > System: 373.3 ms]
> >   Range (min … max):   662.3 ms … 725.3 ms    10 runs
> >
> > Whereas same with overlayfs takes almost four times as long:
>
> No it is not overlayfs, it is overlayfs+squashfs, please stick to
> facts.
> As Gao wrote, squashfs does not optimize directory lookup.
> You can run a test with ext4 for POC as Gao suggested.
> I am sure that mkfs.erofs sparse file support can be added if needed.

New measurements follow. They now also include erofs over loopback,
although that comparison isn't strictly fair, because that image is
much larger since it doesn't store the files sparsely. They also
include a version where the topmost lower layer is directly on the
backing xfs (i.e. not via loopback). I attached the scripts used to
create the images and do the profiling in case anyone wants to
reproduce.

Here are the results (on x86-64, xfs base fs):

overlayfs + loopback squashfs - uncached
Benchmark 1: ls -lR mnt-ovl
Time (mean ± σ): 2.483 s ± 0.029 s [User: 0.167 s, System: 1.656 s]
Range (min … max): 2.427 s … 2.530 s 10 runs

overlayfs + loopback squashfs - cached
Benchmark 1: ls -lR mnt-ovl
Time (mean ± σ): 429.2 ms ± 4.6 ms [User: 123.6 ms, System: 295.0 ms]
Range (min … max): 421.2 ms … 435.3 ms 10 runs

overlayfs + loopback ext4 - uncached
Benchmark 1: ls -lR mnt-ovl
Time (mean ± σ): 4.332 s ± 0.060 s [User: 0.204 s, System: 3.150 s]
Range (min … max): 4.261 s … 4.442 s 10 runs

overlayfs + loopback ext4 - cached
Benchmark 1: ls -lR mnt-ovl
Time (mean ± σ): 528.3 ms ± 4.0 ms [User: 143.4 ms, System: 381.2 ms]
Range (min … max): 521.1 ms … 536.4 ms 10 runs

overlayfs + loopback erofs - uncached
Benchmark 1: ls -lR mnt-ovl
Time (mean ± σ): 3.045 s ± 0.127 s [User: 0.198 s, System: 1.129 s]
Range (min … max): 2.926 s … 3.338 s 10 runs

overlayfs + loopback erofs - cached
Benchmark 1: ls -lR mnt-ovl
Time (mean ± σ): 516.9 ms ± 5.7 ms [User: 139.4 ms, System: 374.0 ms]
Range (min … max): 503.6 ms … 521.9 ms 10 runs

overlayfs + direct - uncached
Benchmark 1: ls -lR mnt-ovl
Time (mean ± σ): 2.562 s ± 0.028 s [User: 0.199 s, System: 1.129 s]
Range (min … max): 2.497 s … 2.585 s 10 runs

overlayfs + direct - cached
Benchmark 1: ls -lR mnt-ovl
Time (mean ± σ): 524.5 ms ± 1.6 ms [User: 148.7 ms, System: 372.2 ms]
Range (min … max): 522.8 ms … 527.8 ms 10 runs

composefs - uncached
Benchmark 1: ls -lR mnt-fs
Time (mean ± σ): 681.4 ms ± 14.1 ms [User: 154.4 ms, System: 369.9 ms]
Range (min … max): 652.5 ms … 703.2 ms 10 runs

composefs - cached
Benchmark 1: ls -lR mnt-fs
Time (mean ± σ): 390.8 ms ± 4.7 ms [User: 144.7 ms, System: 243.7 ms]
Range (min … max): 382.8 ms … 399.1 ms 10 runs

For the uncached case, composefs is still roughly 3.6 times faster
(681 ms vs 2.48 s) than the fastest overlay combo (squashfs), and the
non-squashfs versions are strictly slower. For the cached case the
difference is smaller (about 10% against squashfs), but composefs is
still the fastest and squashfs is still the fastest overlay combo.
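To make it easy to recheck those ratios, here is a small sketch that
recomputes them from the mean times in the results above. The
dictionary keys and the function name are just labels picked for
illustration; only the numbers come from the measurements.

```python
# Mean wall-clock times in seconds, copied from the hyperfine results above.
uncached = {
    "overlayfs + loopback squashfs": 2.483,
    "overlayfs + loopback ext4": 4.332,
    "overlayfs + loopback erofs": 3.045,
    "overlayfs + direct": 2.562,
    "composefs": 0.6814,
}
cached = {
    "overlayfs + loopback squashfs": 0.4292,
    "overlayfs + loopback ext4": 0.5283,
    "overlayfs + loopback erofs": 0.5169,
    "overlayfs + direct": 0.5245,
    "composefs": 0.3908,
}


def slowdown_vs_composefs(means):
    """Return each setup's mean time divided by the composefs mean time."""
    base = means["composefs"]
    return {name: t / base for name, t in means.items() if name != "composefs"}


for label, means in (("uncached", uncached), ("cached", cached)):
    ratios = slowdown_vs_composefs(means)
    # Print the fastest overlay combination first.
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
        print(f"{label:8s} {name:32s} {ratio:.2f}x composefs")
```

Note that this only compares means; the run-to-run spread reported by
hyperfine is not taken into account.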

For size comparison, here are the resulting images:

8.6M large.composefs
2.5G large.erofs
200M large.ext4
2.6M large.squashfs

> > # hyperfine -i -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR ovl-
> > mount"
> > Benchmark 1: ls -lR ovl-mount
> >   Time (mean ± σ):      2.738 s ±  0.029 s    [User: 0.176 s,
> > System: 1.688 s]
> >   Range (min … max):    2.699 s …  2.787 s    10 runs
> >
> > With page cache between runs the difference is smaller, but still
> > there:
>
> It is the dentry cache that mostly matters for this test and please
> use hyperfine -w 1 to warm up the dentry cache for correct measurement
> of warm cache lookup.

I'm not sure why the dentry cache case would be more important?
Starting a new container will very often not have cached the image.

To me the interesting case is a new image, but with some existing
page cache for the backing files directory. That seems to model
starting a new image on an active container host, but it's somewhat
hard to test that case.

> I guess these test runs started with warm cache? but it wasn't
> mentioned explicitly.

Yes, they were warm (because I ran the previous test before it). But,
the new profile script explicitly adds -w 1.
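For readers who don't want to open the attachment, the two kinds of
runs can be sketched roughly as below. This is not the attached
profile.sh, only a minimal illustration: the helper name
hyperfine_cmd is made up, and keeping -i on both runs is an
assumption. The hyperfine options used are -i (ignore failures),
-w (warmup runs) and -p (command to run before each timing run).

```python
import shlex

# Command run before each uncached timing run to drop the kernel caches.
DROP_CACHES = "echo 3 > /proc/sys/vm/drop_caches"


def hyperfine_cmd(mountpoint, cached):
    """Build a hyperfine argument list for one `ls -lR` benchmark (hypothetical helper)."""
    cmd = ["hyperfine", "-i"]
    if cached:
        # One warmup run so the timed runs start with warm caches.
        cmd += ["-w", "1"]
    else:
        # Drop caches before every timed run.
        cmd += ["-p", DROP_CACHES]
    cmd.append(f"ls -lR {shlex.quote(mountpoint)}")
    return cmd


for cached in (False, True):
    # Print the shell command line instead of running it; dropping
    # caches needs root.
    print(shlex.join(hyperfine_cmd("mnt-fs", cached)))
```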

> > # hyperfine "ls -lR cfs-mnt"
> > Benchmark 1: ls -lR cfs-mnt
> >   Time (mean ± σ):     390.1 ms ±   3.7 ms    [User: 140.9 ms,
> > System: 247.1 ms]
> >   Range (min … max):   381.5 ms … 393.9 ms    10 runs
> >
> > vs
> >
> > # hyperfine -i "ls -lR ovl-mount"
> > Benchmark 1: ls -lR ovl-mount
> >   Time (mean ± σ):     431.5 ms ±   1.2 ms    [User: 124.3 ms,
> > System: 296.9 ms]
> >   Range (min … max):   429.4 ms … 433.3 ms    10 runs
> >
> > This isn't all that strange, as overlayfs does a lot more work for
> > each lookup, including multiple name lookups as well as several
> > xattr
> > lookups, whereas composefs just does a single lookup in a pre-
> > computed
>
> Seriously, "multiple name lookups"?
> Overlayfs does exactly one lookup for anything but first level
> subdirs
> and for sparse files it does the exact same lookup in /objects as
> composefs.
> Enough with the hand waving please. Stick to hard facts.

With the discussed layout, in a stat() call on a regular file,
ovl_lookup() will do lookups on both the sparse file and the backing
file, whereas cfs_dir_lookup() will just map some page cache pages and
do a binary search.

Of course if you actually open the file, then cfs_open_file() would do
the equivalent lookups in /objects. But that is often not what happens,
for example in "ls -l".

Additionally, these extra lookups will cause extra memory use, as you
need dentries and inodes for the erofs/squashfs inodes in addition to
the overlay inodes.
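To illustrate the difference in the stat() path described above, here
is a toy model. It is only a sketch under heavy assumptions: plain
Python lists and dicts stand in for the on-disk and in-kernel
structures, all names (sorted_dir_lookup, stacked_stat, the
"/00/example" path) are made up, and none of it mirrors the actual
ovl_lookup() or cfs_dir_lookup() code.

```python
# Toy model only: plain Python lists and dicts standing in for on-disk
# structures. Nothing here mirrors the real kernel code.

def sorted_dir_lookup(entries, name):
    """Binary search in a directory kept as a name-sorted list of (name, inode)."""
    lo, hi = 0, len(entries)
    while lo < hi:
        mid = (lo + hi) // 2
        if entries[mid][0] < name:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(entries) and entries[lo][0] == name:
        return entries[lo][1]
    return None


def stacked_stat(metadata_layer, objects_layer, name, stats):
    """Stacked lookup: find the sparse file, then follow its redirect."""
    stats["lookups"] += 1  # lookup of the sparse file in the metadata layer
    sparse = metadata_layer.get(name)
    if sparse is None:
        return None
    stats["lookups"] += 1  # lookup of the backing file in the objects layer
    backing = objects_layer.get(sparse["redirect"])
    return sparse["inode"] if backing is not None else None


directory = [("fstab", 2), ("hosts", 3), ("passwd", 4)]  # sorted by name
metadata_layer = {"hosts": {"inode": 3, "redirect": "/00/example"}}
objects_layer = {"/00/example": b"file contents"}
stats = {"lookups": 0}

print(sorted_dir_lookup(directory, "hosts"))
print(stacked_stat(metadata_layer, objects_layer, "hosts", stats), stats["lookups"])
```

In this model the stacked variant touches two layers for a single
stat(), while the sorted-table variant never looks at the objects
directory until the file is actually opened.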

> > table. But, given that we don't need any of the other features of
> > overlayfs here, this performance loss seems rather unnecessary.
> >
> > I understand that there is a cost to adding more code, but
> > efficiently
> > supporting containers and other forms of read-only images is a
> > pretty
> > important usecase for Linux these days, and having something
> > tailored
> > for that seems pretty useful to me, even considering the code
> > duplication.
> >
> >
> >
> > I also understand Christian's worry about stacking filesystems, having
> > looked a bit more at the overlayfs code. But, since composefs
> > doesn't
> > really expose the metadata or vfs structure of the lower
> > directories it
> > is much simpler in a fundamental way.
> >
>
> I agree that composefs is simpler than overlayfs and that its
> security
> model is simpler, but this is not the relevant question.
> The question is what are the benefits to the prospect users of
> composefs
> that justify this new filesystem driver if overlayfs already
> implements
> the needed functionality.
>
> The only valid technical argument I could gather from your email is -
> 10% performance improvement in warm cache ls -lR on a 2.6 GB
> centos9 rootfs image compared to overlayfs+squashfs.
>
> I am not counting the cold cache results until we see results of
> a modern ro-image fs.

They are all strictly worse than squashfs in the above testing.

> Considering that most real life workloads include reading the data
> and that most of the time inodes and dentries are cached, IMO,
> the 10% ls -lR improvement is not a good enough reason
> for a new "laser focused" filesystem driver.
>
> Correct me if I am wrong, but isn't the use case of ephemeral
> containers require that composefs is layered under a writable tmpfs
> using overlayfs?
>
> If that is the case then the warm cache comparison is incorrect
> as well. To argue for the new filesystem you will need to compare
> ls -lR of overlay{tmpfs,composefs,xfs} vs. overlay{tmpfs,erofs,xfs}

That very much depends. For the ostree rootfs usecase there would be no
writable layer, and for containers I'm personally primarily interested
in "--readonly" containers (i.e. without a writable layer) in my
current automobile/embedded work. For many container cases however,
that is true, and no doubt that would make the overhead of overlayfs
less of an issue.

> Alexander,
>
> On a more personal note, I know this discussion has been a bit
> stormy, but am not trying to fight you.

I'm overall not getting a warm fuzzy feeling from this discussion.
Getting weird complaints that I'm somehow "stealing" functions or weird
"who did $foo first" arguments for instance. You haven't personally
attacked me like that, but some of your comments can feel rather
pointy, especially in the context of a stormy thread like this. I'm
just not used to kernel development workflows, so have patience with me
if I do things wrong.

> I think that {mk,}composefs is a wonderful thing that will improve
> the life of many users.
> But mount -t composefs vs. mount -t overlayfs is insignificant
> to those users, so we just need to figure out based on facts
> and numbers, which is the best technical alternative.

In reality things are never as easy as one thing strictly being
technically best. There is always a multitude of considerations. Is
composefs technically better if it uses less memory and performs better
for a particular usecase? Or is overlayfs technically better because it
is useful for more usecases and already exists? A judgement needs to be
made depending on things like complexity/maintainability of the new fs,
ease of use, measured performance differences, relative importance of
particular performance measurements, and importance of the specific
usecase.

It is my belief that the advantages of composefs outweigh the cost of
the code duplication, but I understand the point of view of a
maintainer of an existing codebase and that saying "no" is often the
right thing. I will continue to try to argue for my point of view, but
will try to make it as factual as possible.

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
=-=-=
Alexander Larsson Red Hat,
Inc
alexl@xxxxxxxxxx alexander.larsson@xxxxxxxxx
He's a shy shark-wrestling librarian whom everyone believes is mad.
She's
an enchanted tempestuous stripper operating on the wrong side of the
law.
They fight crime!

Attachment: mkhack.sh
Description: application/shellscript

Attachment: profile.sh
Description: application/shellscript