Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem

From: Giuseppe Scrivano
Date: Sun Jan 22 2023 - 04:33:50 EST


Giuseppe Scrivano <gscrivan@xxxxxxxxxx> writes:

> Gao Xiang <hsiangkao@xxxxxxxxxxxxxxxxx> writes:
>
>> On 2023/1/22 06:34, Giuseppe Scrivano wrote:
>>> Gao Xiang <hsiangkao@xxxxxxxxxxxxxxxxx> writes:
>>>
>>>> On 2023/1/22 00:19, Giuseppe Scrivano wrote:
>>>>> Gao Xiang <hsiangkao@xxxxxxxxxxxxxxxxx> writes:
>>>>>
>>>>>> On 2023/1/21 06:18, Giuseppe Scrivano wrote:
>>>>>>> Hi Amir,
>>>>>>> Amir Goldstein <amir73il@xxxxxxxxx> writes:
>>>>>>>
>>>>>>>> On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <alexl@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> ...
>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Alexander,
>>>>>>>>
>>>>>>>> I must say that I am a little bit puzzled by this v3.
>>>>>>>> Gao, Christian and myself asked you questions on v2
>>>>>>>> that are not mentioned in v3 at all.
>>>>>>>>
>>>>>>>> To sum it up, please do not propose composefs without explaining
>>>>>>>> what are the barriers for achieving the exact same outcome with
>>>>>>>> the use of a read-only overlayfs with two lower layer -
>>>>>>>> uppermost with erofs containing the metadata files, which include
>>>>>>>> trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer
>>>>>>>> to the lowermost layer containing the content files.
>>>>>>> I think Dave explained quite well why using overlay is not comparable to
>>>>>>> what composefs does.
>>>>>>>
>>>>>>> One big difference is that overlay still requires at least a syscall for
>>>>>>> each file in the image, and then we need the equivalent of "rm -rf" to
>>>>>>> clean it up. It is somehow acceptable for long-running services, but it
>>>>>>> is not for "serverless" containers where images/containers are created
>>>>>>> and destroyed frequently. So even in the case we already have all the
>>>>>>> image files available locally, we still need to create a checkout with
>>>>>>> the final structure we need for the image.
>>>>>>>
>>>>>>> I also don't see how overlay would solve the verified image problem. We
>>>>>>> would have the same problem we have today with fs-verity as it can only
>>>>>>> validate a single file but not the entire directory structure. Changes
>>>>>>> that affect the layer containing the trusted.overlay.{metacopy,redirect}
>>>>>>> xattrs won't be noticed.
>>>>>>>
>>>>>>> There are at the moment two ways to handle container images, both somehow
>>>>>>> guided by the available file systems in the kernel.
>>>>>>>
>>>>>>> - A single image mounted as a block device.
>>>>>>> - A list of tarballs (OCI image) that are unpacked and mounted as
>>>>>>>   overlay layers.
>>>>>>>
>>>>>>> One big advantage of the block devices model is that you can use
>>>>>>> dm-verity, this is something we miss today with OCI container images
>>>>>>> that use overlay.
>>>>>>>
>>>>>>> What we are proposing with composefs is a way to have "dm-verity" style
>>>>>>> validation based on fs-verity and the possibility to share individual
>>>>>>> files instead of layers. These files can also be on different file
>>>>>>> systems, which is something not possible with the block device model.
>>>>>>
>>>>>> That is not a new idea honestly, including the chain of trust. Even the
>>>>>> out-of-tree incremental fs later used fs-verity for this as well, except
>>>>>> that it does so in a real self-contained way.
>>>>>>
>>>>>>> The composefs manifest blob could be generated remotely and signed. A
>>>>>>> client would need just to validate the signature for the manifest blob
>>>>>>> and from there retrieve the files that are not in the local CAS (even
>>>>>>> from an insecure source) and mount directly the manifest file.
>>>>>>
>>>>>>
>>>>>> Back to the topic, after some thought I have to add a
>>>>>> complement for reference.
>>>>>>
>>>>>> First, EROFS had the same internal discussion and decision
>>>>>> almost _two years ago_ (June 2021). The options were:
>>>>>>
>>>>>> a) Some internal people really suggested EROFS could develop
>>>>>>    an entirely new file-based in-kernel local cache subsystem
>>>>>>    (what you call a local CAS, whatever) with a stackable file
>>>>>>    interface so that the existing Nydus image service [1] (like
>>>>>>    ostree, and maybe ostree can use it as well) doesn't need to
>>>>>>    modify anything to use existing blobs;
>>>>>>
>>>>>> b) Reuse the existing fscache/cachefiles;
>>>>>>
>>>>>> The reasons why we (especially me) finally selected b):
>>>>>>
>>>>>> - see the discussion of Google's original Incremental
>>>>>>   FS topic [2] [3] in 2019, as Amir already mentioned. At
>>>>>>   that time all fs folks really preferred to reuse an existing
>>>>>>   subsystem for in-kernel caching rather than reinvent another
>>>>>>   in-kernel wheel for local cache.
>>>>>>
>>>>>>   [ Reinventing a new wheel is not hard (fs or caching), it just
>>>>>>   makes Linux more fragmented. Especially when a new filesystem
>>>>>>   is proposed just to generate images full of massive massive
>>>>>>   new magical symlinks with *overridden* uid/gid/permissions
>>>>>>   to replace regular files. ]
>>>>>>
>>>>>> - in-kernel cache implementations usually run into several common
>>>>>>   potential security issues; reusing an existing subsystem lets
>>>>>>   all fses address them and benefit from it.
>>>>>>
>>>>>> - Usually an existing widely-used userspace implementation is
>>>>>>   never an excuse for a new in-kernel feature.
>>>>>>
>>>>>> David Howells has been quite busy these months developing the
>>>>>> new netfs interface; otherwise (we think) we should already
>>>>>> support failover, multiple daemons/dirs, daemonless and more.
>>>>> we have not added any new cache system. overlay does "layer
>>>>> deduplication" and in a similar way composefs does "file deduplication".
>>>>> That is not a built-in feature, it is just a side effect of how things
>>>>> are packed together.
>>>>>
>>>>> Using fscache seems like a good idea and it has many advantages but it
>>>>> is a centralized cache mechanism and it looks like a potential problem
>>>>> when you think about allowing mounts from a user namespace.
>>>>
>>>> I think Christian [1] had the same feeling as I did at that time:
>>>>
>>>> "I'm pretty skeptical of this plan whether we should add more filesystems
>>>> that are mountable by unprivileged users. FUSE and Overlayfs are
>>>> adventurous enough and they don't have their own on-disk format. The
>>>> track record of bugs exploitable due to userns isn't making this
>>>> very attractive."
>>>>
>>>> Yes, you could add fs-verity, but EROFS could add fs-verity (or just use
>>>> dm-verity) as well, but it doesn't change _anything_ about concerns of
>>>> "allowing mounts from a user namespace".
>>> I've mentioned that as a potential feature we could add in future, given
>>> the simplicity of the format and that it uses a CAS for its data instead
>>> of fscache. Each user can have and use their own store to mount the
>>> images.
>>>
>>> At this point it is just a wish from userspace, as it would improve a
>>> few real use cases we have.
>>>
>>> Having the possibility to run containers without root privileges is a
>>> big deal for many users, look at Flatpak apps for example, or rootless
>>> Podman. Mounting and validating images would be a big security
>>> improvement. It is something that is not possible at the moment as
>>> fs-verity doesn't cover the directory structure and dm-verity seems out
>>> of reach from a user namespace.
>>>
>>> Composefs delegates the entire logic of dealing with files to the
>>> underlying file system in a similar way to overlay.
>>>
>>> Forging the inode metadata from a user namespace mount doesn't look
>>> like an insurmountable problem either, since it is already possible
>>> with a FUSE filesystem.
>>>
>>> So the proposal/wish here is to have a very simple format, that at some
>>> point could be considered safe to mount from a user namespace, in
>>> addition to overlay and FUSE.
>>
>> My response is quite similar to
>> https://lore.kernel.org/r/CAJfpeguyajzHwhae=4PWLF4CUBorwFWeybO-xX6UBD2Ekg81fg@xxxxxxxxxxxxxx/
>
> I don't see how that applies to what I said about unprivileged mounts,
> except the part about lazy download where I agree with Miklos that it
> should be handled through FUSE, and that is something possible with
> composefs:
>
> mount -t composefs composefs -obasedir=/path/to/store:/mnt/fuse /mnt/cfs
>
> where /mnt/fuse is handled by a FUSE file system that takes care of
> loading the files from the remote server, and possibly writes them to
> /path/to/store once they are completed.
>
> So each user could have their "lazy download" without interfering with
> other users or the centralized cache.
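
To make the lookup order in the mount line above concrete, here is a
minimal userspace sketch in Python. It is not the kernel code; it only
assumes that the directories given in basedir= are tried in order, that
the first match wins, and that a path resolving outside its base
directory is refused. resolve_object() and demo() are names made up for
this illustration.

```python
import os
import tempfile


def resolve_object(basedirs, relpath):
    """Return the first regular file found for relpath under basedirs.

    Assumptions (not taken from the composefs sources): directories are
    tried in the order given, the first match wins, and any path that
    resolves outside its base directory is skipped.
    """
    for base in basedirs:
        base_real = os.path.realpath(base)
        candidate = os.path.realpath(os.path.join(base_real, relpath))
        # Refuse anything that escapes the base directory (e.g. "../x").
        if os.path.commonpath([base_real, candidate]) != base_real:
            continue
        if os.path.isfile(candidate):
            return candidate
    return None


def demo():
    with tempfile.TemporaryDirectory() as root:
        root = os.path.realpath(root)
        store = os.path.join(root, "store")  # stands in for /path/to/store
        fuse = os.path.join(root, "fuse")    # stands in for /mnt/fuse
        for d in (store, fuse):
            os.makedirs(os.path.join(d, "ab"))
        # The object exists only in the fallback directory.
        with open(os.path.join(fuse, "ab", "cdef"), "w") as f:
            f.write("payload")
        # A file outside both base directories must never be returned.
        with open(os.path.join(root, "secret"), "w") as f:
            f.write("outside")
        hit = resolve_object([store, fuse], "ab/cdef")
        escaped = resolve_object([store, fuse], "../secret")
        return os.path.relpath(hit, root), escaped
```

In demo() the first directory plays the role of /path/to/store and the
second one the role of /mnt/fuse, so an object missing from the local
store is found through the fallback, while "../secret" is not reachable.
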
>
>>>
>>>>> As you know, since I've contacted you, I've looked at EROFS in the past
>>>>> and tried to get our use cases to work with it before thinking about
>>>>> submitting composefs upstream.
>>>>>
>>>>> From what I could see EROFS and composefs use two different approaches
>>>>> to solve a similar problem, but it is not possible to do exactly with
>>>>> EROFS what we are trying to do. To oversimplify it: I see EROFS as a
>>>>> block device that uses fscache, and composefs as an overlay for files
>>>>> instead of directories.
>>>>
>>>> I don't think so honestly. The EROFS "multiple device" feature is
>>>> actually a "multiple blobs" feature, if you really think "device"
>>>> means block device.
>>>>
>>>> Primary device -- primary blob -- "composefs manifest blob"
>>>> Blob device -- data blobs -- "composefs backing files"
>>>>
>>>> any difference?
>>> I wouldn't expect any substantial difference between two RO file
>>> systems.
>>>
>>> Please correct me if I am wrong: EROFS uses 16 bits for the blob device
>>> ID, so if we map each file to a single blob device we are kind of
>>> limited on how many files we can have.
>>
>> I was just illustrating the "composefs manifest file" concept here, rather
>> than the device ID.
>>
>>> Sure this is just an artificial limit and can be bumped in a future
>>> version but the major difference remains: EROFS uses the blob device
>>> through fscache while the composefs files are looked up in the specified
>>> repositories.
>>
>> No, fscache can also open any cookie when opening a file. Again, even with
>> fscache, EROFS doesn't need to modify _any_ on-disk format to:
>>
>> - record a "cookie id" for such a special "magical symlink" with a similar
>>   symlink on-disk format (or whatever on-disk format with data, just with
>>   a new on-disk flag);
>>
>> - open such a "cookie id" on demand when opening such an EROFS file, just
>>   as any other network fs does. I don't think blob device is limited here.
>>
>> some difference now?
>
> recording the "cookie id" is done by a singleton userspace daemon that
> controls the cachefiles device and requires one operation for each file
> before the image can be mounted.
>
> Is that the case or did I misunderstand something?
>
>>>
>>>>> Sure composefs is quite simple and you could embed the composefs
>>>>> features in EROFS and let EROFS behave as composefs when provided a
>>>>> similar manifest file. But how is that any better than having a
>>>>
>>>> EROFS has had such a feature since v5.16; we call it the primary device,
>>>> or in Nydus terms --- "bootstrap file".
>>>>
>>>>> separate implementation that does just one thing well instead of merging
>>>>> different paradigms together?
>>>>
>>>> It's on-disk compatible with an existing fs (people can deploy the same
>>>> image to wider scenarios), or you could modify/enhance any in-kernel local
>>>> fs to do so like I already suggested, such as enhancing "fs/romfs" and
>>>> making it maintained again due to this magic symlink feature
>>>>
>>>> (because composefs doesn't have other on-disk requirements other than
>>>> a symlink path and a SHA256 verity digest from its original
>>>> requirement. Any local fs can be enhanced like this.)
>>>>
>>>>>
>>>>>> I know that you guys repeatedly say it's a self-contained
>>>>>> stackable fs and has little code (the same words the Incfs
>>>>>> folks [3] already said four years ago); four reasons make that
>>>>>> argument weak IMHO:
>>>>>>
>>>>>> - I think core EROFS is about 2~3 kLOC as well if
>>>>>>   compression, sysfs and fscache are all code-truncated.
>>>>>>
>>>>>>   Also, it's always welcome that all people could submit
>>>>>>   patches for cleaning up. I always do such cleanups
>>>>>>   from time to time and make it better.
>>>>>>
>>>>>> - "Few code lines" is somewhat weak because people do
>>>>>>   develop new features and layouts after upstreaming.
>>>>>>
>>>>>>   Such a claim is usually _NOT_ true in the future if you
>>>>>>   guys do more to optimize performance, add a new layout or
>>>>>>   even do your own lazy pulling with your local CAS codebase,
>>>>>>   unless you *promise* to dump the code once and do bugfixes
>>>>>>   only, like Christian said [4].
>>>>>>
>>>>>>   From LWN.net comments, I do see the opposite
>>>>>>   possibility that you'd like to develop new features
>>>>>>   later.
>>>>>>
>>>>>> - In the past, all in-tree kernel filesystems were
>>>>>>   designed and implemented without some user-space
>>>>>>   specific indication, including Nydus and ostree (I did
>>>>>>   see a lot of discussion between folks before in ociv2
>>>>>>   brainstorm [5]).
>>>>> Since you are mentioning OCI:
>>>>>
>>>>> Potentially composefs can be the file system that enables something very
>>>>> close to "ociv2", but it won't need to be called v2 since it is
>>>>> completely compatible with the current OCI image format.
>>>>>
>>>>> It won't require a different image format, just a seekable tarball that
>>>>> is compatible with old "v1" clients and we need to provide the composefs
>>>>> manifest file.
>>>>
>>>> May I ask whether you really looked into what Nydus + EROFS already did
>>>> (as you mentioned we discussed before)?
>>>>
>>>> Your "composefs manifest file" is exactly "Nydus bootstrap file", see:
>>>> https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-design.md
>>>>
>>>> "Rafs is a filesystem image containing a separated metadata blob and
>>>> several data-deduplicated content-addressable data blobs. In a typical
>>>> rafs filesystem, the metadata is stored in bootstrap while the data
>>>> is stored in blobfile.
>>>> ...
>>>>
>>>> bootstrap: The metadata is a merkle tree (I think that is typo, should be
>>>> filesystem tree) whose nodes represents a regular filesystem's
>>>> directory/file a leaf node refers to a file and contains hash value of
>>>> its file data.
>>>> Root node and internal nodes refer to directories and contain the
>>>> hash value of their children nodes."
>>>>
>>>> Nydus already supports the "It won't require a different image format, just
>>>> a seekable tarball that is compatible with old "v1" clients and we need to
>>>> provide the composefs manifest file." feature in v2.2, which will be
>>>> released later.
>>> Nydus is not using a tarball compatible with OCI v1.
>>>
>>> It defines a media type "application/vnd.oci.image.layer.nydus.blob.v1",
>>> which means it is not compatible with existing clients that don't know
>>> about it and you need special handling for that.
>>
>> I am not sure what you're saying: "media type" is quite off topic here.
>>
>> If you say "mkcomposefs" is done on the server side, what is the media
>> type of such manifest files?
>>
>> And why can't Nydus do it the same way?
>> https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-zran.md
>>
>
> I am not talking about the manifest or the bootstrap file, I am talking
> about the data blobs.
>
>>> Anyway, let's not bother LKML folks with these userspace details. It
>>> has no relevance to the kernel and what file systems do.
>>
>> I'd like to avoid that; I didn't say anything about userspace details, I
>> just would like to say
>> "merged filesystem tree is also _not_ a new idea of composefs"
>> not "media type", etc.
>>
>>>
>>>>> The seekable tarball allows individual files to be retrieved. OCI
>>>>> clients will not need to pull the entire tarball, but only the individual
>>>>> files that are not already present in the local CAS. They also won't need
>>>>> to create the overlay layout at all, as we do today, since it is already
>>>>> described with the composefs manifest file.
>>>>>
>>>>> The manifest is portable on different machines with different
>>>>> configurations, as you can use multiple CAS when mounting composefs.
>>>>>
>>>>> Some users might have a local CAS, some others could have a secondary
>>>>> CAS on a network file system and composefs supports all these
>>>>> configurations with the same signed manifest file.
>>>>>
>>>>>> That is why EROFS selected the existing in-kernel fscache and
>>>>>> made userspace Nydus adapt to it:
>>>>>>
>>>>>> even the (here called) manifest on-disk format ---
>>>>>> EROFS calls it primary device ---
>>>>>> they call it Nydus bootstrap;
>>>>>>
>>>>>> I'm not sure why it becomes impossible for ... ($$$$).
>>>>> I am not sure what you mean, care to elaborate?
>>>>
>>>> I just meant these concepts are actually the same concept with
>>>> different names and:
>>>> Nydus is a 2020 stuff;
>>> CRFS[1] is 2019 stuff.
>>
>> Does CRFS have anything similar to a merged filesystem tree?
>>
>> Here we talked about local CAS:
>> I have no idea whether CRFS has anything similar to it.
>
> yes it does and it uses it with a FUSE file system. So neither
> composefs nor EROFS have invented anything here.
>
> Anyway, does it really matter who made what first? I don't see how it
> helps to understand if there are relevant differences in composefs to
> justify its presence in the kernel.
>
>>>
>>>> EROFS + primary device is a 2021-mid stuff.
>>>>
>>>>>> In addition, if fscache is used, it can also use
>>>>>> fsverity_get_digest() to enable fsverity for non-on-demand
>>>>>> files.
>>>>>>
>>>>>> But again I think even Google's folks think that is
>>>>>> (somewhat) broken so that they added fs-verity to its incFS
>>>>>> in a self-contained way in Feb 2021 [6].
>>>>>>
>>>>>> Finally, again, I do hope a LSF/MM discussion for this new
>>>>>> overlay model (full of massive magical symlinks to override
>>>>>> permission.)
>>>>> you keep pointing it out but nobody is overriding any permission. The
>>>>> "symlinks" as you call them are just a way to refer to the payload files
>>>>> so they can be shared among different mounts. It is the same idea used
>>>>> by "overlay metacopy" and nobody is complaining about it being a
>>>>> security issue (because it is not).
>>>>
>>>> See how the overlay documentation clearly describes such metacopy behavior:
>>>> https://docs.kernel.org/filesystems/overlayfs.html
>>>>
>>>> "
>>>> Do not use metacopy=on with untrusted upper/lower directories.
>>>> Otherwise it is possible that an attacker can create a handcrafted file
>>>> with appropriate REDIRECT and METACOPY xattrs, and gain access to file
>>>> on lower pointed by REDIRECT. This should not be possible on local
>>>> system as setting “trusted.” xattrs will require CAP_SYS_ADMIN. But
>>>> it should be possible for untrusted layers like from a pen drive.
>>>> "
>>>>
>>>> Do we really need such behavior working on another fs especially with
>>>> on-disk format? At least Christian said,
>>>> "FUSE and Overlayfs are adventurous enough and they don't have their
>>>> own on-disk format."
>>> If users want to do something really weird then they can always find a
>>> way but the composefs lookup is limited under the directories specified
>>> at mount time, so it is not possible to access any file outside the
>>> repository.
>>>
>>>>> The files in the CAS are owned by the user that creates the mount, so
>>>>> there is no need to circumvent any permission check to access them.
>>>>>
>>>>> We use fs-verity for these files to make sure they are not modified by a
>>>>> malicious user that could get access to them (e.g. a container breakout).
>>>>
>>>> fs-verity is not always enforcing and it's broken here if fsverity is not
>>>> supported in the underlying fses, that is another point I would argue.
>>> It is a trade-off. It is up to the user to pick a configuration that
>>> allows using fs-verity if they care about this feature.
>>
>> I don't think fsverity is optional with your plan.
>
> yes it is optional. without fs-verity it would behave the same as today
> with overlay mounts without any fs-verity.
>
> How does validation work in EROFS for files served from fscache and that
> are on a remote file system?

Never mind my last question, I guess it would still go through the block
device in EROFS.

This is clearly a point in favor of a block device approach that a
stacking file system like overlay or composefs cannot achieve without
support from the underlying file system.

>
>> I wrote this all because it seems I didn't mention the original motivation
>> to use fscache in v2: kernel already has such in-kernel local cache, and
>> people liked to use it in 2019 rather than another stackable way (as
>> mentioned in incremental fs thread.)
>
> still for us the stackable way works better.
>
>> Thanks,
>> Gao Xiang
>>
>>> Regards,
>>> Giuseppe
>>> [1] https://github.com/google/crfs