Re: [LSF/MM/BPF TOPIC] Image-based read-only filesystem: further use cases & directions

From: Zhang Yi
Date: Thu Feb 23 2023 - 22:10:18 EST


On 2023/1/9 16:43, Gao Xiang wrote:
> Hi folks,
>
> * Background *
>
> We've been continuously working on a useful read-only (immutable)
> image solution since the end of 2017 (as a part of our work), which,
> as everyone may know, is EROFS.
>
> Currently it has already landed successfully on (about) billions of
> Android-related devices, other types of embedded devices and containers
> with many vendors involved, and we've always been seeking more use
> cases such as incremental immutable rootfs, app sandboxes or packages
> (Android apk? with many duplicated libraries), dataset packages, etc.
>
> The reasons why we always do believe immutable images can benefit
> various use cases are:
>
>  - much easier for all vendors to ship/distribute/keep the original
>    signed (golden) images for each instance;
>
>  - (combined with the writable layer such as overlayfs) easy to roll
>    back to the original shipped state or do incremental updates;
>
>  - easy to check for data corruption or do data recovery (whether
>    caused by physical device or network errors);
>
>  - easy for real storage devices to do hardware write-protection for
>    immutable images;
>
>  - can do various offline algorithms (such as reduced metadata,
>    content-defined rolling hash deduplication, compression) to minimize
>    image sizes;
>
>  - initrd with FSDAX to avoid double caching, with the advantages above;
>
>  - and more.
>
> In 2019, an LSF/MM/BPF topic was put forward to show initial EROFS use
> cases [1] as the read-only Android rootfs of a single instance on
> resource-limited devices so that effective compression became quite
> important at that time.
>
>
> * Problem *
>
> In addition to enhancing data compression for single-instance deployment,
> as a self-contained approach (so that all use cases can share only
> _one_ signed image), we've also been focusing in recent years on multiple
> instances (such as containers or apps, where each image represents a
> complete filesystem tree) living together on one device with similar
> data, so that effective data deduplication, on-demand lazy pulling and
> page cache sharing among such different golden images have become vital
> as well.
>
>
> * Current progress *
>
> In order to resolve the challenges above, we've worked out:
>
>  - (v5.15) chunk-based inodes (to form inode extents) to do data
>    deduplication within a single image;
>
>  - (v5.16) multiple shared blobs (to keep content-defined data) in
>    addition to the primary blob (to keep filesystem metadata) for wider
>    deduplication across different images;
>
>  - (v5.19) file-based distribution by introducing in-kernel local
>    caching fscache and on-demand lazy pulling feature [2];
>
>  - (v6.1) shared domain to share such multiple shared blobs in
>    fscache mode [3];
>
>  - [RFC] preliminary page cache sharing between different images [4].
>
>
> * Potential topics to discuss *
>
>  - data verification of different images with thousands (or more)
>    shared blobs [5];
>
>  - encryption with per-extent keys for confidential containers [5][6];
>
>  - current page cache sharing limitations due to mm reverse mapping,
>    and finer (folio or page-based) page cache sharing among
>    images/blobs [4][7];
>
>  - more effective in-kernel local caching features for fscache such as
>    failover and daemonless;
>
>  - (wild preliminary ideas, maybe) overlayfs partial copy-up with
>    fscache as the upper layer in order to form a unique caching
>    subsystem for better space saving?
>

Hello Xiang and all,

We are interested in these topics too. Our cloud products also want to use
erofs + overlayfs as the container base image, and we want to do more
research on deduplication, page cache sharing and disk space saving. I have
also done some study on the overlayfs partial copy-up feature. I hope we can
have further discussion on this topic in person.

Thanks,
Yi.


>  - FSDAX enhancements for initial ramdisk or other use cases;
>
>  - other issues when landing.
>
>
> Finally, if our efforts (or plans) also make sense to you, we do hope
> more people could join us. Thanks!
>
> [1] https://lore.kernel.org/r/f44b1696-2f73-3637-9964-d73e3d5832b7@xxxxxxxxxx
> [2] https://lore.kernel.org/r/Yoj1AcHoBPqir++H@debian
> [3] https://lore.kernel.org/r/20220918043456.147-1-zhujia.zj@xxxxxxxxxxxxx
> [4] https://lore.kernel.org/r/20230106125330.55529-1-jefflexu@xxxxxxxxxxxxxxxxx
> [5] https://lore.kernel.org/r/Y6KqpGscDV6u5AfQ@B-P7TQMD6M-0146.local
> [6] https://lwn.net/SubscriberLink/918893/4d389217f9b8d679
> [7] https://lwn.net/Articles/895907
>
> Thanks,
> Gao Xiang