Re: [NOMERGE] [RFC PATCH 00/12] erofs: introduce erofs file system

From: Richard Weinberger
Date: Fri Jun 01 2018 - 03:48:23 EST


On Thu, May 31, 2018 at 1:06 PM, Gao Xiang <gaoxiang25@xxxxxxxxxx> wrote:
> Hi all,
>
> Read-only file systems are used in many cases, such as read-only storage media.
> We are currently focusing on Android devices, which contain several read-only
> partitions. Because existing read-only solutions are limited, a new read-only
> file system, EROFS (Extendable Read-Only File System), is introduced.

In which sense is it extendable?

> As in other read-only file systems, several meta regions found in generic file
> systems, such as the free space bitmap, are omitted. The difference is that EROFS
> focuses more on performance than purely on saving as much storage space as possible.
>
> Furthermore, we also add the compression support called z_erofs.
>
> Traditional file systems with compression support use fixed-sized input
> compression, so the compressed output units can have arbitrary lengths.
> However, block devices are accessed in units of blocks, which means that
> (A) if the accessed compressed data is not buffered, some of the data read from
> a physical block cannot be used, as illustrated below:
>
>    ++-----------++-----------++     ++-----------++-----------++
> ...||           ||           || ... ||           ||           || ... original data
>    ++-----------++-----------++     ++-----------++-----------++
>     \                       /         \                       /
>      \                   /              \                   /
>        \             /                    \                /
>    ++---|-------++--|--------++     ++-----|----++--------|--++
>    ||xxx|       ||  |xxxxxxxx|| ... ||xxxxx|    ||        |xx||  compressed data
>    ++---|-------++--|--------++     ++-----|----++--------|--++
>
> The shaded regions (x) are read from the block device but cannot be used for
> decompression.
>
> (B) If the compressed data is also buffered, memory overhead increases.
> Because the data is compressed, it cannot be used directly, and we do not know
> when the corresponding compressed blocks will be accessed, which is unfriendly
> to random reads.
>
> To reduce the proportion of data that cannot be directly decompressed, larger
> compressed sizes tend to be chosen, which is also unfriendly to random reads.
>
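
A minimal Python sketch of case (A), not taken from the patch set: it assumes
fixed-sized input units whose compressed outputs are packed back to back, and a
4 KiB block size. The function name and the compressed lengths are hypothetical
and serve only to show how many bytes a block device has to deliver versus how
many are needed to decompress a single unit.

```python
BLOCK_SIZE = 4096  # assumed block size of the device


def read_amplification(compressed_lengths, index, block_size=BLOCK_SIZE):
    """Return (bytes_read, bytes_needed) for fetching compressed unit `index`.

    Assumes the compressed units are stored contiguously, in order, starting
    at byte 0 of the device.
    """
    start = sum(compressed_lengths[:index])
    end = start + compressed_lengths[index]
    first_block = start // block_size
    last_block = (end - 1) // block_size
    # Whole blocks must be read even if only part of each one is needed.
    bytes_read = (last_block - first_block + 1) * block_size
    return bytes_read, end - start


# Hypothetical compressed lengths of four fixed-sized input units.
lengths = [1500, 3000, 2500, 3800]
for i in range(len(lengths)):
    read, needed = read_amplification(lengths, i)
    print(f"unit {i}: read {read} bytes, needed {needed} bytes")
```

The gap between the two numbers corresponds to the shaded regions in the diagram
above.
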
> EROFS implements compression with a different approach, the details of which will
> be discussed in the next section.
>
> In brief, the following points summarize our design at a high level:
>
> 1) Use page-sized blocks so that there are no buffer heads.
>
> 2) By introducing a more general inline data / xattr mechanism, metadata and small
> data can be read together with the inode metadata.
>
> 3) Introduce a separate shared xattr region to store common xattrs (e.g.
> SELinux labels) or xattrs too large to be stored inline with the metadata.
>
> 4) Metadata and data can be mixed by design, which gives mkfs more flexibility
> in organizing files and data.
>
> 5) Instead of using fixed-sized input compression, we put forward a new fixed-sized
> output compression that makes full use of IO (all data read by an IO can be
> decompressed), reduces read amplification, improves random reads and keeps
> compression ratios relatively low, as illustrated below:
>
>
>         |--- variable-length extent ----|------ VLE ------|---  VLE ---|
>         /> clusterofs                   /> clusterofs     /> clusterofs /> clusterofs
>    ++---|-------++-----------++---------|-++-----------++-|---------++-|
> ...||   |       ||           ||         | ||           || |         || | ... original data
>    ++---|-------++-----------++---------|-++-----------++-|---------++-|
>    ++->cluster<-++->cluster<-++->cluster<-++->cluster<-++->cluster<-++
>         size         size         size         size         size
>          \                          /                /            /
>           \                     /               /            /
>            \                /              /            /
>             ++-----------++-----------++-----------++
>         ... ||           ||           ||           || ... compressed clusters
>             ++-----------++-----------++-----------++
>             ++->cluster<-++->cluster<-++->cluster<-++
>                  size         size         size
>
> A cluster could contain more than one block by design, but currently only the
> page-sized cluster implementation exists (page-sized fixed output compression can
> also achieve a better compression ratio than fixed input compression).
>
> All compressed clusters have a fixed size but could be decompressed into extents with
> arbitrary lengths.
>
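
To illustrate the idea of fixed-sized output compression, here is a hedged
Python sketch. It is not the z_erofs implementation: it uses zlib as a stand-in
compressor, a hypothetical `pack_fixed_output` helper, and a simple search for
how much input fits into one 4 KiB cluster. It only shows that every cluster
has the same size on disk while the extents they decode to vary in length.

```python
import zlib

CLUSTER_SIZE = 4096  # assumed cluster size (one page)


def pack_fixed_output(data, cluster_size=CLUSTER_SIZE):
    """Split `data` into (extent_length, cluster) pairs.

    Each cluster is exactly `cluster_size` bytes: a zlib stream padded with
    zeros. Assumes cluster_size is large enough to hold one compressed byte.
    """
    clusters = []
    pos = 0
    while pos < len(data):
        def fits(n):
            return len(zlib.compress(data[pos:pos + n])) <= cluster_size

        remaining = len(data) - pos
        if fits(remaining):
            n = remaining
        else:
            # Invariant: fits(lo) holds and fits(hi) does not, so the result
            # always fits (it may not be the largest possible extent).
            lo, hi = 1, remaining
            while hi - lo > 1:
                mid = (lo + hi) // 2
                if fits(mid):
                    lo = mid
                else:
                    hi = mid
            n = lo
        payload = zlib.compress(data[pos:pos + n])
        clusters.append((n, payload.ljust(cluster_size, b"\0")))
        pos += n
    return clusters


def unpack(clusters):
    """Decompress every cluster; zero padding after each stream is ignored."""
    return b"".join(zlib.decompressobj().decompress(c) for _, c in clusters)
```

Every byte read from a cluster belongs to the extent being decompressed (apart
from the padding this sketch adds), which is the property point 5) describes.
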
> In addition, if a buffered IO reads the following shaded region (x), we could add a
> customized path (replacing generic_file_buffered_read) that reads only one compressed
> cluster and makes the partial page available.
>         /> clusterofs
>    ++---|-------++
> ...||   | xxxx  || ...
>    ||---|-------||
>
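
As a rough sketch of why such a read needs only one compressed cluster: if the
length of the extent each cluster decodes to is known, a logical offset can be
mapped to a single cluster. The Python below is an assumption-laden
illustration; `locate` and the extent lengths are hypothetical, and this is not
the on-disk index format of the patch set.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(extent_lengths, offset):
    """Map a logical file offset to (cluster index, offset inside its extent).

    `extent_lengths[i]` is the decompressed length of compressed cluster i.
    """
    ends = list(accumulate(extent_lengths))  # logical end offset of each extent
    if not ends or not 0 <= offset < ends[-1]:
        raise ValueError("offset outside the file")
    index = bisect_right(ends, offset)
    start = ends[index - 1] if index else 0
    return index, offset - start


# Hypothetical extents: three clusters decoding to 9000, 5000 and 3500 bytes.
print(locate([9000, 5000, 3500], 9100))
```
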
> Some numbers using fixed output compression (VLE, cluster size = block size = 4k) on
> the server and Android phone (kirin970 platform):
>
> Server (magnetic disk):
>
> compression  EROFS seq read  EXT4 seq read  EROFS random read  EXT4 random read
> ratio        bw[MB/s]        bw[MB/s]       bw[MB/s] (20%)     bw[MB/s] (20%)
>
> 4            480.3           502.5          69.8               11.1
> 10           472.3           503.3          56.4               10.0
> 15           457.6           495.3          47.0               10.9
> 26           401.5           511.2          34.7               11.1
> 35           389.1           512.5          28.0               11.0
> 48           375.4           496.5          23.2               10.6
> 53           370.2           512.0          21.8               11.0
> 66           349.2           512.0          19.0               11.4
> 76           310.5           497.3          17.3               11.6
> 85           301.2           512.0          16.0               11.0
> 94           292.7           496.5          14.6               11.1
> 100          538.9           512.0          11.4               10.8
>
> Kirin970 (A73 big core 2361MHz, A53 little core 0MHz, DDR 1866MHz):

What storage was used? An eMMC?

> compression  EROFS seq read  EXT4 seq read  EROFS random read  EXT4 random read
> ratio        bw[MB/s]        bw[MB/s]       bw[MB/s] (20%)     bw[MB/s] (20%)
>
> 4            546.7           544.3          157.7              57.9
> 10           535.7           521.0          152.7              62.0
> 15           529.0           520.3          125.0              65.0
> 26           418.0           526.3          97.6               63.7
> 35           367.7           511.7          89.0               63.7
> 48           415.7           500.7          78.2               61.2
> 53           423.0           566.7          72.8               62.9
> 66           334.3           537.3          69.8               58.3
> 76           387.3           546.0          65.2               56.0
> 85           306.3           546.0          63.8               57.7
> 94           345.0           589.7          59.2               49.9
> 100          579.7           556.7          62.1               57.7

How does it compare to existing read-only filesystems, such as squashfs?

--
Thanks,
//richard