Re: [PATCH v2 2/6] composefs: Add on-disk layout

From: Alexander Larsson
Date: Tue Jan 17 2023 - 07:12:28 EST

Next message: Juergen Gross: "Re: [PATCH linux-next v3] x86/xen/time: prefer tsc as clocksource when it is invariant"
Previous message: Martin Wilck: "Re: regression on aarch64? panic on boot"
In reply to: Dave Chinner: "Re: [PATCH v2 2/6] composefs: Add on-disk layout"
Next in thread: Dave Chinner: "Re: [PATCH v2 2/6] composefs: Add on-disk layout"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, 2023-01-17 at 10:06 +1100, Dave Chinner wrote:
> On Mon, Jan 16, 2023 at 12:00:03PM +0100, Alexander Larsson wrote:
> > On Mon, 2023-01-16 at 12:29 +1100, Dave Chinner wrote:
> > > On Fri, Jan 13, 2023 at 04:33:55PM +0100, Alexander Larsson
> > > wrote:
> > > > This commit adds the on-disk layout header file of composefs.
> > >
> > > This isn't really a useful commit message.
> > >
> > > Perhaps it should actually explain what the overall goals of the
> > > on-disk format are - space usage, complexity trade-offs,
> > > potential
> > > issues with validation of variable payload sections, etc.
> > >
> >
> > I agree, will flesh it out. But, as for below discussions, one of
> > the
> > overall goals is to keep the on-disk file size low.
> >
> > > > Signed-off-by: Alexander Larsson <alexl@xxxxxxxxxx>
> > > > Co-developed-by: Giuseppe Scrivano <gscrivan@xxxxxxxxxx>
> > > > Signed-off-by: Giuseppe Scrivano <gscrivan@xxxxxxxxxx>
> > > > ---
> > > > fs/composefs/cfs.h | 203
> > > > +++++++++++++++++++++++++++++++++++++++++++++
> > > > 1 file changed, 203 insertions(+)
> > > > create mode 100644 fs/composefs/cfs.h
> > > >
> > > > diff --git a/fs/composefs/cfs.h b/fs/composefs/cfs.h
> > > > new file mode 100644
> > > > index 000000000000..658df728e366
> > > > --- /dev/null
> > > > +++ b/fs/composefs/cfs.h
> > > > @@ -0,0 +1,203 @@
> > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > +/*
> > > > + * composefs
> > > > + *
> > > > + * Copyright (C) 2021 Giuseppe Scrivano
> > > > + * Copyright (C) 2022 Alexander Larsson
> > > > + *
> > > > + * This file is released under the GPL.
> > > > + */
> > > > +
> > > > +#ifndef _CFS_H
> > > > +#define _CFS_H
> > > > +
> > > > +#include <asm/byteorder.h>
> > > > +#include <crypto/sha2.h>
> > > > +#include <linux/fs.h>
> > > > +#include <linux/stat.h>
> > > > +#include <linux/types.h>
> > > > +
> > > > +#define CFS_VERSION 1
> > >
> > > This should start with a description of the on-disk format for
> > > the
> > > version 1 format.
> >
> > There are some format descriptions in the later document patch.
> > What is
> > the general approach here, do we document in the header, or in
> > separate
> > doc file? For example, I don't see much of format descriptions in
> > the
> > xfs headers. I mean, I should probably add *some* info here for
> > easier
> > reading of the stuff below, but I don't feel like headers are a
> > great
> > place for docs.
>
> it's fine to describe the format in the docs, but when reading the
> code there needs to at least an overview of the structure the code
> is implementing so that the code makes some sense without having to
> go find the external place the format is documented.

Yeah, I'll try to make format in the next series be overall more
commented.

> > >
> > > > +#define CFS_MAGIC 0xc078629aU
> > > > +
> > > > +#define CFS_MAX_DIR_CHUNK_SIZE 4096
> > > > +#define CFS_MAX_XATTRS_SIZE 4096
> > >
> > > How do we store 64kB xattrs in this format if the max attr size
> > > is
> > > 4096 bytes? Or is that the maximum total xattr storage?
> >
> > This is a current limitation of the composefs file format.
>
> Yes, but is that 4kB limit the maximum size of a single xattr, or is
> it the total xattr storage space for an inode?

Currently it is actually the total xattrs storage. I've never seen any
container images or rootfs images in general use any large amount of
xattrs. However, given the below discussion on multi-page mappings
maybe its possible to easily drop this limit.

> > I am aware
> > that the kernel maximum size is 64k,
>
> For a single xattr, yes. Hence my question....
>
> > > > +static inline int cfs_digest_from_payload(const char *payload,
> > > > size_t payload_len,
> > > > +                                         u8
> > > > digest_out[SHA256_DIGEST_SIZE])
> .....
> > > Too big to be a inline function.
> >
> > Yeah, I'm aware of this. I mainly put it in the header as the
> > implementation of it is sort of part of the on-disk format. But, I
> > can
> > move it to a .c file instead.
>
> Please do - it's really part of the reader implementation, not the
> structure definition.
>
> > > > +struct cfs_vdata_s {
> > >
> > > Drop the "_s" suffix to indicate the type is a structure - that's
> > > waht "struct" tells us.
> >
> > Sure.
> >
> > > > +       u64 off;
> > > > +       u32 len;
> > >
> > > If these are on-disk format structures, why aren't the defined as
> > > using the specific endian they are encoded in? i.e. __le64,
> > > __le32,
> > > etc? Otherwise a file built on a big endian machine won't be
> > > readable on a little endian machine (and vice versa).
> >
> > On disk all fields are little endian. However, when we read them
> > from
> > disk we convert them using e.g. le32_to_cpu(), and then we use the
> > same
> > structure in memory, with native endian. So, it seems wrong to mark
> > them as little endian.
>
> Then these structures do not define "on-disk format". Looking a bit
> further through the patchset, these are largely intermediate
> structures that are read once to instatiate objects in memory, then
> never used again. The cfs_inode_s is a good example of this - I'll
> come back to that.

The header/superblock is actually just read from the fs as-is, as are
most of the other structures. Only the inode data is packed.

> > > > +} __packed;
> > > > +
> > > > +struct cfs_header_s {
> > > > +       u8 version;
> > > > +       u8 unused1;
> > > > +       u16 unused2;
> > >
> > > Why are you hyper-optimising these structures for minimal space
> > > usage? This is 2023 - we can use a __le32 for the version number,
> > > the magic number and then leave....
> > >
> > > > +
> > > > +       u32 magic;
> > > > +       u64 data_offset;
> > > > +       u64 root_inode;
> > > > +
> > > > +       u64 unused3[2];
> > >
> > > a whole heap of space to round it up to at least a CPU cacheline
> > > size using something like "__le64 unused[15]".
> > >
> > > That way we don't need packed structures nor do we care about
> > > having
> > > weird little holes in the structures to fill....
> >
> > Sure.
>
> FWIW, now I see how this is used, this header kinda defines what
> we'd call the superblock in the on-disk format of a filesystem. It's
> at a fixed location in the image file, so there should be a #define
> somewhere in this file to document it's fixed location.

It is at offset zero. I don't really think that needs a define, does
it? Maybe a comment though.

> Also, if this is the in-memory representation of the structure and
> not the actual on-disk format, why does it even need padding,
> packing or even store the magic number?

In this case it is the on-disk format though.

> i.e. this information could simply be stored in a few fields in the
> cfs
> superblock structure that wraps the vfs superblock, and the
> superblock read function could decode straight into those fields...

We just read this header from disk, validate the magic and then convert
the fields to native endian, then the few used fields (data_offset and
root_inode) to the vfs superblock structure.

> > > > +} __packed;
> > > > +
> > > > +enum cfs_inode_flags {
> > > > +       CFS_INODE_FLAGS_NONE = 0,
> > > > +       CFS_INODE_FLAGS_PAYLOAD = 1 << 0,
> > > > +       CFS_INODE_FLAGS_MODE = 1 << 1,
> > > > +       CFS_INODE_FLAGS_NLINK = 1 << 2,
> > > > +       CFS_INODE_FLAGS_UIDGID = 1 << 3,
> > > > +       CFS_INODE_FLAGS_RDEV = 1 << 4,
> > > > +       CFS_INODE_FLAGS_TIMES = 1 << 5,
> > > > +       CFS_INODE_FLAGS_TIMES_NSEC = 1 << 6,
> > > > +       CFS_INODE_FLAGS_LOW_SIZE = 1 << 7, /* Low 32bit of
> > > > st_size
> > > > */
> > > > +       CFS_INODE_FLAGS_HIGH_SIZE = 1 << 8, /* High 32bit of
> > > > st_size */
> > >
> > > Why do we need to complicate things by splitting the inode size
> > > like this?
> > >
> >
> > The goal is to minimize the image size for a typical rootfs or
> > container image. Almost zero files in any such images are > 4GB.
>
> Sure, but how much space does this typically save, versus how much
> complexity it adds to runtime decoding of inodes?
>
> I mean, in a dense container system the critical resources that need
> to be saved is runtime memory and CPU overhead of operations, not
> the storage space. Saving a 30-40 bytes of storage space per inode
> means a typical image might ber a few MB smaller, but given the
> image file is not storing data we're only talking about images the
> use maybe 500 bytes of data per inode. Storage space for images
> is not a limiting factor, nor is network transmission (because
> compression), so it comes back to runtime CPU and memory usage.

Here are some example sizes of composefs images with the current packed
inodes:

6.2M cs9-developer-rootfs.composefs
2.1M cs9-minimal-rootfs.composefs
1.2M fedora-37-container.composefs
433K ubuntu-22.04-container.composefs

If we set all the flags for the inodes (i.e. fixed size inodes) we get:

8.8M cs9-developer-rootfs.composefs
3.0M cs9-minimal-rootfs.composefs
1.6M fedora-37-container.composefs
625K ubuntu-22.04-container.composefs

So, images are about 40% larger with fixed size inodes.

> The inodes are decoded out of the page cache, so the memory for the
> raw inode information is volatile and reclaimed when needed.
> Similarly, the VFS inode built from this information is reclaimable
> when not in use, too. So the only real overhead for runtime is the
> decoding time to find the inode in the image file and then decode
> it.

I disagree with this characterization. It is true that page cache is
volatile, but if you can fit 40% less inode data in the page cache then
there is additional overhead where you need to read this from disk. So,
decoding time is not the only thing that affects overhead.

Additionally, just by being larger and less dense, more data has to be
read from disk, which itself is slower.

> Given the decoding of the inode -all branches- and is not
> straight-line code, it cannot be well optimised and the CPU branch
> predictor is not going to get it right every time. Straight line
> code that decodes every field whether it is zero or not is going to
> be faster.
>
> Further, with a fixed size inode in the image file, the inode table
> can be entirely fixed size, getting rid of the whole unaligned data
> retreival problem that code currently has (yes, all that
> "le32_to_cpu(__get_unaligned(__le32, data)" code) because we can
> ensure that all the inode fields are aligned in the data pages. This
> will significantly speed up decoding into the in-memory inode
> structures.

I agree it could be faster. But is inode decode actually the limiting
factor, compared to things like disk i/o or better use of page cache?

> And to take it another step, the entire struct cfs_inode_s structure
> could go away - it is entirely a temporary structure used to shuffle
> data from the on-disk encoded format to the the initialisation of
> the VFS inode. The on-disk inode data could be decoded directly into
> the VFS inode after it has been instantiated, rather than decoding
> the inode from the backing file and the instantiating the in-memory
> inode.
>
> i.e. instead of:
>
> cfs_lookup()
>         cfs_dir_lookup(&index)
>         cfs_get_ino_index(index, &inode_s)
>                 cfs_get_inode_data_max(index, &data)
>                 inode_s->st_.... = cfs_read_....(&data);
>                 inode_s->st_.... = cfs_read_....(&data);
>                 inode_s->st_.... = cfs_read_....(&data);
>                 inode_s->st_.... = cfs_read_....(&data);
>         cfs_make_inode(inode_s, &vfs_inode)
>                 inode = new_inode(sb)
>                 inode->i_... = inode_s->st_....;
>                 inode->i_... = inode_s->st_....;
>                 inode->i_... = inode_s->st_....;
>                 inode->i_... = inode_s->st_....;
>
> You could collapse this straight down to:
>
> cfs_lookup()
>         cfs_dir_lookup(&index)
>         cfs_make_inode(index, &vfs_inode)
>                 inode = new_inode(sb)
>                 cfs_get_inode_data_max(index, &data)
>                 inode->i_... = cfs_read_....(&data);
>                 inode->i_... = cfs_read_....(&data);
>                 inode->i_... = cfs_read_....(&data);
>                 inode->i_... = cfs_read_....(&data);
>
> This removes an intermediately layer from the inode instantiation
> fast path completely. ANd if the inode table is fixed size and
> always well aligned, then the cfs_make_inode() code that sets up the
> VFS inode is almost unchanged from what it is now. There are no new
> branches, the file image format is greatly simplified, and the
> runtime overhead of instantiating inodes is significantly reduced.

I'm not sure the performance win is clear compared to the extra size,
as generally inodes are only decoded once and kept around in memory for
most of its use. However, I agree that there are clear advantages in
simplifying the format. That makes it easier to maintain and
understand. I'll give this some thought.

> Similar things can be done with the rest of the "descriptor"
> abstraction - the intermediate in-memory structures can be placed
> directly in the cfs_inode structure that wraps the VFS inode, and
> the initialisation of them can call the decoding code directly
> instead of using intermediate structures as is currently done.
>
> This will remove a chunk of code from the implemenation and make it
> run faster....
>
> > Also, we don't just "not decode" the items with the flag not set,
> > they
> > are not even stored on disk.
>
> Yup, and I think that is a mistake - premature optimisation and all
> that...
>
> >
> > > > +       CFS_INODE_FLAGS_XATTRS = 1 << 9,
> > > > +       CFS_INODE_FLAGS_DIGEST = 1 << 10, /* fs-verity sha256
> > > > digest */
> > > > +       CFS_INODE_FLAGS_DIGEST_FROM_PAYLOAD = 1 << 11, /*
> > > > Compute
> > > > digest from payload */
> > > > +};
> > > > +
> > > > +#define CFS_INODE_FLAG_CHECK(_flag,
> > > > _name)                                     \
> > > > +       (((_flag) & (CFS_INODE_FLAGS_##_name)) != 0)
> > >
> > > Check what about a flag? If this is a "check that a feature is
> > > set",
> > > then open coding it better, but if you must do it like this, then
> > > please use static inline functions like:
> > >
> > >         if (cfs_inode_has_xattrs(inode->flags)) {
> > >                 .....
> > >         }
> > >
> >
> > The check is if the flag is set, so maybe CFS_INODE_FLAG_IS_SET is
> > a
> > better name. This is used only when decoding the on-disk version of
> > the
> > inode to the in memory one, which is a bunch of:
> >
> >         if (CFS_INODE_FLAG_CHECK(ino->flags, THE_FIELD))
> >                 ino->the_field = cfs_read_u32(&data);
> >         else
> >                 ino->the_field = THE_FIELD_DEFUALT;
> >
> > I can easily open-code these checks, although I'm not sure it makes
> > a
> > great difference either way.
>
> If they are used only once, then it should be open coded. But I
> think the whole "optional inode fields" stuff should just go away
> entirely at this point...
>
> > > > +#define CFS_INODE_DEFAULT_MODE 0100644
> > > > +#define CFS_INODE_DEFAULT_NLINK 1
> > > > +#define CFS_INODE_DEFAULT_NLINK_DIR 2
> > > > +#define CFS_INODE_DEFAULT_UIDGID 0
> > > > +#define CFS_INODE_DEFAULT_RDEV 0
> > > > +#define CFS_INODE_DEFAULT_TIMES 0
> > >
> > > Where do these get used? Are they on disk defaults or something
> > > else? (comment, please!)
> >
> > They are the defaults that are used when inode fields on disk are
> > missing. I'll add some comments.
>
> They go away entirely with fixed size on-disk inodes.
>
> > > > +       u32 st_mode; /* File type and mode. */
> > > > +       u32 st_nlink; /* Number of hard links, only for regular
> > > > files. */
> > > > +       u32 st_uid; /* User ID of owner. */
> > > > +       u32 st_gid; /* Group ID of owner. */
> > > > +       u32 st_rdev; /* Device ID (if special file). */
> > > > +       u64 st_size; /* Size of file, only used for regular
> > > > files
> > > > */
> > > > +
> > > > +       struct cfs_vdata_s xattrs; /* ref to variable data */
> > >
> > > This is in the payload that follows the inode? Is it included in
> > > the payload_length above?
> > >
> > > If not, where is this stuff located, how do we validate it points
> > > to
> > > the correct place in the on-disk format file, the xattrs belong
> > > to
> > > this specific inode, etc? I think that's kinda important to
> > > describe, because xattrs often contain important security
> > > information...
> >
> > No, all inodes are packed into the initial part of the file, each
> > containing a flags set, a variable size (from flags) chunk of fixed
> > size elements and an variable size payload. The payload is either
> > the
> > target symlink for symlinks, or the path of the backing file for
> > regular files.
>
> Ok, I think you need to stop calling that a "payload", then. It's
> the path name to the backing file. The backing file is only relevant
> for S_IFREG and S_IFLINK types - directories don't need path names
> as they only contain pointers to other inodes in the image file.
> Types like S_IFIFO, S_IFBLK, etc should not have backing files,
> either - they should just be instantiated as the correct type in the
> VFS inode and not require any backing file interactions at all...
>
> Hence I think this "payload" should be called something like
> "backing path" or something similar.

Yeah, that may be better.

>
> .....
>
> > > > +struct cfs_dir_s {
> > > > +       u32 n_chunks;
> > > > +       struct cfs_dir_chunk_s chunks[];
> > > > +} __packed;
> > >
> > > So directory data is packed in discrete chunks? Given that this
> > > is a
> > > static directory format, and the size of the directory is known
> > > at
> > > image creation time, why does the storage need to be chunked?
> >
> > We chunk the data such that each chunk fits inside a single page in
> > the
> > image file. I did this to make accessing image data directly from
> > the
> > page cache easier.
>
> Hmmmm. So you defined a -block size- that matched the x86-64 -page
> size- to avoid page cache issues. Now, what about ARM or POWER
> which has 64kB page sizes?
>
> IOWs, "page size" is not the same on all machines, whilst the
> on-disk format for a filesystem image needs to be the same on all
> machines. Hence it appears that this:
>
> > > > +#define CFS_MAX_DIR_CHUNK_SIZE 4096
>
> should actually be defined in terms of the block size for the
> filesystem image, and this size of these dir chunks should be
> recorded in the superblock of the filesystem image. That way it
> is clear that the image has a specific chunk size, and it also paves
> the way for supporting more efficient directory structures using
> larger-than-page size chunks in future.

Yes, its true that assuming a (min) 4k page size is wasteful on some
arches, but it would be hard to read a filesystem created for 64k pages
on a 4k page machine, which is not ideal. However, wrt your commend on
multi-page mappings, maybe we can just totally drop these limits. I'll
have a look at that.

> > If we had dirent data spanning multiple pages
> > then we would either need to map the pages consecutively (which
> > seems
> > hard/costly) or have complex in-kernel code to handle the case
> > where a
> > dirent straddles two pages.
>
> Actually pretty easy - we do this with XFS for multi-page directory
> buffers. We just use vm_map_ram() on a page array at the moment,
> but in the near future there will be other options based on
> multipage folios.
>
> That is, the page cache now stores folios rather than pages, and is
> capable of using contiguous multi-page folios in the cache. As a
> result, multipage folios could be used to cache multi-page
> structures in the page cache and efficiently map them as a whole.
>
> That mapping code isn't there yet - kmap_local_folio() only maps the
> page within the folio at the offset given - but the foundation is
> there for supporting this functionality natively....
>
> I certainly wouldn't be designing a new filesystem these days that
> has it's on-disk format constrained by the x86-64 4kB page size...

Yes, I agree. I'm gonna look at using multi-page mapping for both
dirents and xattr data, which should completely drop these limits, as
well as get rid of the dirent chunking.

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
=-=-=
Alexander Larsson Red Hat,
Inc
alexl@xxxxxxxxxx alexander.larsson@xxxxxxxxx
He's a witless playboy firefighter fleeing from a secret government
programme. She's a ditzy streetsmart socialite from the wrong side of
the
tracks. They fight crime!

Next message: Juergen Gross: "Re: [PATCH linux-next v3] x86/xen/time: prefer tsc as clocksource when it is invariant"
Previous message: Martin Wilck: "Re: regression on aarch64? panic on boot"
In reply to: Dave Chinner: "Re: [PATCH v2 2/6] composefs: Add on-disk layout"
Next in thread: Dave Chinner: "Re: [PATCH v2 2/6] composefs: Add on-disk layout"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]