Re: [PATCH 2/3] init/do_cmounts.c: introduce 'user_root' for initramfs

From: Luis Chamberlain
Date: Mon May 24 2021 - 20:44:29 EST


Cc'ing Josh as I think he might be interested in this.

On Sat, May 22, 2021 at 07:31:54PM +0800, menglong8.dong@xxxxxxxxx wrote:
> From: Menglong Dong <dong.menglong@xxxxxxxxxx>
>
> During the kernel initialization, the root of the mount tree is
> created with file system type of ramfs or tmpfs.

ramfs (initrd)

> While using initramfs as the root file system, cpio file is unpacked
> into the rootfs. Thus, this rootfs is exactly what users see in user
> space, and some problems arose: this rootfs has no parent mount,
> which make it can't be umounted or pivot_root.
> 'pivot_root' is used to change the rootfs and clean the old mounts,
> and it is essential for some users, such as docker. Docker use
> 'pivot_root' to change the root fs of a process if the current root
> fs is a block device of initrd. However, when it comes to initramfs,
> things is different: docker has to use 'chroot()' to change the root
> fs, as 'pivot_root()' is not supported in initramfs.
>
> The usage of 'chroot()' to create root fs for a container introduced
> a lot problems.
>
> First, 'chroot()' can't clean the old mountpoints which inherited
> from the host. It means that the mountpoints in host will have a
> duplicate in every container. Users may remove a USB after it
> is umounted successfully in the host. However, the USB may still
> be mounted in containers, although it can't be seen by the 'mount'
> commond. This means the USB is not released yet, and data may not
> write back. Therefore, data lose arise.
>
> Second, net-namespace leak is another problem. The net-namespace
> of containers will be mounted in /var/run/docker/netns/ in host
> by dockerd. It means that the net-namespace of a container will
> be mounted in containers which are created after it. Things
> become worse now, as the net-namespace can't be remove after
> the destroy of that container, as it is still mounted in other
> containers. If users want to recreate that container, he will
> fail if a certain mac address is to be binded with the container,
> as it is not release yet.

I think you can clarify this a bit more with:

If using container platforms such as Docker, upon initialization it
wants to use pivot_root() so that currently mounted devices do not
propagate to containers. An example of value in this is that
a USB device connected prior to the creation of a containers on the
host gets disconnected after a container is created; if the
USB device was mounted on containers, but already removed and
umounted on the host, the mount point will not go away untill all
containers unmount the USB device.

Another reason for container platforms such as Docker to use pivot_root
is that upon initialization the net-namspace is mounted under
/var/run/docker/netns/ on the host by dockerd. Without pivot_root
Docker must either wait to create the network namespace prior to
the creation of containers or simply deal with leaking this to each
container.

pivot_root is supported if the rootfs is a ramfs (initrd), but its
not supported if the rootfs uses an initramfs (tmpfs). This means
container platforms today must resort to using ramfs (initrd) if
they want to pivot_root from the rootfs. A workaround to use chroot()
is not a clean viable option given every container will have a
duplicate of every mount point on the host.

In order to support using container platforms such as Docker on
all the supported rootfs types we must extend Linux to support
pivot_root on initramfs as well. This patch does the work to do
just that.

So remind me.. so it would seem that if the rootfs uses a ramfs (initrd)
that pivot_root works just fine. Why is that? Did someone add support
for that? Has that always been the case that it works? If not, was it a
consequence of how ramfs (initrd) works?

And finally, why can't we share the same mechanism used for ramfs
(initrd) for initramfs (tmpfs)?

> In this patch, a second mount, which is called 'user root', is
> created before 'cpio' unpacking. The file system of 'user root'
> is exactly the same as rootfs, and both ramfs and tmpfs are
> supported. Then, the 'cpio' is unpacked into the 'user root'.
> Now, the rootfs has a parent mount, and pivot_root() will be
> supported for initramfs.

How about something like:

In order to support pivot_root on initramfs we introduce a second
"user_root" mount which is created before we do the cpio unpacking.
The filesystem of the "user_root" mount is the same the rootfs.

It begs the question, why add this infrastructure to suppor this for
ramfs (initrd) if we only need this hack for initramfs (tmpfs)?

> What's more, after this patch, 'rootflags' in boot cmd is supported
> by initramfs. Therefore, users can set the size of tmpfs with
> 'rootflags=size=1024M'.

Why is that exactly?

> Signed-off-by: Menglong Dong <dong.menglong@xxxxxxxxxx>
> ---
> init/do_mounts.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++
> init/do_mounts.h | 7 ++++-
> init/initramfs.c | 10 +++++++
> 3 files changed, 88 insertions(+), 1 deletion(-)
>
> diff --git a/init/do_mounts.c b/init/do_mounts.c
> index a78e44ee6adb..943cb58846fb 100644
> --- a/init/do_mounts.c
> +++ b/init/do_mounts.c
> @@ -617,6 +617,78 @@ void __init prepare_namespace(void)
> init_chroot(".");
> }
>
> +#ifdef CONFIG_TMPFS
> +static __init bool is_tmpfs_enabled(void)
> +{
> + return (!root_fs_names || strstr(root_fs_names, "tmpfs")) &&
> + !saved_root_name[0];
> +}
> +#endif
> +
> +static __init bool is_ramfs_enabled(void)
> +{
> + return true;
> +}
> +
> +struct fs_user_root {
> + bool (*enabled)(void);
> + char *dev_name;

What's the point of dev_name if its never set?

> + char *fs_name;
> +};
> +
> +static struct fs_user_root user_roots[] __initdata = {
> +#ifdef CONFIG_TMPFS
> + {.fs_name = "tmpfs", .enabled = is_tmpfs_enabled },
> +#endif
> + {.fs_name = "ramfs", .enabled = is_ramfs_enabled }
> +};
> +static struct fs_user_root * __initdata user_root;
> +
> +/* Mount the user_root on '/'. */
> +int __init mount_user_root(void)
> +{
> + return do_mount_root(user_root->dev_name,

See, isn't dev_name here always NULL?

> + user_root->fs_name,
> + root_mountflags & ~MS_RDONLY,
> + root_mount_data);
> +}
> +
> +/*
> + * This function is used to chroot to new initramfs root that
> + * we unpacked on success.

Might be a good place to document that we do this so folks can
pivot_root on rootfs, and why that is desirable (mentioned above on the
commit log edits I suggested). Otherwise I don't think its easy for a
reader of the code to understand why we are doing all this work.

> It will chdir to '/' and umount
> + * the secound mount on failure.
> + */
> +void __init end_mount_user_root(bool succeed)
> +{
> + if (!succeed)
> + goto on_failed;
> +
> + if (!ramdisk_exec_exist(false))
> + goto on_failed;
> +
> + init_mount(".", "/", NULL, MS_MOVE, NULL);
> + init_chroot(".");
> + return;
> +
> +on_failed:
> + init_chdir("/");
> + init_umount("/..", 0);
> +}

Is anything extra needed on shutdown / reboot?

Luis