[RFC PATCH 0/9] fuse: API for Checkpoint/Restore

From: Alexander Mikhalitsyn
Date: Mon Feb 20 2023 - 14:38:23 EST


Hello everyone,

It would be great to hear your comments regarding this proof-of-concept Checkpoint/Restore API for FUSE.

Support of FUSE C/R is a challenging task for CRIU [1]. Last year I've given a brief talk on LPC 2022
about how we handle files C/R in CRIU and which blockers we have for FUSE filesystems. [2]

The main problem for CRIU is that we have to restore mount namespaces and memory mappings before the process tree.
It means that when CRIU is performing mount of fuse filesystem it can't use the original FUSE daemon from the
restorable process tree, but instead use a "fake daemon".

This leads to many other technical problems:
* "fake" daemon has to reply to FUSE_INIT request from the kernel and initialize fuse connection somehow.
This setup can be not consistent with the original daemon (protocol version, daemon capabilities/settings
like no_open, no_flush, readahead, and so on).
* each fuse request has a unique ID. It could confuse userspace if this unique ID sequence was reset.

We can workaround some issues and implement fragile and limited support of FUSE in CRIU but it doesn't make any sense, IMHO.
Btw, I've enumerated only CRIU restore-stage problems there. The dump stage is another story...

My proposal is not only about CRIU. The same interface can be useful for FUSE mounts recovery after daemon crashes.
LXC project uses LXCFS [3] as a procfs/cgroupfs/sysfs emulation layer for containers. We are using a scheme when
one LXCFS daemon handles all the work for all the containers and we use bindmounts to overmount particular
files/directories in procfs/cgroupfs/sysfs. If this single daemon crashes for some reason we are in trouble,
because we have to restart all the containers (fuse bindmounts become invalid after the crash).
The solution is fairly easy:
allow somehow to reinitialize the existing fuse connection and replace the daemon on the fly
This case is a little bit simpler than CRIU cause we don't need to care about the previously opened files
and other stuff, we are only interested in mounts.

Current PoC implementation was developed and tested with this "recovery case".
Right now I only have LXCFS patched and have nothing for CRIU. But I wanted to discuss this idea before going forward with CRIU.

Brief API/design description.

I've added two ioctl's:
* ioctl(FUSE_DEV_IOC_REINIT)
Performs fuse connection abort and then reinitializes all internal fuse structures as "brand new".
Then sends a FUSE_INIT request, so a new userspace daemon can reply to it and start processing fuse reqs.

* ioctl(FUSE_DEV_IOC_BM_REVAL)
A little bit hacky thing. Traverses all the bindmounts of existing fuse mount and performs relookup
of (struct vfsmount)->mnt_root dentries with the new daemon and reset them to new dentries.
Pretty useful for the recovery case (for instance, LXCFS).

Now about the dentry/inode invalidation mechanism.
* added the "fuse connection generation" concept.
When reinit is performed then connection generation is increased by 1.
Each fuse inode stores the generation of the connection it was allocated with.

* perform dentry revalidation if it has an old generation [fuse_dentry_revalidate]
The current implementation of fuse_dentry_revalidate follows a simple and elegant idea. When we
want to revalidate the dentry we just send a FUSE_LOOKUP request to the userspace
for the parent dentry with the name of the current dentry and check which attributes/inode id
it gets. If inode ids are the same and attributes (provided by the userspace) are valid
then we mark dentry valid and it continues to live (with inode).
I've only added a connection generation check to the condition when we have to perform revalidation
and added an inode connection generation reset (to actual connection gen) if the new userspace
daemon has looked up the same inode id (important for the CRIU case!).

Thank you for your attention and I'm waiting for your feedback :)

References:
[1] Support FUSE mountpoints https://github.com/checkpoint-restore/criu/issues/53
[2] Bringing up FUSE mounts C/R support https://lpc.events/event/16/contributions/1243/
[3] LXCFS https://github.com/lxc/lxcfs

Kind regards,
Alex

Cc: Miklos Szeredi <mszeredi@xxxxxxxxxx>
Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
Cc: Amir Goldstein <amir73il@xxxxxxxxx>
Cc: Stéphane Graber <stgraber@xxxxxxxxxx>
Cc: Seth Forshee <sforshee@xxxxxxxxxx>
Cc: Christian Brauner <brauner@xxxxxxxxxx>
Cc: Andrei Vagin <avagin@xxxxxxxxx>
Cc: Pavel Tikhomirov <ptikhomirov@xxxxxxxxxxxxx>
Cc: linux-fsdevel@xxxxxxxxxxxxxxx
Cc: linux-kernel@xxxxxxxxxxxxxxx
Cc: criu@xxxxxxxxxx

Alexander Mikhalitsyn (9):
fuse: move FUSE_DEFAULT_* defines to fuse common header
fuse: add const qualifiers to common fuse helpers
fuse: add fuse connection generation
fuse: handle stale inode connection in fuse_queue_forget
fuse: move fuse connection flags to the separate structure
fuse: take fuse connection generation into account
fuse: add fuse device ioctl(FUSE_DEV_IOC_REINIT)
namespace: add sb_revalidate_bindmounts helper
fuse: add fuse device ioctl(FUSE_DEV_IOC_BM_REVAL)

fs/fuse/acl.c | 6 +-
fs/fuse/dev.c | 167 +++++++++++++++++++-
fs/fuse/dir.c | 38 ++---
fs/fuse/file.c | 85 +++++-----
fs/fuse/fuse_i.h | 281 ++++++++++++++++++++--------------
fs/fuse/inode.c | 62 ++++----
fs/fuse/readdir.c | 8 +-
fs/fuse/xattr.c | 18 +--
fs/namespace.c | 90 +++++++++++
include/linux/mnt_namespace.h | 3 +
include/uapi/linux/fuse.h | 2 +
11 files changed, 531 insertions(+), 229 deletions(-)

--
2.34.1