Re: [RFC PATCH 2/3] fat: add renameat2 RENAME_EXCHANGE flag support

From: Javier Martinez Canillas
Date: Mon May 23 2022 - 11:34:31 EST


Hello Colin,

Thanks for your feedback.

On 5/23/22 12:40, Colin Walters wrote:
> On Thu, May 19, 2022, at 5:23 AM, Javier Martinez Canillas wrote:
>> The renameat2 RENAME_EXCHANGE flag allows to atomically exchange two paths
>> but is currently not supported by the Linux vfat filesystem driver.
>>
>> Add a vfat_rename_exchange() helper function that implements this support.
>>
>> The super block lock is acquired during the operation to ensure atomicity,
>> and in the error path actions made are reversed also with the mutex held,
>> making the whole operation transactional.
>
> Transactional with respect to the mounted kernel, but AIUI because vfat does not have journaling, the semantics on hard failure are...unspecified? Is it possible for example we could see no file at all in the destination path?
>

That's correct, it's transactional within the constraints imposed by vfat.
That is, there's no journal replay that would be done if something gets
corrupted in the filesystem.

But I believe that's also true for any journaled filesystem when GRUB is
involved? GRUB doesn't mount filesystems but just attempts to read them,
without doing any journal replay. Even if it is able to detect that something
is wrong with the filesystem, it just proceeds on a best-effort basis, e.g.:

https://git.savannah.gnu.org/cgit/grub.git/commit/?id=777276063e2

About the semantics for a hard failure, that's not documented in the man
page for the renameat2(2) system call, but what most filesystems do AFAICT
is revert the operation if possible and print an error.

I don't think that ending up with no file at all at the destination is a
possible outcome of a failure, since the function does a detach, an attach
and a sync, and only the sync can fail.

If the sync fails, then the detach/attach are reverted and another sync
is attempted. If that second sync succeeds, the old state is preserved;
if it fails, then nothing was synced at all, so that should be fine too,
I think.
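
To make that reasoning easier to poke holes in, here is a small userspace
model of the error path. It is not the kernel code: the directory is just a
dict, and exchange(), SyncError and the sync callback are hypothetical names
used only for illustration. It also assumes that a failed sync writes nothing
at all, which is precisely the assumption being questioned.

```python
class SyncError(Exception):
    """Models an I/O failure in the (hypothetical) sync step."""


def exchange(directory, old, new, sync):
    """Swap entries `old` and `new` in `directory` (a dict: name -> inode).

    `sync` is a callable that returns on success or raises SyncError.
    Returns True if the exchange took effect, False if it was rolled back.
    """
    # Detach + attach: swap what each name points to, in memory only.
    directory[old], directory[new] = directory[new], directory[old]
    try:
        sync()
        return True
    except SyncError:
        # Error path: undo the swap, then attempt a second sync.
        directory[old], directory[new] = directory[new], directory[old]
        try:
            sync()
        except SyncError:
            # Assumption of this model: a failed sync wrote nothing,
            # so the on-disk state is still the old one.
            pass
        return False
```

In this model both names always point at something after exchange() returns,
whether or not the sync worked. Whether the real on-disk layout can expose an
intermediate state is a separate question that the model does not answer.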

But I'm not a filesystem expert so maybe someone else more familiar with
vfat and filesystems in general could chime in.

> This relates to https://github.com/ostreedev/ostree/issues/1951
>
> TL;DR I'd been thinking that in order to have things be maximally robust we need to:
>
> 1. Write new desired bootloader config
> 2. fsync it
> 3. fsync containing directory (I guess for vfat really, syncfs())
> 4. remove old config, syncfs()
>

Yes, I've seen that issue before but I (wrongly) understood that it was a
way to work around the lack of renameat2(..., RENAME_EXCHANGE) in vfat. On
a second read I see that you also mention the issue of journaled filesystem
writes vs. no replay in the bootloader, which I mentioned above. So it makes
sense to do the two-phase commit even for journaled filesystems.
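
For reference, a minimal userspace sketch of the four steps quoted above.
The function name and the file names in the usage note are made up for
illustration. Python has no syncfs() wrapper, so os.sync() (which flushes
every mounted filesystem) is used as a coarser stand-in.

```python
import os


def two_phase_update(directory, old_name, new_name, data):
    """Write `new_name`, make it durable, and only then remove `old_name`."""
    new_path = os.path.join(directory, new_name)
    old_path = os.path.join(directory, old_name)

    # 1. Write the new desired config, 2. fsync it.
    with open(new_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    # 3. Flush the containing directory. os.sync() is an assumption here,
    #    standing in for syncfs() on the target filesystem.
    os.sync()

    # 4. Remove the old config and flush again.
    if os.path.exists(old_path):
        os.remove(old_path)
    os.sync()
```

A call such as two_phase_update("/boot/efi/loader", "loader.conf",
"loader.conf.new", b"...") would leave only the new file behind. If the
machine dies between steps 3 and 4, both files exist, which is the case the
bootloader has to handle.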

> And here the bootloader would know to prefer the "new" file if it exists, and to delete the old one if it's still present on the next boot.
>

The disadvantage of this approach is that we would then need to make all
bootloaders aware of the two-phase commit as well. I'm OK with that, but
then I believe that we should document the expectations clearly as part
of https://systemd.io/BOOT_LOADER_SPECIFICATION/.

Anyway, I don't think this is the place to discuss that, so we should
just focus on the actual kernel patches :)

> (Now obviously this is a small patch which will surely be generally useful, e.g. for tools that operate on things like mounted USB sticks, being able to do an atomic exchange at least from the running kernel PoV is just as useful as it is on other "regular" (and journaled) mounted filesystems)
>

Agreed. I think that it wouldn't hurt to have this implementation in vfat.

> So assuming we have this, I guess the flow could be:
>
> 1. rename_exchange(old, new)
> 2. syncfs()
>

Correct. In fact, Alex pointed out to me that I should do a sync in the test
too, before checking that the rename succeeded. I was mostly interested in
whether the logic worked even if only the in-memory representation or page
cache was used. But I've added a `sudo sync -f "${MNT_PATH}"` for the next
iteration.
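
From userspace, the exchange-then-sync flow could look roughly like the
sketch below. It calls glibc's renameat2() wrapper through ctypes;
rename_exchange() and exchange_and_sync() are hypothetical helper names, and
os.sync() is again only a stand-in for syncfs() / `sync -f`. On a filesystem
without RENAME_EXCHANGE support, the call is expected to fail with EINVAL.

```python
import ctypes
import errno
import os

AT_FDCWD = -100            # from <fcntl.h>
RENAME_EXCHANGE = 1 << 1   # from <linux/fs.h>


def rename_exchange(a, b):
    """Atomically swap paths `a` and `b` using renameat2(2)."""
    libc = ctypes.CDLL(None, use_errno=True)
    if not hasattr(libc, "renameat2"):
        raise OSError(errno.ENOSYS, "libc has no renameat2() wrapper")
    libc.renameat2.argtypes = [ctypes.c_int, ctypes.c_char_p,
                               ctypes.c_int, ctypes.c_char_p,
                               ctypes.c_uint]
    libc.renameat2.restype = ctypes.c_int
    ret = libc.renameat2(AT_FDCWD, os.fsencode(a),
                         AT_FDCWD, os.fsencode(b), RENAME_EXCHANGE)
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def exchange_and_sync(a, b):
    # 1. rename_exchange(old, new)
    rename_exchange(a, b)
    # 2. Flush to disk (coarser than syncfs(), see the note above).
    os.sync()
```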

> ? But that's assuming that the implementation of this doesn't e.g. have any "holes" where in theory we could flush an intermediate state.
>

Ogawa said that he hasn't fully reviewed it yet, but he gave useful feedback
that I will also address in the next version. As I said, this is my first
contribution to a filesystem driver, so it would be good if people with more
experience could let me know if there are holes in the implementation.

--
Best regards,

Javier Martinez Canillas
Linux Engineering
Red Hat