Re: [man-pages RFC PATCH v4] statx, inode: document the new STATX_INO_VERSION field

From: NeilBrown
Date: Thu Sep 08 2022 - 18:41:19 EST


On Fri, 09 Sep 2022, Theodore Ts'o wrote:
> On Thu, Sep 08, 2022 at 10:33:26AM +0200, Jan Kara wrote:
> > It boils down to the fact that we don't want to call mark_inode_dirty()
> > from IOCB_NOWAIT path because for lots of filesystems that means journal
> > operation and there are high chances that may block.
> >
> > Presumably we could treat inode dirtying after i_version change similarly
> > to how we handle timestamp updates with lazytime mount option (i.e., not
> > dirty the inode immediately but only with a delay) but then the time window
> > for i_version inconsistencies due to a crash would be much larger.
>
> Perhaps this is a radical suggestion, but there seems to be a lot of
> the problems which are due to the concern "what if the file system
> crashes" (and so we need to worry about making sure that any
> increments to i_version MUST be persisted after it is incremented).
>
> Well, if we assume that unclean shutdowns are rare, then perhaps we
> shouldn't be optimizing for that case. So.... what if a file system
> had a counter which got incremented each time its journal is replayed
> representing an unclean shutdown. That shouldn't happen often, but if
> it does, there might be any number of i_version updates that may have
> gotten lost. So in that case, the NFS client should invalidate all of
> its caches.

I was also thinking that the filesystem could help close that gap, but I
didn't like the "whole filesystem is dirty" approach.
I instead imagined a "dirty" bit in the on-disk inode which was set soon
after any open-for-write and cleared when the inode was finally written
after there are no active opens and no unflushed data.
The "soon after" would set a maximum window on possible lost version
updates (which people seem to be comfortable with) without imposing a
sync IO operation on open (for first write).
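
As a rough sketch of that flag lifecycle (every name below is made up for
illustration; this is a userspace model under my own assumptions, not
existing kernel code or any filesystem's on-disk format):

```c
#include <stdbool.h>

/* Hypothetical on-disk inode; only the proposed flag is modelled. */
struct sketch_disk_inode {
    bool version_dirty;      /* the proposed on-disk "dirty" bit */
};

/* Hypothetical in-memory inode state. */
struct sketch_inode {
    struct sketch_disk_inode disk;
    int  write_opens;        /* active opens for write */
    bool unflushed_data;     /* data not yet written back */
    bool flag_pending;       /* flag wanted, but not yet on disk */
};

/* Open for write: only record that the flag must be set; no sync IO. */
void sketch_open_for_write(struct sketch_inode *in)
{
    in->write_opens++;
    if (!in->disk.version_dirty)
        in->flag_pending = true;
}

/* Runs "soon after" the open and persists the flag (assumed delayed work). */
void sketch_delayed_flag_writeback(struct sketch_inode *in)
{
    if (in->flag_pending) {
        in->disk.version_dirty = true;
        in->flag_pending = false;
    }
}

/* Close of a writable open. */
void sketch_release_write(struct sketch_inode *in)
{
    in->write_opens--;
}

/*
 * Final inode write: clear the flag only when there are no active
 * write opens and no unflushed data. Returns true if it was cleared.
 */
bool sketch_final_inode_write(struct sketch_inode *in)
{
    if (in->write_opens > 0 || in->unflushed_data)
        return false;
    in->disk.version_dirty = false;
    in->flag_pending = false;
    return true;
}
```

The delay between sketch_open_for_write() and sketch_delayed_flag_writeback()
is the window in which a crash could still lose version updates unnoticed.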

When loading an inode from disk, if the dirty flag was set then the
difference between current time and on-disk ctime (in nanoseconds) could
be added to the version number.
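
The load-time adjustment might look something like the following. Again a
sketch only: the struct, the field names and the function are hypothetical,
timestamps are assumed to be non-negative, and leaving the version alone
when the clock appears to have gone backwards is an arbitrary choice I have
not thought through.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC 1000000000ULL

/* Hypothetical on-disk inode fields relevant to the version bump. */
struct sketch_ondisk {
    uint64_t version;        /* persisted i_version value */
    int64_t  ctime_sec;      /* persisted ctime, seconds */
    uint32_t ctime_nsec;     /* persisted ctime, nanoseconds */
    bool     version_dirty;  /* the proposed "dirty" flag */
};

/* Convert a (sec, nsec) pair to nanoseconds; assumes sec >= 0. */
static uint64_t sketch_to_ns(int64_t sec, uint32_t nsec)
{
    return (uint64_t)sec * SKETCH_NSEC_PER_SEC + nsec;
}

/*
 * Version to use when loading the inode from disk. If the flag is set,
 * updates may have been lost, so jump the version forward by the number
 * of nanoseconds between the on-disk ctime and the current time.
 */
uint64_t sketch_load_version(const struct sketch_ondisk *di,
                             int64_t now_sec, uint32_t now_nsec)
{
    uint64_t now, ctime;

    if (!di->version_dirty)
        return di->version;      /* cleanly written: trust it as is */

    now = sketch_to_ns(now_sec, now_nsec);
    ctime = sketch_to_ns(di->ctime_sec, di->ctime_nsec);
    if (now <= ctime)
        return di->version;      /* clock went backwards: no bump here */

    return di->version + (now - ctime);
}
```

The bump is only meant to exceed any number of increments that could have
been lost since the last persisted ctime.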

But maybe that is too complex for the gain.

NeilBrown