Re: [PATCH 1/4] statx: Add a system call to make enhanced file info available

From: David Howells
Date: Fri Nov 18 2016 - 13:04:22 EST


Jeff Layton <jlayton@xxxxxxxxxxxxxxx> wrote:

> > We've already been through that. I wanted to call it stx_data_version but
> > that got argued down to stx_version. The problem is that what the version
> > number means is entirely filesystem dependent, and it might not just reflect
> > changes in the data.
> >
>
> It had better not just reflect data changes.
>
> knfsd populates the NFSv4 change attribute from inode->i_version. It
> _must_ have changed between subsequent queries if either the data or
> metadata has changed (basically whenever you would update either the
> ctime or the mtime).

No, I think it *should* just reflect the data changes - otherwise you have to
burn your cached data unnecessarily.

> > > So if stx_version this is intended to export the internal filesystem
> > > inode change counter (i.e. inode->i_version) then lets call it that:
> > > stx_modification_count. It's clear and unambiguous as to what it
> > > represents, especially as this counter is more than just a "data
> > > modification" counter - inode metadata modifications will also
> > > cause it to change....
> >
> > I disagree that it's unambiguous. It works like mtime, right?
>
> More like ctime + mtime mashed together.

Isn't ctime updated every time mtime is? In which case stx_change_count would
be a better name.

> > Which wouldn't be of use for certain filesystems. An example of this
> > would be AFS, where it's incremented by 1 each time a write is committed,
> > but is not updated for metadata changes. This is what matters for data
> > caching.
> >
>
> No. Basically the rules are that if something in the inode data or
> metadata changed, then it must be a "larger" value (also accounting for
> wraparound). So you also need to change it (usually by incrementing it)
> when doing namespace changes that involve it (renames, unlinks, etc.).

That's entirely filesystem dependent.

A better rule is this: do a write, then compare the data version you got back
to the version you had before. If it's increased by exactly one, there were
no other writes between your last retrieval of the attributes and your write
that just got committed. Admittedly, this assumes that the server serialises
writes to a particular file.

If all you're guaranteed is that the value increases, you can't use this
mechanism to tell that no other write happened, so the version is of limited
value.
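
As a rough sketch of the check described above (the helper name is made up
for illustration and is not part of the patch; it assumes the server bumps
the data version by exactly one per committed write and serialises writes to
the file):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide whether our own write was the only one
 * committed since we last fetched the attributes.
 *
 * before: data version fetched before issuing the write
 * after:  data version returned once the write was committed
 *
 * Assumption: +1 per committed write, writes serialised by the server.
 */
static bool only_our_write_happened(uint64_t before, uint64_t after)
{
	/* Unsigned subtraction stays correct if the counter wraps around. */
	return after - before == 1;
}
```

If this returns false, the cached data has to be treated as stale. With a
counter that is merely "larger" after a change, there is no equivalent test.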

> Adding new fields in later piecemeal patches allows us to demonstrate
> that that concept actually works.

You're probably right, but the downside is that we really need some way to
find out what's supported. On the other hand, we probably need that anyway,
hence my suggestion of an fsinfo() syscall also.

> > You really think we're going to have accurate timestamps with a resolution
> > of a millionth of a nanosecond? This means you're going to be doing a
> > 64-bit division every time you want a nanosecond timestamp.
> ...
>
> Could contemporary machines get away with just shifting down by 32
> bits?

A better way would probably be to have:

	struct timestamp {
		__u64	seconds;
		__u32	nanoseconds;
		__u32	femtoseconds;
	};

where you effectively add all the fields together with appropriate
multipliers.
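
To illustrate what "adding the fields together" might look like, here is a
minimal userspace sketch. It assumes that the femtoseconds field holds only
the remainder below one nanosecond (0 to 999,999), and it uses uint64_t and
uint32_t in place of __u64 and __u32. The helper names are hypothetical.

```c
#include <stdint.h>

/* Same layout as above, with stdint types standing in for __u64/__u32. */
struct timestamp {
	uint64_t seconds;
	uint32_t nanoseconds;	/* 0 .. 999,999,999 */
	uint32_t femtoseconds;	/* assumed: 0 .. 999,999, remainder below 1ns */
};

/*
 * Nanosecond view of the timestamp: one multiply and one add, no division.
 * The femtoseconds field is simply ignored.  The result overflows 64 bits
 * once seconds exceeds roughly 1.8e10 (about 584 years).
 */
static uint64_t ts_to_ns(const struct timestamp *ts)
{
	return ts->seconds * 1000000000ULL + ts->nanoseconds;
}

/*
 * Sub-second part in femtoseconds: nanoseconds scaled up by 10^6 plus the
 * femtosecond remainder.  The maximum is below 10^15, so it fits in 64 bits.
 */
static uint64_t ts_subsec_fs(const struct timestamp *ts)
{
	return ts->nanoseconds * 1000000ULL + ts->femtoseconds;
}
```

The point of the split is that a caller who only wants nanoseconds never
touches the finest field and never has to divide.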

But I still wonder if we really are going to move to femtosecond timestamps,
given that that's going to involve clock frequencies well in excess of 1 THz
to be useful. Even attoseconds is probably unnecessary, given that clock
frequencies don't seem to be moving much beyond a few GHz, though it's
reasonable that we could have a timestamp counter that has an attosecond
period - it's just that the processing time to deal with it seems likely to
render it unnecessary.

David