Re: [PATCH 3/3] Add a pair of system calls to make extended filestats available [ver #2]

From: Trond Myklebust
Date: Tue Jun 29 2010 - 21:49:35 EST


On Wed, 2010-06-30 at 02:17 +0100, David Howells wrote:
> Add a pair of system calls to make extended file stats available, including
> file creation time, inode version and data version where available through the
> underlying filesystem:
>
> 	struct xstat_dev {
> 		unsigned int major;
> 		unsigned int minor;
> 	};
>
> 	struct xstat_time {
> 		unsigned long long tv_sec;
> 		unsigned long long tv_nsec;
> 	};
>
> 	struct xstat {
> 		unsigned int struct_version;
> #define XSTAT_STRUCT_VERSION 0
> 		unsigned int st_mode;
> 		unsigned int st_nlink;
> 		unsigned int st_uid;
> 		unsigned int st_gid;
> 		unsigned int st_blksize;
> 		struct xstat_dev st_rdev;
> 		struct xstat_dev st_dev;
> 		unsigned long long st_ino;
> 		unsigned long long st_size;
> 		struct xstat_time st_atime;
> 		struct xstat_time st_mtime;
> 		struct xstat_time st_ctime;
> 		struct xstat_time st_btime;
> 		unsigned long long st_blocks;
> 		unsigned long long st_gen;
> 		unsigned long long st_data_version;
> 		unsigned long long query_flags;
> #define XSTAT_QUERY_SIZE 0x00000001ULL
> #define XSTAT_QUERY_NLINK 0x00000002ULL
> #define XSTAT_QUERY_AMC_TIMES 0x00000004ULL
> #define XSTAT_QUERY_CREATION_TIME 0x00000008ULL
> #define XSTAT_QUERY_BLOCKS 0x00000010ULL
> #define XSTAT_QUERY_INODE_GENERATION 0x00000020ULL
> #define XSTAT_QUERY_DATA_VERSION 0x00000040ULL
> 		unsigned long long extra_results[0];
> 	};
>
> 	ssize_t ret = xstat(int dfd,
> 			    const char *filename,
> 			    unsigned atflag,
> 			    struct xstat *buffer,
> 			    size_t buflen);
>
> 	ssize_t ret = fxstat(int fd,
> 			     struct xstat *buffer,
> 			     size_t buflen);
>
>
> The dfd, filename, atflag and fd parameters indicate the file to query. There
> is no equivalent of lstat() as that can be emulated with xstat() by passing
> AT_SYMLINK_NOFOLLOW as atflag (passing 0 follows symlinks, as stat() does).
>
> When the system call is executed, the struct_version ID and query_flags bitmask
> are read from the buffer to work out what the user is requesting.
>
> If the structure version specified is not supported, the system call will
> return ENOTSUPP. The above structure is version 0.
>
> The query_flags should be set by the caller to specify extra results that the
> caller may desire. These come in three classes:
>
> (1) Size, nlinks, [amc]times and block count.
>
> These will be returned whether the caller asks for them or not. The
> corresponding bits in query_flags will be set to indicate their presence.
>
> If the caller didn't ask for them, then they may be approximated. For
> example, NFS won't waste any time updating them from the server, unless
> as a byproduct of updating something requested.
>
>     Query Flag                      Field
>     =============================== ================
>     XSTAT_QUERY_SIZE                st_size
>     XSTAT_QUERY_NLINK               st_nlink
>     XSTAT_QUERY_AMC_TIMES           st_[amc]time
>     XSTAT_QUERY_BLOCKS              st_blocks
>
> (2) Creation time, Inode generation and Data version.
>
> These will be returned if available whether the caller asked for them or
> not. The corresponding bits in query_flags will be set or cleared as
> appropriate to indicate their presence.
>
>     Query Flag                      Field
>     =============================== ================
>     XSTAT_QUERY_CREATION_TIME       st_btime
>     XSTAT_QUERY_INODE_GENERATION    st_gen
>     XSTAT_QUERY_DATA_VERSION        st_data_version
>
> If the caller didn't ask for them, then they may be approximated. For
> example, NFS won't waste any time updating them from the server, unless
> as a byproduct of updating something requested.
>
> (3) Extra results.
>
> These will only be returned if the caller asked for them by setting their
> bits in query_flags. They will be placed in the buffer after the xstat
> struct in ascending query_flags bit order. Any bit set in query_flags
> mask will be left set if the result is available and cleared otherwise.
>
> The pointer into the results list will be rounded up to the nearest 8-byte
> boundary after each result is written in. The size of each extra result
> is specific to the definition for that result.
>
> No extra results are currently defined.
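
As a rough illustration of the layout rule for extra results quoted above, the
sketch below computes where each result would start if the offset into
extra_results[] is rounded up to an 8-byte boundary after every write. The
patch defines no extra results, so the helper names round_up8() and
layout_extra() and any result sizes fed to them are hypothetical and not part
of the proposed interface.

```c
#include <stddef.h>

/* Round an offset up to the next multiple of 8 (hypothetical helper). */
static size_t round_up8(size_t off)
{
	return (off + 7) & ~(size_t)7;
}

/*
 * Given the sizes of n extra results, store the starting offset of each
 * (relative to extra_results[]) in starts[] and return the total space
 * consumed, including the padding after the last result.
 */
static size_t layout_extra(const size_t *sizes, size_t n, size_t *starts)
{
	size_t off = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		starts[i] = off;
		/* The pointer is rounded up after each result is written. */
		off = round_up8(off + sizes[i]);
	}
	return off;
}
```

With made-up result sizes of 3, 8 and 12 bytes, the results would start at
offsets 0, 8 and 16, and the extra-results area would occupy 32 bytes.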
>
> If the buffer is insufficiently big, the syscall returns the amount of space it
> will need to write the complete result set, but otherwise does nothing.
>
> If successful, the amount of data written into the buffer will be returned.
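
One way a caller might cope with the return convention quoted above is
sketched here. Since the proposed syscall is not available in any released
kernel, mock_xstat() is a stand-in written only for this sketch, and
xstat_alloc() is a hypothetical helper. The sketch assumes that a return value
larger than the buffer length means "too small, this much is needed", and that
a value no larger than the buffer length is the number of bytes written.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/*
 * Stand-in for the proposed syscall so that the sketch compiles anywhere.
 * It pretends the complete result set needs `needed` bytes: if the buffer
 * is too small it writes nothing and returns the space required, otherwise
 * it fills the buffer and returns the number of bytes written.
 */
static ssize_t mock_xstat(void *buffer, size_t buflen, size_t needed)
{
	if (buflen < needed)
		return (ssize_t)needed;
	memset(buffer, 0, needed);
	return (ssize_t)needed;
}

/*
 * Start with a guessed buffer size and grow to whatever the call asks for.
 * Returns a malloc()ed buffer (the caller frees it) and stores the number
 * of bytes written in *lenp, or returns NULL on failure.
 */
static void *xstat_alloc(size_t guess, size_t needed, size_t *lenp)
{
	size_t buflen = guess;
	void *buf = malloc(buflen);
	void *tmp;
	ssize_t ret;

	while (buf) {
		ret = mock_xstat(buf, buflen, needed);
		if (ret < 0)
			break;			/* genuine error */
		if ((size_t)ret <= buflen) {
			*lenp = (size_t)ret;	/* success: bytes written */
			return buf;
		}
		/* Too small: ret is the space required, so grow and retry. */
		tmp = realloc(buf, (size_t)ret);
		if (!tmp)
			break;
		buf = tmp;
		buflen = (size_t)ret;
	}
	free(buf);
	return NULL;
}
```

Retrying in a loop rather than once allows for the required size changing
between calls, which could matter if extra results are ever defined.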
>
> At the moment, this will only work on x86_64 as it requires system calls to be
> wired up.
>
>
> ===========
> FILESYSTEMS
> ===========
>
> The following filesystems have been modified to make use of this facility:
>
> (*) Ext4. This will return the creation time and inode version number for all
> files. It will, however, only return the data version number for
> directories as i_version is only maintained for them.
>
> (*) AFS. This will return the vnode ID uniquifier as the inode version and
> the AFS data version number as the data version. There is no file
> creation time available.
>
> (*) NFS. This will return the change attribute if NFSv4 only. No other extra
> values are returned at this time. If mtime and ctime aren't asked for,
> the outstanding writes won't be written to the server. If none of
> [amc]time, size, nlink, blocks and data_version are requested, then the
> attributes won't be refreshed from the server.
>
> Probably this isn't sufficient, as the other non-optional attributes may
> require refreshing.
>
>
> =======
> TESTING
> =======
>
> The following test program can be used to test the xstat system call:
>
> #define _GNU_SOURCE
> #define _ATFILE_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <time.h>
> #include <sys/syscall.h>
> #include <sys/stat.h>
> #include <sys/types.h>
>
> 	struct xstat_dev {
> 		unsigned int major;
> 		unsigned int minor;
> 	};
>
> 	struct xstat_time {
> 		unsigned long long tv_sec;
> 		unsigned long long tv_nsec;
> 	};
>
> 	struct xstat {
> 		unsigned int struct_version;
> #define XSTAT_STRUCT_VERSION 0
> 		unsigned int st_mode;
> 		unsigned int st_nlink;
> 		unsigned int st_uid;
> 		unsigned int st_gid;
> 		unsigned int st_blksize;
> 		struct xstat_dev st_rdev;
> 		struct xstat_dev st_dev;
> 		unsigned long long st_ino;
> 		unsigned long long st_size;
> 		struct xstat_time st_atim;
> 		struct xstat_time st_mtim;
> 		struct xstat_time st_ctim;
> 		struct xstat_time st_btim;
> 		unsigned long long st_blocks;
> 		unsigned long long st_gen;
> 		unsigned long long st_data_version;
> 		unsigned long long query_flags;
> #define XSTAT_QUERY_SIZE 0x00000001ULL /* want/got st_size */
> #define XSTAT_QUERY_NLINK 0x00000002ULL /* want/got st_nlink */
> #define XSTAT_QUERY_AMC_TIMES 0x00000004ULL /* want/got st_[amc]time */
> #define XSTAT_QUERY_CREATION_TIME 0x00000008ULL /* want/got st_btime */
> #define XSTAT_QUERY_BLOCKS 0x00000010ULL /* want/got st_blocks */
> #define XSTAT_QUERY_INODE_GENERATION 0x00000020ULL /* want/got st_gen */
> #define XSTAT_QUERY_DATA_VERSION 0x00000040ULL /* want/got st_data_version */
> #define XSTAT_QUERY__ORDINARY_SET 0x00000017ULL /* the stuff in the normal stat struct */
> #define XSTAT_QUERY__GET_ANYWAY 0x0000007fULL /* what we get anyway if available */
> #define XSTAT_QUERY__DEFINED_SET 0x0000007fULL /* the defined set of flags */
> 		unsigned long long extra_results[0];
> 	};
>
> #define __NR_xstat 300
> #define __NR_fxstat 301
>
> 	static __attribute__((unused))
> 	ssize_t xstat(int dfd, const char *filename, int atflag,
> 		      struct xstat *buffer, size_t bufsize)
> 	{
> 		return syscall(__NR_xstat, dfd, filename, atflag, buffer, bufsize);
> 	}
>
> 	static __attribute__((unused))
> 	ssize_t fxstat(int fd, struct xstat *buffer, size_t bufsize)
> 	{
> 		return syscall(__NR_fxstat, fd, buffer, bufsize);
> 	}
>
> 	static void print_time(const struct xstat_time *xstm)
> 	{
> 		struct tm tm;
> 		time_t tim;
> 		char buffer[100];
> 		int len;
>
> 		tim = xstm->tv_sec;
> 		if (!localtime_r(&tim, &tm)) {
> 			perror("localtime_r");
> 			exit(1);
> 		}
> 		len = strftime(buffer, 100, "%F %T", &tm);
> 		if (len == 0) {
> 			perror("strftime");
> 			exit(1);
> 		}
> 		fwrite(buffer, 1, len, stdout);
> 		printf(".%09llu", xstm->tv_nsec);
> 		len = strftime(buffer, 100, "%z", &tm);
> 		if (len == 0) {
> 			perror("strftime2");
> 			exit(1);
> 		}
> 		fwrite(buffer, 1, len, stdout);
> 	}
>
> 	static void dump_xstat(struct xstat *xst)
> 	{
> 		char buffer[256], ft;
>
> 		printf(" ");
> 		if (xst->query_flags & XSTAT_QUERY_SIZE)
> 			printf(" Size: %-15llu", xst->st_size);
> 		if (xst->query_flags & XSTAT_QUERY_BLOCKS)
> 			printf(" Blocks: %-10llu", xst->st_blocks);
> 		printf(" IO Block: %-6u ", xst->st_blksize);
> 		switch (xst->st_mode & S_IFMT) {
> 		case S_IFIFO: printf(" FIFO\n"); ft = 'p'; break;
> 		case S_IFCHR: printf(" character special file\n"); ft = 'c'; break;
> 		case S_IFDIR: printf(" directory\n"); ft = 'd'; break;
> 		case S_IFBLK: printf(" block special file\n"); ft = 'b'; break;
> 		case S_IFREG: printf(" regular file\n"); ft = '-'; break;
> 		case S_IFLNK: printf(" symbolic link\n"); ft = 'l'; break;
> 		case S_IFSOCK: printf(" socket\n"); ft = 's'; break;
> 		default:
> 			printf("unknown type (%o)\n", xst->st_mode & S_IFMT);
> 			ft = '?';
> 			break;
> 		}
>
> 		sprintf(buffer, "%02x:%02x", xst->st_dev.major, xst->st_dev.minor);
> 		printf("Device: %-15s Inode: %-11llu", buffer, xst->st_ino);
> 		/* The link count is governed by the NLINK bit, not the SIZE bit */
> 		if (xst->query_flags & XSTAT_QUERY_NLINK)
> 			printf(" Links: %u", xst->st_nlink);
> 		printf("\n");
>
> 		printf("Access: (%04o/%c%c%c%c%c%c%c%c%c%c) ",
> 		       xst->st_mode & 07777,
> 		       ft,
> 		       xst->st_mode & S_IRUSR ? 'r' : '-',
> 		       xst->st_mode & S_IWUSR ? 'w' : '-',
> 		       xst->st_mode & S_IXUSR ? 'x' : '-',
> 		       xst->st_mode & S_IRGRP ? 'r' : '-',
> 		       xst->st_mode & S_IWGRP ? 'w' : '-',
> 		       xst->st_mode & S_IXGRP ? 'x' : '-',
> 		       xst->st_mode & S_IROTH ? 'r' : '-',
> 		       xst->st_mode & S_IWOTH ? 'w' : '-',
> 		       xst->st_mode & S_IXOTH ? 'x' : '-');
> 		printf("Uid: %d Gid: %u\n", xst->st_uid, xst->st_gid);
>
> 		if (xst->query_flags & XSTAT_QUERY_AMC_TIMES) {
> 			printf("Access: "); print_time(&xst->st_atim); printf("\n");
> 			printf("Modify: "); print_time(&xst->st_mtim); printf("\n");
> 			printf("Change: "); print_time(&xst->st_ctim); printf("\n");
> 		}
> 		if (xst->query_flags & XSTAT_QUERY_CREATION_TIME) {
> 			printf("Create: "); print_time(&xst->st_btim); printf("\n");
> 		}
>
> 		if (xst->query_flags & XSTAT_QUERY_INODE_GENERATION)
> 			printf("Inode version: %llxh\n", xst->st_gen);
> 		if (xst->query_flags & XSTAT_QUERY_DATA_VERSION)
> 			printf("Data version: %llxh\n", xst->st_data_version);
> 	}
>
> 	int main(int argc, char **argv)
> 	{
> 		struct xstat xst;
> 		int ret, atflag = AT_SYMLINK_NOFOLLOW;
>
> 		unsigned long long query =
> 			XSTAT_QUERY__ORDINARY_SET |
> 			XSTAT_QUERY_CREATION_TIME |
> 			XSTAT_QUERY_INODE_GENERATION |
> 			XSTAT_QUERY_DATA_VERSION;
>
> 		for (argv++; *argv; argv++) {
> 			if (strcmp(*argv, "-L") == 0) {
> 				atflag = 0;
> 				continue;
> 			}
> 			if (strcmp(*argv, "-O") == 0) {
> 				query &= ~XSTAT_QUERY__ORDINARY_SET;
> 				continue;
> 			}
>
> 			memset(&xst, 0xbf, sizeof(xst));
> 			xst.struct_version = 0;
> 			xst.query_flags = query;
> 			ret = xstat(AT_FDCWD, *argv, atflag, &xst, sizeof(xst));
> 			printf("xstat(%s) = %d\n", *argv, ret);
> 			if (ret < 0) {
> 				perror(*argv);
> 				exit(1);
> 			}
>
> 			printf("sv=%u qf=%llx cr=%llx.%llx iv=%llx dv=%llx\n",
> 			       xst.struct_version, xst.query_flags,
> 			       xst.st_btim.tv_sec, xst.st_btim.tv_nsec,
> 			       xst.st_gen, xst.st_data_version);
>
> 			dump_xstat(&xst);
> 		}
> 		return 0;
> 	}
>
> Just compile and run, passing it paths to the files you want to examine:
>
> [root@andromeda ~]# /tmp/xstat /afs/archive/linuxdev/fedora9/i386/repodata/
> xstat(/afs/archive/linuxdev/fedora9/i386/repodata/) = 152
> sv=0 qf=77 cr=0.0 iv=7a5 dv=5
> Size: 2048 Blocks: 0 IO Block: 4096 directory
> Device: 00:15 Inode: 83 Links: 2
> Access: (0755/drwxr-xr-x) Uid: 75338 Gid: 0
> Access: 2008-11-05 20:00:12.000000000+0000
> Modify: 2008-11-05 20:00:12.000000000+0000
> Change: 2008-11-05 20:00:12.000000000+0000
> Inode version: 7a5h
> Data version: 5h
>
> [root@andromeda ~]# /tmp/xstat /warthog/nfs/linux-2.6-fscache
> xstat(/warthog/nfs/linux-2.6-fscache) = 152
> sv=0 qf=57 cr=0.0 iv=0 dv=f4992a4c00000000
> Size: 4096 Blocks: 16 IO Block: 1048576 directory
> Device: 00:13 Inode: 19005487 Links: 27
> Access: (2775/drwxrwxr-x) Uid: -2 Gid: 4294967294
> Access: 2010-06-30 02:07:42.000000000+0100
> Modify: 2010-06-30 02:12:20.000000000+0100
> Change: 2010-06-30 02:12:20.000000000+0100
> Data version: f4992a4c00000000h
>
> [root@andromeda ~]# /tmp/xstat /var/cache/fscache/cache/
> xstat(/var/cache/fscache/cache/) = 152
> sv=0 qf=7f cr=4c24ba83.1c15ee3d iv=f585ab70 dv=2
> Size: 4096 Blocks: 16 IO Block: 4096 directory
> Device: 08:06 Inode: 130561 Links: 3
> Access: (0700/drwx------) Uid: 0 Gid: 0
> Access: 2010-06-29 18:16:33.680703545+0100
> Modify: 2010-06-29 18:16:20.132786632+0100
> Change: 2010-06-29 18:16:20.132786632+0100
> Create: 2010-06-25 15:17:39.471199293+0100
> Inode version: f585ab70h
> Data version: 2h
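
To make the qf= values in the sample output above easier to read, here is a
small sketch that turns a returned query_flags mask into a list of names. The
bit values are the XSTAT_QUERY_* definitions from the patch; decode_qf() and
the short names it prints are hypothetical and exist only for this
illustration.

```c
#include <stddef.h>
#include <string.h>

#define XSTAT_QUERY_SIZE		0x00000001ULL
#define XSTAT_QUERY_NLINK		0x00000002ULL
#define XSTAT_QUERY_AMC_TIMES		0x00000004ULL
#define XSTAT_QUERY_CREATION_TIME	0x00000008ULL
#define XSTAT_QUERY_BLOCKS		0x00000010ULL
#define XSTAT_QUERY_INODE_GENERATION	0x00000020ULL
#define XSTAT_QUERY_DATA_VERSION	0x00000040ULL

/*
 * Write a space-separated list of names for the bits set in qf into out,
 * in ascending bit order.  outlen must be at least 1; the output is
 * truncated if it does not fit.
 */
static void decode_qf(unsigned long long qf, char *out, size_t outlen)
{
	static const struct {
		unsigned long long bit;
		const char *name;
	} tab[] = {
		{ XSTAT_QUERY_SIZE,		"size" },
		{ XSTAT_QUERY_NLINK,		"nlink" },
		{ XSTAT_QUERY_AMC_TIMES,	"amc_times" },
		{ XSTAT_QUERY_CREATION_TIME,	"btime" },
		{ XSTAT_QUERY_BLOCKS,		"blocks" },
		{ XSTAT_QUERY_INODE_GENERATION,	"gen" },
		{ XSTAT_QUERY_DATA_VERSION,	"data_version" },
	};
	size_t i;

	out[0] = '\0';
	for (i = 0; i < sizeof(tab) / sizeof(tab[0]); i++) {
		if (!(qf & tab[i].bit))
			continue;
		if (out[0])
			strncat(out, " ", outlen - strlen(out) - 1);
		strncat(out, tab[i].name, outlen - strlen(out) - 1);
	}
}
```

Applied to the three runs above, qf=77 (AFS) has every bit except the creation
time, qf=57 (NFS) lacks both the creation time and the inode generation, and
qf=7f (ext4) has all seven defined bits set.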

Yes, but could we please also add a flag that allows you to specify that
the kernel _must_ provide up-to-date attributes?

IOW: a flag that for something like NFS or CIFS will force a GETATTR RPC
call on the wire as opposed to using cached values.

Cheers
Trond
