Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

From: Andreas Dilger
Date: Mon Mar 20 2006 - 18:46:37 EST

Next message: Chris Wright: "Re: [PATCH] scm: fold __scm_send() into scm_send()"
Previous message: Jun'ichi Nomura: "Re: [PATCH 02/23] kobject: fix build error if CONFIG_SYSFS=n"
In reply to: Stephen C. Tweedie: "Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)"
Next in thread: Stephen C. Tweedie: "Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mar 20, 2006 17:38 -0500, Stephen C. Tweedie wrote:
> On Sun, 2006-03-19 at 23:36 -0700, Andreas Dilger wrote:
> > What happens to existing filesystems with large inodes that don't have
> > enough space for the extra timestamps in the first place?
>
> Sadly, they are basically out of luck, unless we change the way that
> space in the extended inode is used.
>
> We could have defined things such that you could either use the in-inode
> space, OR the external space, for xattrs, but not both. But that would
> be a performance compromise at best, for some of the most important
> xattrs (like SELinux labels, which are always there and are always
> needed) really want to be accelerated in the inode.

The fast EA space is the only reason we implemented this at all. We
also still need the external EA space for the overflow case.

> I would expect that the vast majority of filesystems won't have any
> inodes that have fully-occupied xattr space.

I would agree. The number of affected files is likely infintesimal,
given that large inodes are not enabled by default (not sure if they
are even documented), Lustre doesn't use more than a single EA on a file
except in the just-release version, and Samba4 doesn't care because it
stores its timestamps in EAs anyways due to lack of usecond timestamps.

> It would be easy enough to define a new flag that indicates that
> there is always X amount of space reserved for inode fields, and to set
> that in fsck if all inodes on the fs obey that restriction. Then it
> just comes down to picking a number X that is likely to satisfy all the
> short-term demands for new inode fields.

We could change ext3_new_inode() today to reserve, say, 12 or 16 more
bytes for timestamps, even if they are not implemented yet. Having a
field in the superblock (tunable by the admin, concievably) to reserve
a total of X bytes for i_extra_isize has some appeal though.

At a rough guess I'd want to have timestamps (27 bits for 10s of nanoseconds,
with the high 5 bits left for growing the number of seconds). If we put the
fields in "priority" order, then on those inodes that don't have much space
left we at least get the more important one(s), primarily mtime I think).

__u32 i_mtime_extended;
__u32 i_ctime_extended;
__u32 i_atime_extended;

Are there any other needs right now? My thought on the "extra" fields
in the inode is that they would always be on an "as available" basis,
so if e.g. we only had 8 bytes reserved we would get i_mtime_extended,
and i_ctime_extended, and not i_atime_extended. If there is some added
field that is so important that the kernel/filesystem can't live without
it, it would need its own {RO,IN}COMPAT flag anyways.

I think with the advent of large inodes we can be less worried about
cannibalizing the other "unused" inode fields like i_faddr, i_frag,
i_fsize. An i_blocks_high field, (even in the face of Takashi's recently
proposed patch we would still want another 16 or 32 bits for larger
files, maybe at the same time as his patch is implemented), a 32-bit
inode checksum, more bits for i_nlinks?

It would also be good to understand what HURD is actually doing with
those other fields (if anything, does it even exist anymore?), since
it is literally holding TB of space unusable on Linux ext3 filesystems
that could better be put to use. There are i_translator, i_mode_high,
and i_author held hostage by HURD, and I certainly have never seen or
heard of any good description of what they do or if Linux would/could
ever use them, or if HURD could live without them.

> > Also, if files
> > are created while the filesystem is mounted without usecond timestamps
> > they would get no usecond fields anyways. I agree that there are some
> > unlikely corner conditions that might be hit (large inode filesystem, on
> > older kernel without usec support, fills both the in-inode and external
> > block so much that there isn't 12 bytes left for the usecond timestamps,
> > and that file happens to depend on the exact accuracy of the timestamp).
> > IMHO the inconvenience of the ROCOMPAT outweighs the benefits.
>
> That's precisely the corner case that concerns me. The question is, do
> we want the filesystem to behave correctly in all cases, or do we take
> short-cuts?
>
> I think we're probably early enough in the adoption of large inodes that
> we don't have to make that compromise, and we can reserve some space for
> guaranteed use by inode fields with a single minimally-invasive compat
> change (say, a flag enabling a field in the superblock which defines how
> many bytes we can always safely use for extended inode fields.)

I'm fully in the "the chance of any real problem is vanishingly small"
camp, even though Lustre is one of the few users of large inodes. The
presence of the COMPAT field would not really be any different than just
changing ext3_new_inode() to make i_extra_isize 16 by default, except to
cause breakage against the older e2fsprogs. I don't think it is a bad
idea to implement kernel support for such a flag, but not actually set
it in the superblock unless done so by tune2fs.

Hmm, another "forward looking" change may be to add some masking of bits
in the inode i_flags word. Ted did this with great success for the
EXT2_INDEX_FL. Would it be prudent to do the same with, say, the top 4
bits of i_flags?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Chris Wright: "Re: [PATCH] scm: fold __scm_send() into scm_send()"
Previous message: Jun'ichi Nomura: "Re: [PATCH 02/23] kobject: fix build error if CONFIG_SYSFS=n"
In reply to: Stephen C. Tweedie: "Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)"
Next in thread: Stephen C. Tweedie: "Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]