RE: [PATCH/RFC] Lustre VFS patch, version 2

From: Peter J. Braam
Date: Wed Jun 02 2004 - 18:19:42 EST


Hello!

The feedback of the Lustre patches was of very high quality, thanks a
lot for studying it carefully. Things are simpler now.

Oleg Drokin and I discussed the emails extensively and here is our reply.
We have attached another collection of patches, addressing many of the
concerns.
We felt it is was perhaps easier to keep this all in one long email.

People requested to see the code that uses the patch. We have uploaded that
to:

ftp://ftp.clusterfs.com:/pub/lustre/lkml/lustre-client_and_mds.tgz

The client file system is the primary user of the kernel patch, in the
llite directory. The MDS server is a sample user of do_kern_mount. As
requested I have removed many other things from the tar ball to make
review simple (so this won't compile or run).

1. Export symbols concerns by Christoph Hellwig:

Indeed we can do without __iget, kernel_text_address, reparent_to_init
and exit_files.

We actually need do_kern_mount and truncate_complete_page. Do kern
mount is used because we use a file system namespace in servers in the
kernel without exporting it to user space (mds/handler.c). The server
file systems are ext3 file systems but we replace VFS locking with DLM
locks, and it would take considerable work to export that as a file
system.

Truncate_complete_page is used to remove pages in the middle of a file
mapping, when lock revocations happen (llite/file.c
ll_extent_lock_callback, calling ll_pgcache_remove_extent) .

2. lustre_version.patch concerns by Christoph Hellwig:

This one can easily be removed, but kernel version alone does not
necessarily represent anything useful. There are tons of people
patching their kernel with patches, even applying parts of newer
kernel and still leaving kernel version at its old value
(distributions immediately come to mind). So we still need something
to identify version of necessary bits. E.g. version of intent API.

3. Introduction of lock-less version of d_rehash (__d_rehash) by
Christoph Hellwig:

In some places lustre needs to do several things to dentry's with
dcache lock held already, e.g. traverse alias dentries in inode to
find one with same name and parent as the one we have already. Lustre
can invalidate busy dentries, which we put on a list. If these are
looked up again, concurrently, we find them on this list and re-use
them, to avoid having several identical aliases in an inode. See
llite/{dcache.c,namei.c} ll_revalidate and the lock callback function
ll_mdc_blocking_ast which calls ll_unhash_aliases. We use d_move to
manipulate dentries associated with raw inodes and names in ext3.

4. vfs intent API changes kernel exported concern API by Christoph
Hellwig:

With slight modification it is possible to reduce the changes to just
changes in the name of intent structure itself and some of its
fields.

This renaming was requested by Linus, but we can change names back
easily if needed, that would avoid any api change. Are there other
users, please let us know what to do?

All the functions can easily be split into valid intent expecting ones
(with some suffix in name like _it) and those that are part of old API
would just initialise the intent to something sensible and then call
corresponding intent-expecting function. No harm should be done to
external filesystems this way. We have modified vfs intent API patch
to achieve this.

5. Some objections from Trond Myklebust about open flags in exec, cwd
revalidation, and revalidate_counter patch:

We have fixed the exec open flags issue (our error). Also
revalidate_counter patch was dropped since we can do this inside
lustre as well. CWD revalidation can be converted to FS_REVAL_DOT in
fs flags instead, but we still need part of that patch, the
LOOKUP_LAST/LOOKUP_NOT_LAST part. Lustre needs to know when we reached
the last component in the path so that intent needs to be looked
at. (It seems we cannot use LOOKUP_CONTINUE for this reliably).

6. from Trond Myklebust:

> The vfs-intent_lustre-vanilla-2.6.patch + the "intent_release()"
> code. What if you end up crossing a mountpoint? How do you then know
> to which superblock/filesystem the private field belongs if there are
> more than one user of this mechanism?

Basically intent only makes sence for the last component. Our code
checks that and if we are doing lookup a component before the last,
then a dummy IT_LOOKUP intent is created on stack and we work with
that, perhaps the same is true for other filesystems that would like
to use this mechanism.

7. raw operations concerns by various people:

We have now implemented an alternative approach to this, that is
taking place when parent lookup is done, using intents. For setattr
we managed to remove the raw operations alltogether, (praying that we
haven't forgotten some awful problem we solved that led to the
introduction of setattr_raw in the first place).

The correctly filled intent is recognised by filesystem's lookup or
revalidate method. After the parent is looked up, based on the intent
the correct "raw" server call is executed, within the file
system. Then a special flag is set in intent, the caller of parent
lookup checks for the flag and if it is set, the functions returns
immediately with supplied (in intent)exit code, without instantiating
child dentries.

This needs some minor changes to VFS, though. There are at
least two approaches.

One is to not introduce any new methods and just rely on fs' metohds
to do everything, for this to work filesystem needs to know the
remaining path to be traversed (we can fill nd->last with remaining
path before calling into fs). In the root directory of the mount, we
need to call a revalidate (if supported by fs) on mountpoint to
intercept the intent, after we crossed mountpoint. We have this
approach implemented in that attached patch. Does it look better than
the raw operations?

Much simpler for us is to add additional inode operation
"process_intent" method that would be called when LOOKUP_PARENT sort
of lookup was requested and we are about to leave link_path_walk()
with nameidata structure filled and everything ready. Then the same
flag in intent will be set and everything else as in previous
approach.

We believe both methods are less intrusive than the raw methods, but
slightly more delicate.

8. Mountpoint-crossing issues during rename (and link) noticed by
Arjan van de Ven:

Well, indeed this can happen if source or destination is a mountpoint
on client but not server, this needs to be addressed by individual
filesystems that chose to implement those raw methods.

9. dev_readonly patch concerns by Jens Axboe:

We already clarified why we need it in this exact way. But there were
some valid suggestions to use other means like dm-flakey device mapper
module, so we decided to write a failure simulator DM.

10. "Have these patches undergone any siginifant test?" by Anton Blanchard:

There are two important questions I think:
- Do the patches cause damage?
Probably not anymore. SUSE has done testing and it appears the
original patch I attached didn't break things (after one fix was
made).
- Is Lustre stable?
On 2.4 Lustre is quite stable. On 2.6 we have done testing but,
for example, never more than on 40 nodes. We don't consider it
rock solid on 2.6, it does pass POSIX and just about every other
benchmark without failures.

Since the patches were modified for this discussion there are of
course some new issues which Oleg Drokin is now ironing out.

Our test results are visible at https://buffalo.lustre.org

Well, how close are we now to this being acceptable?

- Peter J. Braam & Oleg Drokin -

Attachment: export-vanilla-2.6.patch
Description: Binary data

Attachment: header_guards-vanilla-2.6.patch
Description: Binary data

Attachment: lustre_version.patch
Description: Binary data

Attachment: vanilla-2.6.6
Description: Binary data

Attachment: vfs_intent-flags_rename-vanilla-2.6.patch
Description: Binary data

Attachment: vfs-dcache_locking-vanilla-2.6.patch
Description: Binary data

Attachment: vfs-dcache_lustre_invalid-vanilla-2.6.patch
Description: Binary data

Attachment: vfs-do_truncate.patch
Description: Binary data

Attachment: vfs-intent_api-vanilla-2.6.patch
Description: Binary data

Attachment: vfs-intent_lustre-vanilla-2.6.patch
Description: Binary data

Attachment: vfs-lookup_last-vanilla-2.6.patch
Description: Binary data

Attachment: vfs-raw_ops-vanilla-2.6.patch
Description: Binary data