[-mm patch] Fix inotify umount hangs.

From: Anton Altaparmakov
Date: Mon Jul 04 2005 - 14:31:13 EST


Hi Andrew, Robert,

The below patch against 2.6.13-rc1-mm1 fixes the umount hangs caused by
inotify.

It excludes more inodes from being messed around with in
inotify_unmount_inodes(): the ones with zero i_count as they cannot have
any watches and the I_WILL_FREE ones which it is not allowed to do
__iget() on.

Note that at this stage MS_ACTIVE is cleared on the superblock which means
that iput() does not need shrink_icache_memory(), it evicts the inodes
reaching zero i_count itself and immediately thus if inotify does __iget
and then iput on such zero i_count inode it will evict it from icache and
from the i_sb_list! So the assumption that i_sb_list does not change
whilts inotify_umount_inodes is running is bogus. Even with this patch
the list changes, e.g. when watches are removed i_count can hit zero and
the iput at the bottom of the list_for_each_safe() loop _will_ modify
i_sb_list by evicting the inode from icache there and then.

Further to this NTFS drops references to internal system inodes (also in
icache) during its ->put_inode() method, e.g. this happens for directory
inodes, their bitmap inode is iput when the directory has its final iput
done on it which can be triggered by a removed inotify watch. This can
cause the "next_i" inode to also be evicted which leads to hangs and
crashes since it no longer is on i_sb_list and is indeed freed memory
which it is illegal to access.

This patch solves this by taking a reference on next_i for as long as it
is needed, again with the same constraits as placed on taking a reference
on the inode itself.

This may not be the prettiest but it works. (-:

Signed-off-by: Anton Altaparmakov <aia21@xxxxxxxxxx>

NB. A cleaner solution _might_ be (I am not sure, haven't tried to
write it) to do-your-own for loop instead of list_for_each_safe() and
protect the current inode with a reference and only drop the reference
when the next inode has had a reference obtained on it but this pretty
much amounts to what I do in this patch...

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

--- linux-2.6.13-rc1-mm1-vanilla/fs/inotify.c 2005-07-01 14:51:09.000000000 +0100
+++ linux-2.6.13-rc1-mm1/fs/inotify.c 2005-07-04 20:04:31.000000000 +0100
@@ -560,43 +560,65 @@ EXPORT_SYMBOL_GPL(inotify_get_cookie);
*/
void inotify_unmount_inodes(struct list_head *list)
{
- struct inode *inode, *next_i;
+ struct inode *inode, *next_i, *need_iput = NULL;

list_for_each_entry_safe(inode, next_i, list, i_sb_list) {
+ struct inode *need_iput_tmp;
struct inotify_watch *watch, *next_w;
struct list_head *watches;

/*
- * We cannot __iget() an inode in state I_CLEAR or I_FREEING,
- * which is fine becayse by that point the inode cannot have
- * any associated watches.
+ * If i_count is zero, the inode cannot have any watches and
+ * doing an __iget/iput with MS_ACTIVE clear would actually
+ * evict all inodes with zero i_count from icache which is
+ * unnecessarily violent and may in fact be illegal to do.
*/
- if (inode->i_state & (I_CLEAR | I_FREEING))
+ if (!atomic_read(&inode->i_count))
continue;
-
- /* In case the remove_watch() drops a reference */
- __iget(inode);
-
/*
- * We can safely drop inode_lock here because the per-sb list
- * of inodes must not change during unmount and iprune_sem
- * keeps shrink_icache_memory() away.
+ * We cannot __iget() an inode in state I_CLEAR, I_FREEING, or
+ * I_WILL_FREE which is fine because by that point the inode
+ * cannot have any associated watches.
+ */
+ if (inode->i_state & (I_CLEAR | I_FREEING | I_WILL_FREE))
+ continue;
+ need_iput_tmp = need_iput;
+ need_iput = NULL;
+ /* In case the remove_watch() drops a reference. */
+ if (inode != need_iput_tmp)
+ __iget(inode);
+ else
+ need_iput_tmp = NULL;
+ /* In case the dropping of a reference would nuke next_i. */
+ if ((&next_i->i_sb_list != list) &&
+ atomic_read(&next_i->i_count) &&
+ !(next_i->i_state & (I_CLEAR | I_FREEING |
+ I_WILL_FREE))) {
+ __iget(next_i);
+ need_iput = next_i;
+ }
+ /*
+ * We can safely drop inode_lock here because we hold
+ * references on both inode and next_i. Also no new inodes
+ * will be added since the umount has begun. Finally,
+ * iprune_sem keeps shrink_icache_memory() away.
*/
spin_unlock(&inode_lock);
-
- /* for each watch, send IN_UNMOUNT and then remove it */
+ if (need_iput_tmp)
+ iput(need_iput_tmp);
+ /* For each watch, send IN_UNMOUNT and then remove it. */
down(&inode->inotify_sem);
watches = &inode->inotify_watches;
list_for_each_entry_safe(watch, next_w, watches, i_list) {
struct inotify_device *dev = watch->dev;
down(&dev->sem);
- inotify_dev_queue_event(dev, watch, IN_UNMOUNT,0,NULL);
+ inotify_dev_queue_event(dev, watch, IN_UNMOUNT, 0,
+ NULL);
remove_watch(watch, dev);
up(&dev->sem);
}
up(&inode->inotify_sem);
iput(inode);
-
spin_lock(&inode_lock);
}
}
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/