Re: [PATCH 1/4] mm: Introduce revoke_file_mappings.
From: Andrew Morton
Date: Tue Sep 08 2009 - 18:20:14 EST
On Fri, 04 Sep 2009 12:25:28 -0700
ebiederm@xxxxxxxxxxxx (Eric W. Biederman) wrote:
>
> When the backing store of a file becomes inaccessible we need a
> function to remove that file from the page tables and arrange for page
> faults to receive SIGBUS until the file is unmapped.
>
> The current implementation in sysfs almost gets this correct by
> intercepting vm_ops, but fails to call vm_ops->close on revoke and in
> fact does not have quite enough information available to do so. Which
> can result in leaks for any vm_ops that depend on close to drop
> reference counts.
>
> It turns out that revoke_file_mapping is less code and a more straight
> forward solution to the problem (except for the locking), as well as
> being a general solution that can work for any mmapped and is not
> limited to sysfs.
>
> Signed-off-by: Eric W. Biederman <ebiederm@xxxxxxxxxxxxxxxxxx>
> ---
> include/linux/mm.h | 2 +
> mm/memory.c | 140 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 142 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 9a72cc7..eb6cecb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -790,6 +790,8 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping,
> unmap_mapping_range(mapping, holebegin, holelen, 0);
> }
>
> +extern void revoke_file_mappings(struct file *file);
> +
> extern int vmtruncate(struct inode * inode, loff_t offset);
> extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end);
>
> diff --git a/mm/memory.c b/mm/memory.c
> index aede2ce..4b47116 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
mm/memory.c is large, messy and ill-defined. I think this new code
would fit nicely into a new mm/revoke.c?
> @@ -55,6 +55,7 @@
> #include <linux/kallsyms.h>
> #include <linux/swapops.h>
> #include <linux/elf.h>
> +#include <linux/file.h>
>
> #include <asm/pgalloc.h>
> #include <asm/uaccess.h>
> @@ -2410,6 +2411,145 @@ void unmap_mapping_range(struct address_space *mapping,
> }
> EXPORT_SYMBOL(unmap_mapping_range);
>
> +static int revoked_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> + return VM_FAULT_SIGBUS;
> +}
> +
> +static struct vm_operations_struct revoked_vm_ops = {
> + .fault = revoked_fault,
> +};
> +
> +static void revoke_one_file_vma(struct file *file,
> + struct mm_struct *mm, unsigned long vma_addr)
It'd be nice to have a comment explaining what this function does, how
it does it, etc.
> +{
> + unsigned long start_addr, end_addr, size;
> + struct vm_area_struct *vma;
> +
> + /*
> + * Must be called with mmap_sem held for write.
> + *
> + * Holding the mmap_sem prevents all vm_ops operations from
> + * being called as well as preventing all other kinds of
> + * modifications to the mm.
Does it? I'm surprised that we'd have 100% coverage for a rule this
broad.
> + */
> +
> + /* Lookup a vma for my file address */
> + vma = find_vma(mm, vma_addr);
> + if (vma->vm_file != file)
What does it mean when this comparison failed?
> + return;
> +
> + start_addr = vma->vm_start;
> + end_addr = vma->vm_end;
> + size = end_addr - start_addr;
> +
> + /* Unlock the pages */
A "locked page" in the kernel has a specific meaning, and "being in a
VMA which has VM_LOCKED set" isn't it ;)
> + if (mm->locked_vm && (vma->vm_flags & VM_LOCKED)) {
> + mm->locked_vm -= vma_pages(vma);
> + vma->vm_flags &= ~VM_LOCKED;
So if ->locked_vm happens to be zero, we don't clear VM_LOCKED. Is
that a bug?
> + }
> +
> + /* Unmap the vma */
"Unmap the pages within the vma"
> + zap_page_range(vma, start_addr, size, NULL);
> +
> + /* Unlink the vma from the file */
> + unlink_file_vma(vma);
> +
> + /* Close the vma */
> + if (vma->vm_ops && vma->vm_ops->close)
> + vma->vm_ops->close(vma);
> + fput(vma->vm_file);
> + vma->vm_file = NULL;
> + if (vma->vm_flags & VM_EXECUTABLE)
> + removed_exe_file_vma(vma->vm_mm);
> +
> + /* Repurpose the vma */
> + vma->vm_private_data = NULL;
> + vma->vm_ops = &revoked_vm_ops;
> + vma->vm_flags &= ~(VM_NONLINEAR | VM_CAN_NONLINEAR | VM_PFNMAP);
geeze, where did this decision come from? Needs explanatory comments?
> +}
> +
> +void revoke_file_mappings(struct file *file)
Again, the semantics and intent of this code can only be divined by
reverse-engineering the implementation. This makes it hard to review
and harder to maintainer.
> +{
> + /* After a file has been marked dead update the vmas */
> + struct address_space *mapping = file->f_mapping;
> + struct vm_area_struct *vma;
> + struct prio_tree_iter iter;
> + unsigned long start_address;
> + struct mm_struct *mm;
> + int mm_users;
> +
> + /*
> + * The locking here is a bit complex.
> + *
> + * - revoke_one_file_vma needs to be able to sleep so it can
> + * call vm_ops->close().
whitespace went funny.
> + *
> + * - i_mmap_lock needs to be held to iterate the list of vmas
> + * for a file.
> + *
> + * - The mm can be exiting when we find the vma on our list.
> + *
> + * - This function can not return until we can guarantee for
> + * all vmas associated with file that no vm_ops method will
> + * be called.
> + *
> + * This code increments mm_users to ensure that the mm will
> + * not go away after it drops i_mmap_lock, and then grabs
> + * mmap_sem for write to block all other modifications to the
> + * mm, before refinding the the vma and removing it.
> + *
> + * If mm_users is already 0 indicated that exit_mmap is
> + * running on the mm the code simply drop the locks and sleeps
> + * giving exit_mmap a chance to finish. If exit_mmap has not
> + * freed our vma when we rescan the list we repeat until it has.
> + */
All the above makes one wonder "in what contexts can this function be
called" and "what are its calling preconditions".
> + spin_lock(&mapping->i_mmap_lock);
> +restart_tree:
> + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX) {
> + /* Skip quickly over vmas that do not need to be touched */
> + if (vma->vm_file != file)
> + continue;
We check this again in revoke_one_file_vma(). How come? Avoiding
races perhaps? I don't know, and I don't know how to find out.
> + start_address = vma->vm_start;
> + mm = vma->vm_mm;
> + mm_users = atomic_inc_not_zero(&mm->mm_users);
> + spin_unlock(&mapping->i_mmap_lock);
> + if (mm_users) {
> + down_write(&mm->mmap_sem);
> + revoke_one_file_vma(file, mm, start_address);
> + up_write(&mm->mmap_sem);
> + mmput(mm);
> + } else {
> + schedule(); /* wait for exit_mmap to remove the vma */
This doesn't "wait" for anything. It's a no-op unless need_resched()
happens to be true.
> + }
> + spin_lock(&mapping->i_mmap_lock);
> + goto restart_tree;
> + }
> +
> +restart_list:
> + list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) {
> + /* Skip quickly over vmas that do not need to be touched */
> + if (vma->vm_file != file)
> + continue;
> + start_address = vma->vm_start;
> + mm = vma->vm_mm;
> + mm_users = atomic_inc_not_zero(&mm->mm_users);
> + spin_unlock(&mapping->i_mmap_lock);
> + if (mm_users) {
> + down_write(&mm->mmap_sem);
> + revoke_one_file_vma(file, mm, start_address);
> + up_write(&mm->mmap_sem);
> + mmput(mm);
> + } else {
> + schedule(); /* wait for exit_mmap to remove the vma */
> + }
> + spin_lock(&mapping->i_mmap_lock);
> + goto restart_list;
> + }
> +
> + spin_unlock(&mapping->i_mmap_lock);
> +}
> +
> /**
> * vmtruncate - unmap mappings "freed" by truncate() syscall
> * @inode: inode of the file used
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/