Re: [PATCH 2/5] revoke: core code

From: Andrew Morton
Date: Thu Mar 15 2007 - 21:35:22 EST


On Sun, 11 Mar 2007 13:30:49 +0200 (EET) Pekka J Enberg <penberg@xxxxxxxxxxxxxx> wrote:

> From: Pekka Enberg <penberg@xxxxxxxxxxxxxx>
>
> The revokeat(2) and frevoke(2) system calls invalidate open file
> descriptors and shared mappings of an inode. After an successful
> revocation, operations on file descriptors fail with the EBADF or
> ENXIO error code for regular and device files,
> respectively. Attempting to read from or write to a revoked mapping
> causes SIGBUS.
>
> The actual operation is done in two passes:
>
> 1. Revoke all file descriptors that point to the given inode. We do
> this under tasklist_lock so that after this pass, we don't need
> to worry about racing with close(2) or dup(2).
>
> 2. Take down shared memory mappings of the inode and close all file
> pointers.
>
> The file descriptors and memory mapping ranges are preserved until the
> owning task does close(2) and munmap(2), respectively.
>
> ...
>
> +asmlinkage int sys_revokeat(int dfd, const char __user *filename);
> +asmlinkage int sys_frevoke(unsigned int fd);

noooo.... all system calls must return long.

> +static int revoke_vma(struct vm_area_struct *vma, struct zap_details *details)
> +{
> + unsigned long restart_addr, start_addr, end_addr;
> + int need_break;
> +
> + start_addr = vma->vm_start;
> + end_addr = vma->vm_end;
> +
> + /*
> + * Not holding ->mmap_sem here.
> + */
> + vma->vm_flags |= VM_REVOKED;

so.... the modification of vm_flags is racy?

> + smp_mb();

Please always document barriers. There's presumably some vm_flags reader
we're concerned about here, but how is the code reader to know what the
code writer was thinking?


> + again:
> + restart_addr = zap_page_range(vma, start_addr, end_addr - start_addr,
> + details);
> +
> + need_break = need_resched() || need_lockbreak(details->i_mmap_lock);
> + if (need_break)
> + goto out_need_break;
> +
> + if (restart_addr < end_addr) {
> + start_addr = restart_addr;
> + goto again;
> + }
> + return 0;
> +
> + out_need_break:
> + spin_unlock(details->i_mmap_lock);
> + cond_resched();
> + spin_lock(details->i_mmap_lock);
> + return -EINTR;
> +}
> +
> +static int revoke_mapping(struct address_space *mapping, struct file *to_exclude)
> +{
> + struct vm_area_struct *vma;
> + struct prio_tree_iter iter;
> + struct zap_details details;
> + int err = 0;
> +
> + details.i_mmap_lock = &mapping->i_mmap_lock;
> +
> + spin_lock(&mapping->i_mmap_lock);
> + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX) {
> + if ((vma->vm_flags & VM_SHARED) && vma->vm_file != to_exclude) {
> + err = revoke_vma(vma, &details);
> + if (err)
> + goto out;
> + }
> + }
> +
> + list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) {
> + if ((vma->vm_flags & VM_SHARED) && vma->vm_file != to_exclude) {
> + err = revoke_vma(vma, &details);
> + if (err)
> + goto out;
> + }
> + }
> + out:
> + spin_unlock(&mapping->i_mmap_lock);
> + return err;
> +}

This all looks very strange. If the calling process expires its timeslice,
the entire system call fails?

What's happening here?


> +
> +int generic_file_revoke(struct file *file)
> +{
> + int err;
> +
> + /*
> + * Flush pending writes.
> + */
> + err = do_fsync(file, 1);
> + if (err)
> + goto out;
> +
> + /*
> + * Make pending reads fail.
> + */
> + err = invalidate_inode_pages2(file->f_mapping);
> +
> + out:
> + return err;
> +}
> +
> +EXPORT_SYMBOL(generic_file_revoke);

do_fsync() is seriously suboptimal - it will run an ext3 commit.
do_sync_file_range(...,
SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER)
will run maybe five times quicker.

But otoh, do_sync_file_range() will fail to write back the pages for a
data=journal ext3 file, I expect (oops).


Why is this code using invalidate_inode_pages2()? That function keeps on
breaking, has ill-defined semantics and will probably change in the future.

Exactly what semantics are you looking for here, and why?

The blank line before the EXPORT_SYMBOL() is a waste of space.

> +/*
> + * Filesystem for revoked files.
> + */
> +
> +static struct inode *revokefs_alloc_inode(struct super_block *sb)
> +{
> + struct revokefs_inode_info *info;
> +
> + info = kmem_cache_alloc(revokefs_inode_cache, GFP_NOFS);
> + if (!info)
> + return NULL;
> +
> + return &info->vfs_inode;
> +}

Why GFP_NOFS?

> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ uml-2.6/include/linux/revoked_fs_i.h 2007-03-11 13:09:20.000000000 +0200
> @@ -0,0 +1,20 @@
> +#ifndef _LINUX_REVOKED_FS_I_H
> +#define _LINUX_REVOKED_FS_I_H
> +
> +#define REVOKEFS_MAGIC 0x5245564B /* REVK */

This is supposed to go into magic.h.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/