Re: [PATCH] ummunotify: Userspace support for MMU notifications V2

From: Sayantan Sur
Date: Tue May 11 2010 - 12:17:35 EST


Hi,

I understand that this patch went through to the -mm tree.
MVAPICH/MVAPICH2 MPI stacks intend to utilize this feature as well.

Thanks.

On Thu, Apr 22, 2010 at 6:38 AM, Eric B Munson <ebmunson@xxxxxxxxxx> wrote:
> From: Roland Dreier <rolandd@xxxxxxxxx>
>
> As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925>
> and follow-up messages, libraries using RDMA would like to track
> precisely when application code changes memory mapping via free(),
> munmap(), etc.  Current pure-userspace solutions using malloc hooks
> and other tricks are not robust, and the feeling among experts is that
> the issue is unfixable without kernel help.
>
> We solve this not by implementing the full API proposed in the email
> linked above but rather with a simpler and more generic interface,
> which may be useful in other contexts.  Specifically, we implement a
> new character device driver, ummunotify, that creates a /dev/ummunotify
> node.  A userspace process can open this node read-only and use the fd
> as follows:
>
>  1. ioctl() to register/unregister an address range to watch in the
>     kernel (cf struct ummunotify_register_ioctl in <linux/ummunotify.h>).
>
>  2. read() to retrieve events generated when a mapping in a watched
>     address range is invalidated (cf struct ummunotify_event in
>     <linux/ummunotify.h>).  select()/poll()/epoll() and SIGIO are
>     handled for this IO.
>
>  3. mmap() one page at offset 0 to map a kernel page that contains a
>     generation counter that is incremented each time an event is
>     generated.  This allows userspace to have a fast path that checks
>     that no events have occurred without a system call.
>
> Thanks to Jason Gunthorpe <jgunthorpe <at> obsidianresearch.com> for
> suggestions on the interface design.  Also thanks to Jeff Squyres
> <jsquyres <at> cisco.com> for prototyping support for this in Open MPI,
> which
> helped find several bugs during development.
>
> Signed-off-by: Roland Dreier <rolandd@xxxxxxxxx>
> Signed-off-by: Eric B Munson <ebmunson@xxxxxxxxxx>
>
> ---
>
> Changes from V1:
> - Update Kbuild to handle test program build properly
> - Update documentation to cover questions not addressed in previous
>   thread
> ---
>  Documentation/Makefile                  |    3 +-
>  Documentation/ummunotify/Makefile       |    7 +
>  Documentation/ummunotify/ummunotify.txt |  162 +++++++++
>  Documentation/ummunotify/umn-test.c     |  200 +++++++++++
>  drivers/char/Kconfig                    |   12 +
>  drivers/char/Makefile                   |    1 +
>  drivers/char/ummunotify.c               |  567
> +++++++++++++++++++++++++++++++
>  include/linux/Kbuild                    |    1 +
>  include/linux/ummunotify.h              |  121 +++++++
>  9 files changed, 1073 insertions(+), 1 deletions(-)
>  create mode 100644 Documentation/ummunotify/Makefile
>  create mode 100644 Documentation/ummunotify/ummunotify.txt
>  create mode 100644 Documentation/ummunotify/umn-test.c
>  create mode 100644 drivers/char/ummunotify.c
>  create mode 100644 include/linux/ummunotify.h
>
> diff --git a/Documentation/Makefile b/Documentation/Makefile
> index 6fc7ea1..27ba76a 100644
> --- a/Documentation/Makefile
> +++ b/Documentation/Makefile
> @@ -1,3 +1,4 @@
>  obj-m := DocBook/ accounting/ auxdisplay/ connector/ \
>         filesystems/ filesystems/configfs/ ia64/ laptops/ networking/ \
> -       pcmcia/ spi/ timers/ video4linux/ vm/ watchdog/src/
> +       pcmcia/ spi/ timers/ video4linux/ vm/ ummunotify/ \
> +       watchdog/src/
> diff --git a/Documentation/ummunotify/Makefile
> b/Documentation/ummunotify/Makefile
> new file mode 100644
> index 0000000..89f31a0
> --- /dev/null
> +++ b/Documentation/ummunotify/Makefile
> @@ -0,0 +1,7 @@
> +# List of programs to build
> +hostprogs-y := umn-test
> +
> +# Tell kbuild to always build the programs
> +always := $(hostprogs-y)
> +
> +HOSTCFLAGS_umn-test.o += -I$(objtree)/usr/include
> diff --git a/Documentation/ummunotify/ummunotify.txt
> b/Documentation/ummunotify/ummunotify.txt
> new file mode 100644
> index 0000000..d6c2ccc
> --- /dev/null
> +++ b/Documentation/ummunotify/ummunotify.txt
> @@ -0,0 +1,162 @@
> +UMMUNOTIFY
> +
> +  Ummunotify relays MMU notifier events to userspace.  This is useful
> +  for libraries that need to track the memory mapping of applications;
> +  for example, MPI implementations using RDMA want to cache memory
> +  registrations for performance, but tracking all possible crazy cases
> +  such as when, say, the FORTRAN runtime frees memory is impossible
> +  without kernel help.
> +
> +Basic Model
> +
> +  A userspace process uses it by opening /dev/ummunotify, which
> +  returns a file descriptor.  Interest in address ranges is registered
> +  using ioctl() and MMU notifier events are retrieved using read(), as
> +  described in more detail below.  Userspace can register multiple
> +  address ranges to watch, and can unregister individual ranges.
> +
> +  Userspace can also mmap() a single read-only page at offset 0 on
> +  this file descriptor.  This page contains (at offest 0) a single
> +  64-bit generation counter that the kernel increments each time an
> +  MMU notifier event occurs.  Userspace can use this to very quickly
> +  check if there are any events to retrieve without needing to do a
> +  system call.
> +
> +Control
> +
> +  To start using ummunotify, a process opens /dev/ummunotify in
> +  read-only mode.  This will attach to current->mm because the current
> +  consumers of this functionality do all monitoring in the process
> +  being monitored.  It is currently not possible to use this device to
> +  monitor other processes.  Control from userspace is done via ioctl().
> +  An ioctl was chosen because the number of files required to register
> +  a new address range in sysfs would be unwieldy and new procfs entries
> +  are discouraged.  The defined ioctls are:
> +
> +    UMMUNOTIFY_EXCHANGE_FEATURES: This ioctl takes a single 32-bit
> +      word of feature flags as input, and the kernel updates the
> +      features flags word to contain only features requested by
> +      userspace and also supported by the kernel.
> +
> +      This ioctl is only included for forward compatibility; no
> +      feature flags are currently defined, and the kernel will simply
> +      update any requested feature mask to 0.  The kernel will always
> +      default to a feature mask of 0 if this ioctl is not used, so
> +      current userspace does not need to perform this ioctl.
> +
> +    UMMUNOTIFY_REGISTER_REGION: Userspace uses this ioctl to tell the
> +      kernel to start delivering events for an address range.  The
> +      range is described using struct ummunotify_register_ioctl:
> +
> +       struct ummunotify_register_ioctl {
> +               __u64   start;
> +               __u64   end;
> +               __u64   user_cookie;
> +               __u32   flags;
> +               __u32   reserved;
> +       };
> +
> +      start and end give the range of userspace virtual addresses;
> +      start is included in the range and end is not, so an example of
> +      a 4 KB range would be start=0x1000, end=0x2000.
> +
> +      user_cookie is an opaque 64-bit quantity that is returned by the
> +      kernel in events involving the range, and used by userspace to
> +      stop watching the range.  Each registered address range must
> +      have a distinct user_cookie.
> +
> +      It is fine with the kernel if userspace registers multiple
> +      overlapping or even duplicate address ranges, as long as a
> +      different cookie is used for each registration.
> +
> +      flags and reserved are included for forward compatibility;
> +      userspace should simply set them to 0 for the current interface.
> +
> +    UMMUNOTIFY_UNREGISTER_REGION: Userspace passes in the 64-bit
> +      user_cookie used to register a range to tell the kernel to stop
> +      watching an address range.  Once this ioctl completes, the
> +      kernel will not deliver any further events for the range that is
> +      unregistered.
> +
> +Events
> +
> +  When an event occurs that invalidates some of a process's memory
> +  mapping in an address range being watched, ummunotify queues an
> +  event report for that address range.  If more than one event
> +  invalidates parts of the same address range before userspace
> +  retrieves the queued report, then further reports for the same range
> +  will not be queued -- when userspace does read the queue, only a
> +  single report for a given range will be returned.
> +
> +  If multiple ranges being watched are invalidated by a single event
> +  (which is especially likely if userspace registers overlapping
> +  ranges), then an event report structure will be queued for each
> +  address range registration.
> +
> +  It is possible, if a large enough number of overlapping ranges are
> +  registered and the list of invalidated events is busy enough and
> +  ignored long enough, to cause the kernel to run out of memory.
> +  Because this situation is unlikely to occur, the event queue size
> +  is not bounded in order to avoid dropping events if the queue grows
> +  beyond set bounds.
> +
> +  Userspace retrieves queued events via read() on the ummunotify file
> +  descriptor; a buffer that is at least as big as struct
> +  ummunotify_event should be used to retrieve event reports, and if a
> +  larger buffer is passed to read(), multiple reports will be returned
> +  (if available).
> +
> +  If the ummunotify file descriptor is in blocking mode, a read() call
> +  will wait for an event report to be available.  Userspace may also
> +  set the ummunotify file descriptor to non-blocking mode and use all
> +  standard ways of waiting for data to be available on the ummunotify
> +  file descriptor, including epoll/poll()/select() and SIGIO.
> +
> +  The format of event reports is:
> +
> +       struct ummunotify_event {
> +               __u32   type;
> +               __u32   flags;
> +               __u64   hint_start;
> +               __u64   hint_end;
> +               __u64   user_cookie_counter;
> +       };
> +
> +  where the type field is either UMMUNOTIFY_EVENT_TYPE_INVAL or
> +  UMMUNOTIFY_EVENT_TYPE_LAST.  Events of type INVAL describe
> +  invalidation events as follows: user_cookie_counter contains the
> +  cookie passed in when userspace registered the range that the event
> +  is for.  hint_start and hint_end contain the start address and end
> +  address that were invalidated.
> +
> +  The flags word contains bit flags, with only UMMUNOTIFY_EVENT_FLAG_HINT
> +  defined at the moment.  If HINT is set, then the invalidation event
> +  invalidated less than the full address range and the kernel returns
> +  the exact range invalidated; if HINT is not sent then hint_start and
> +  hint_end are set to the original range registered by userspace.
> +  (HINT will not be set if, for example, multiple events invalidated
> +  disjoint parts of the range and so a single start/end pair cannot
> +  represent the parts of the range that were invalidated)
> +
> +  If the event type is LAST, then the read operation has emptied the
> +  list of invalidated regions, and the flags, hint_start and hint_end
> +  fields are not used.  user_cookie_counter holds the value of the
> +  kernel's generation counter (see below of more details) when the
> +  empty list occurred.
> +
> +Generation Count
> +
> +  Userspace may mmap() a page on a ummunotify file descriptor via
> +
> +       mmap(NULL, sizeof (__u64), PROT_READ, MAP_SHARED, ummunotify_fd, 0);
> +
> +  to get a read-only mapping of the kernel's 64-bit generation
> +  counter.  The kernel will increment this generation counter each
> +  time an event report is queued.
> +
> +  Userspace can use the generation counter as a quick check to avoid
> +  system calls; if the value read from the mapped kernel counter is
> +  still equal to the value returned in user_cookie_counter for the
> +  most recent LAST event retrieved, then no further events have been
> +  queued and there is no need to try a read() on the ummunotify file
> +  descriptor.
> diff --git a/Documentation/ummunotify/umn-test.c
> b/Documentation/ummunotify/umn-test.c
> new file mode 100644
> index 0000000..143db2c
> --- /dev/null
> +++ b/Documentation/ummunotify/umn-test.c
> @@ -0,0 +1,200 @@
> +/*
> + * Copyright (c) 2009 Cisco Systems.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License version
> + * 2 as published by the Free Software Foundation.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +#include <stdint.h>
> +#include <fcntl.h>
> +#include <stdio.h>
> +#include <unistd.h>
> +
> +#include <linux/ummunotify.h>
> +
> +#include <sys/mman.h>
> +#include <sys/stat.h>
> +#include <sys/types.h>
> +#include <sys/ioctl.h>
> +
> +#define UMN_TEST_COOKIE 123
> +
> +static int             umn_fd;
> +static volatile __u64  *umn_counter;
> +
> +static int umn_init(void)
> +{
> +       __u32 flags;
> +
> +       umn_fd = open("/dev/ummunotify", O_RDONLY);
> +       if (umn_fd < 0) {
> +               perror("open");
> +               return 1;
> +       }
> +
> +       if (ioctl(umn_fd, UMMUNOTIFY_EXCHANGE_FEATURES, &flags)) {
> +               perror("exchange ioctl");
> +               return 1;
> +       }
> +
> +       printf("kernel feature flags: 0x%08x\n", flags);
> +
> +       umn_counter = mmap(NULL, sizeof *umn_counter, PROT_READ,
> +                          MAP_SHARED, umn_fd, 0);
> +       if (umn_counter == MAP_FAILED) {
> +               perror("mmap");
> +               return 1;
> +       }
> +
> +       return 0;
> +}
> +
> +static int umn_register(void *buf, size_t size, __u64 cookie)
> +{
> +       struct ummunotify_register_ioctl r = {
> +               .start          = (unsigned long) buf,
> +               .end            = (unsigned long) buf + size,
> +               .user_cookie    = cookie,
> +       };
> +
> +       if (ioctl(umn_fd, UMMUNOTIFY_REGISTER_REGION, &r)) {
> +               perror("register ioctl");
> +               return 1;
> +       }
> +
> +       return 0;
> +}
> +
> +static int umn_unregister(__u64 cookie)
> +{
> +       if (ioctl(umn_fd, UMMUNOTIFY_UNREGISTER_REGION, &cookie)) {
> +               perror("unregister ioctl");
> +               return 1;
> +       }
> +
> +       return 0;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +       int                     page_size;
> +       __u64                   old_counter;
> +       void                   *t;
> +       int                     got_it;
> +
> +       if (umn_init())
> +               return 1;
> +
> +       printf("\n");
> +
> +       old_counter = *umn_counter;
> +       if (old_counter != 0) {
> +               fprintf(stderr, "counter = %lld (expected 0)\n",
> old_counter);
> +               return 1;
> +       }
> +
> +       page_size = sysconf(_SC_PAGESIZE);
> +       t = mmap(NULL, 3 * page_size, PROT_READ,
> +                MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
> +
> +       if (umn_register(t, 3 * page_size, UMN_TEST_COOKIE))
> +               return 1;
> +
> +       munmap(t + page_size, page_size);
> +
> +       old_counter = *umn_counter;
> +       if (old_counter != 1) {
> +               fprintf(stderr, "counter = %lld (expected 1)\n",
> old_counter);
> +               return 1;
> +       }
> +
> +       got_it = 0;
> +       while (1) {
> +               struct ummunotify_event ev;
> +               int                     len;
> +
> +               len = read(umn_fd, &ev, sizeof ev);
> +               if (len < 0) {
> +                       perror("read event");
> +                       return 1;
> +               }
> +               if (len != sizeof ev) {
> +                       fprintf(stderr, "Read gave %d bytes (!= event size
> %zd)\n",
> +                               len, sizeof ev);
> +                       return 1;
> +               }
> +
> +               switch (ev.type) {
> +               case UMMUNOTIFY_EVENT_TYPE_INVAL:
> +                       if (got_it) {
> +                               fprintf(stderr, "Extra invalidate event\n");
> +                               return 1;
> +                       }
> +                       if (ev.user_cookie_counter != UMN_TEST_COOKIE) {
> +                               fprintf(stderr, "Invalidate event for cookie
> %lld (expected %d)\n",
> +                                       ev.user_cookie_counter,
> +                                       UMN_TEST_COOKIE);
> +                               return 1;
> +                       }
> +
> +                       printf("Invalidate event:\tcookie %lld\n",
> +                              ev.user_cookie_counter);
> +
> +                       if (!(ev.flags & UMMUNOTIFY_EVENT_FLAG_HINT)) {
> +                               fprintf(stderr, "Hint flag not set\n");
> +                               return 1;
> +                       }
> +
> +                       if (ev.hint_start != (uintptr_t) t + page_size ||
> +                           ev.hint_end != (uintptr_t) t + page_size * 2) {
> +                               fprintf(stderr, "Got hint %llx..%llx,
> expected %p..%p\n",
> +                                       ev.hint_start, ev.hint_end,
> +                                       t + page_size, t + page_size * 2);
> +                               return 1;
> +                       }
> +
> +                       printf("\t\t\thint %llx...%llx\n",
> +                              ev.hint_start, ev.hint_end);
> +
> +                       got_it = 1;
> +                       break;
> +
> +               case UMMUNOTIFY_EVENT_TYPE_LAST:
> +                       if (!got_it) {
> +                               fprintf(stderr, "Last event without
> invalidate event\n");
> +                               return 1;
> +                       }
> +
> +                       printf("Empty event:\t\tcounter %lld\n",
> +                              ev.user_cookie_counter);
> +                       goto done;
> +
> +               default:
> +                       fprintf(stderr, "unknown event type %d\n",
> +                               ev.type);
> +                       return 1;
> +               }
> +       }
> +
> +done:
> +       umn_unregister(123);
> +       munmap(t, page_size);
> +
> +       old_counter = *umn_counter;
> +       if (old_counter != 1) {
> +               fprintf(stderr, "counter = %lld (expected 1)\n",
> old_counter);
> +               return 1;
> +       }
> +
> +       return 0;
> +}
> diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
> index 3141dd3..cf26019 100644
> --- a/drivers/char/Kconfig
> +++ b/drivers/char/Kconfig
> @@ -1111,6 +1111,18 @@ config DEVPORT
>         depends on ISA || PCI
>         default y
>
> +config UMMUNOTIFY
> +       tristate "Userspace MMU notifications"
> +       select MMU_NOTIFIER
> +       help
> +         The ummunotify (userspace MMU notification) driver creates a
> +         character device that can be used by userspace libraries to
> +         get notifications when an application's memory mapping
> +         changed.  This is used, for example, by RDMA libraries to
> +         improve the reliability of memory registration caching, since
> +         the kernel's MMU notifications can be used to know precisely
> +         when to shoot down a cached registration.
> +
>  source "drivers/s390/char/Kconfig"
>
>  endmenu
> diff --git a/drivers/char/Makefile b/drivers/char/Makefile
> index f957edf..521e5de 100644
> --- a/drivers/char/Makefile
> +++ b/drivers/char/Makefile
> @@ -97,6 +97,7 @@ obj-$(CONFIG_NSC_GPIO)                += nsc_gpio.o
>  obj-$(CONFIG_CS5535_GPIO)      += cs5535_gpio.o
>  obj-$(CONFIG_GPIO_TB0219)      += tb0219.o
>  obj-$(CONFIG_TELCLOCK)         += tlclk.o
> +obj-$(CONFIG_UMMUNOTIFY)       += ummunotify.o
>
>  obj-$(CONFIG_MWAVE)            += mwave/
>  obj-$(CONFIG_AGP)              += agp/
> diff --git a/drivers/char/ummunotify.c b/drivers/char/ummunotify.c
> new file mode 100644
> index 0000000..c14df3f
> --- /dev/null
> +++ b/drivers/char/ummunotify.c
> @@ -0,0 +1,567 @@
> +/*
> + * Copyright (c) 2009 Cisco Systems.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License version
> + * 2 as published by the Free Software Foundation.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/init.h>
> +#include <linux/list.h>
> +#include <linux/miscdevice.h>
> +#include <linux/mm.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/module.h>
> +#include <linux/poll.h>
> +#include <linux/rbtree.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/uaccess.h>
> +#include <linux/ummunotify.h>
> +
> +#include <asm/cacheflush.h>
> +
> +MODULE_AUTHOR("Roland Dreier");
> +MODULE_DESCRIPTION("Userspace MMU notifiers");
> +MODULE_LICENSE("GPL v2");
> +
> +/*
> + * Information about an address range userspace has asked us to watch.
> + *
> + * user_cookie: Opaque cookie given to us when userspace registers the
> + *   address range.
> + *
> + * start, end: Address range; start is inclusive, end is exclusive.
> + *
> + * hint_start, hint_end: If a single MMU notification event
> + *   invalidates the address range, we hold the actual range of
> + *   addresses that were invalidated (and set UMMUNOTIFY_FLAG_HINT).
> + *   If another event hits this range before userspace reads the
> + *   event, we give up and don't try to keep track of which subsets
> + *   got invalidated.
> + *
> + * flags: Holds the INVALID flag for ranges that are on the invalid
> + *   list and/or the HINT flag for ranges where the hint range holds
> + *   good information.
> + *
> + * node: Used to put the range into an rbtree we use to be able to
> + *   scan address ranges in order.
> + *
> + * list: Used to put the range on the invalid list when an MMU
> + *   notification event hits the range.
> + */
> +enum {
> +       UMMUNOTIFY_FLAG_INVALID = 1,
> +       UMMUNOTIFY_FLAG_HINT    = 2,
> +};
> +
> +struct ummunotify_reg {
> +       u64                     user_cookie;
> +       unsigned long           start;
> +       unsigned long           end;
> +       unsigned long           hint_start;
> +       unsigned long           hint_end;
> +       unsigned long           flags;
> +       struct rb_node          node;
> +       struct list_head        list;
> +};
> +
> +/*
> + * Context attached to each file that userspace opens.
> + *
> + * mmu_notifier: MMU notifier registered for this context.
> + *
> + * mm: mm_struct for process that created the context; we use this to
> + *   hold a reference to the mm to make sure it doesn't go away until
> + *   we're done with it.
> + *
> + * reg_tree: RB tree of address ranges being watched, sorted by start
> + *   address.
> + *
> + * invalid_list: List of address ranges that have been invalidated by
> + *   MMU notification events; as userspace reads events, the address
> + *   range corresponding to the event is removed from the list.
> + *
> + * counter: Page that can be mapped read-only by userspace, which
> + *   holds a generation count that is incremented each time an event
> + *   occurs.
> + *
> + * lock: Spinlock used to protect all context.
> + *
> + * read_wait: Wait queue used to wait for data to become available in
> + *   blocking read()s.
> + *
> + * async_queue: Used to implement fasync().
> + *
> + * need_empty: Set when userspace reads an invalidation event, so that
> + *   read() knows it must generate an "empty" event when userspace
> + *   drains the invalid_list.
> + *
> + * used: Set after userspace does anything with the file, so that the
> + *   "exchange flags" ioctl() knows it's too late to change anything.
> + */
> +struct ummunotify_file {
> +       struct mmu_notifier     mmu_notifier;
> +       struct mm_struct       *mm;
> +       struct rb_root          reg_tree;
> +       struct list_head        invalid_list;
> +       u64                    *counter;
> +       spinlock_t              lock;
> +       wait_queue_head_t       read_wait;
> +       struct fasync_struct   *async_queue;
> +       int                     need_empty;
> +       int                     used;
> +};
> +
> +static void ummunotify_handle_notify(struct mmu_notifier *mn,
> +                                    unsigned long start, unsigned long end)
> +{
> +       struct ummunotify_file *priv =
> +               container_of(mn, struct ummunotify_file, mmu_notifier);
> +       struct rb_node *n;
> +       struct ummunotify_reg *reg;
> +       unsigned long flags;
> +       int hit = 0;
> +
> +       spin_lock_irqsave(&priv->lock, flags);
> +
> +       for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
> +               reg = rb_entry(n, struct ummunotify_reg, node);
> +
> +               /*
> +                * Ranges overlap if they're not disjoint; and they're
> +                * disjoint if the end of one is before the start of
> +                * the other one.  So if both disjointness comparisons
> +                * fail then the ranges overlap.
> +                *
> +                * Since we keep the tree of regions we're watching
> +                * sorted by start address, we can end this loop as
> +                * soon as we hit a region that starts past the end of
> +                * the range for the event we're handling.
> +                */
> +               if (reg->start >= end)
> +                       break;
> +
> +               /*
> +                * Just go to the next region if the start of the
> +                * range is after the end of the region -- there
> +                * might still be more overlapping ranges that have a
> +                * greater start.
> +                */
> +               if (start >= reg->end)
> +                       continue;
> +
> +               hit = 1;
> +
> +               if (test_and_set_bit(UMMUNOTIFY_FLAG_INVALID, &reg->flags))
> {
> +                       /* Already on invalid list */
> +                       clear_bit(UMMUNOTIFY_FLAG_HINT, &reg->flags);
> +               } else {
> +                       list_add_tail(&reg->list, &priv->invalid_list);
> +                       set_bit(UMMUNOTIFY_FLAG_HINT, &reg->flags);
> +                       reg->hint_start = start;
> +                       reg->hint_end   = end;
> +               }
> +       }
> +
> +       if (hit) {
> +               ++(*priv->counter);
> +               flush_dcache_page(virt_to_page(priv->counter));
> +               wake_up_interruptible(&priv->read_wait);
> +               kill_fasync(&priv->async_queue, SIGIO, POLL_IN);
> +       }
> +
> +       spin_unlock_irqrestore(&priv->lock, flags);
> +}
> +
> +static void ummunotify_invalidate_page(struct mmu_notifier *mn,
> +                                      struct mm_struct *mm,
> +                                      unsigned long addr)
> +{
> +       ummunotify_handle_notify(mn, addr, addr + PAGE_SIZE);
> +}
> +
> +static void ummunotify_invalidate_range_start(struct mmu_notifier *mn,
> +                                             struct mm_struct *mm,
> +                                             unsigned long start,
> +                                             unsigned long end)
> +{
> +       ummunotify_handle_notify(mn, start, end);
> +}
> +
> +static const struct mmu_notifier_ops ummunotify_mmu_notifier_ops = {
> +       .invalidate_page        = ummunotify_invalidate_page,
> +       .invalidate_range_start = ummunotify_invalidate_range_start,
> +};
> +
> +static int ummunotify_open(struct inode *inode, struct file *filp)
> +{
> +       struct ummunotify_file *priv;
> +       int ret;
> +
> +       if (filp->f_mode & FMODE_WRITE)
> +               return -EINVAL;
> +
> +       priv = kmalloc(sizeof *priv, GFP_KERNEL);
> +       if (!priv)
> +               return -ENOMEM;
> +
> +       priv->counter = (void *) get_zeroed_page(GFP_KERNEL);
> +       if (!priv->counter) {
> +               ret = -ENOMEM;
> +               goto err;
> +       }
> +
> +       priv->reg_tree = RB_ROOT;
> +       INIT_LIST_HEAD(&priv->invalid_list);
> +       spin_lock_init(&priv->lock);
> +       init_waitqueue_head(&priv->read_wait);
> +       priv->async_queue = NULL;
> +       priv->need_empty  = 0;
> +       priv->used        = 0;
> +
> +       priv->mmu_notifier.ops = &ummunotify_mmu_notifier_ops;
> +       /*
> +        * Register notifier last, since notifications can occur as
> +        * soon as we register....
> +        */
> +       ret = mmu_notifier_register(&priv->mmu_notifier, current->mm);
> +       if (ret)
> +               goto err_page;
> +
> +       priv->mm = current->mm;
> +       atomic_inc(&priv->mm->mm_count);
> +
> +       filp->private_data = priv;
> +
> +       return 0;
> +
> +err_page:
> +       free_page((unsigned long) priv->counter);
> +
> +err:
> +       kfree(priv);
> +       return ret;
> +}
> +
> +static int ummunotify_close(struct inode *inode, struct file *filp)
> +{
> +       struct ummunotify_file *priv = filp->private_data;
> +       struct rb_node *n;
> +       struct ummunotify_reg *reg;
> +
> +       mmu_notifier_unregister(&priv->mmu_notifier, priv->mm);
> +       mmdrop(priv->mm);
> +       free_page((unsigned long) priv->counter);
> +
> +       for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
> +               reg = rb_entry(n, struct ummunotify_reg, node);
> +               kfree(reg);
> +       }
> +
> +       kfree(priv);
> +
> +       return 0;
> +}
> +
> +static bool ummunotify_readable(struct ummunotify_file *priv)
> +{
> +       return priv->need_empty || !list_empty(&priv->invalid_list);
> +}
> +
> +static ssize_t ummunotify_read(struct file *filp, char __user *buf,
> +                              size_t count, loff_t *pos)
> +{
> +       struct ummunotify_file *priv = filp->private_data;
> +       struct ummunotify_reg *reg;
> +       ssize_t ret;
> +       struct ummunotify_event *events;
> +       int max;
> +       int n;
> +
> +       priv->used = 1;
> +
> +       events = (void *) get_zeroed_page(GFP_KERNEL);
> +       if (!events) {
> +               ret = -ENOMEM;
> +               goto out;
> +       }
> +
> +       spin_lock_irq(&priv->lock);
> +
> +       while (!ummunotify_readable(priv)) {
> +               spin_unlock_irq(&priv->lock);
> +
> +               if (filp->f_flags & O_NONBLOCK) {
> +                       ret = -EAGAIN;
> +                       goto out;
> +               }
> +
> +               if (wait_event_interruptible(priv->read_wait,
> +                                            ummunotify_readable(priv))) {
> +                       ret = -ERESTARTSYS;
> +                       goto out;
> +               }
> +
> +               spin_lock_irq(&priv->lock);
> +       }
> +
> +       max = min_t(size_t, PAGE_SIZE, count) / sizeof *events;
> +
> +       for (n = 0; n < max; ++n) {
> +               if (list_empty(&priv->invalid_list)) {
> +                       events[n].type = UMMUNOTIFY_EVENT_TYPE_LAST;
> +                       events[n].user_cookie_counter = *priv->counter;
> +                       ++n;
> +                       priv->need_empty = 0;
> +                       break;
> +               }
> +
> +               reg = list_first_entry(&priv->invalid_list,
> +                                      struct ummunotify_reg, list);
> +
> +               events[n].type = UMMUNOTIFY_EVENT_TYPE_INVAL;
> +               if (test_bit(UMMUNOTIFY_FLAG_HINT, &reg->flags)) {
> +                       events[n].flags      = UMMUNOTIFY_EVENT_FLAG_HINT;
> +                       events[n].hint_start = max(reg->start,
> reg->hint_start);
> +                       events[n].hint_end   = min(reg->end, reg->hint_end);
> +               } else {
> +                       events[n].hint_start = reg->start;
> +                       events[n].hint_end   = reg->end;
> +               }
> +               events[n].user_cookie_counter = reg->user_cookie;
> +
> +               list_del(&reg->list);
> +               reg->flags = 0;
> +               priv->need_empty = 1;
> +       }
> +
> +       spin_unlock_irq(&priv->lock);
> +
> +       if (copy_to_user(buf, events, n * sizeof *events))
> +               ret = -EFAULT;
> +       else
> +               ret = n * sizeof *events;
> +
> +out:
> +       free_page((unsigned long) events);
> +       return ret;
> +}
> +
> +static unsigned int ummunotify_poll(struct file *filp,
> +                                   struct poll_table_struct *wait)
> +{
> +       struct ummunotify_file *priv = filp->private_data;
> +
> +       poll_wait(filp, &priv->read_wait, wait);
> +
> +       return ummunotify_readable(priv) ? (POLLIN | POLLRDNORM) : 0;
> +}
> +
> +static long ummunotify_exchange_features(struct ummunotify_file *priv,
> +                                        __u32 __user *arg)
> +{
> +       u32 feature_mask;
> +
> +       if (priv->used)
> +               return -EINVAL;
> +
> +       priv->used = 1;
> +
> +       if (copy_from_user(&feature_mask, arg, sizeof(feature_mask)))
> +               return -EFAULT;
> +
> +       /* No extensions defined at present. */
> +       feature_mask = 0;
> +
> +       if (copy_to_user(arg, &feature_mask, sizeof(feature_mask)))
> +               return -EFAULT;
> +
> +       return 0;
> +}
> +
> +static long ummunotify_register_region(struct ummunotify_file *priv,
> +                                      void __user *arg)
> +{
> +       struct ummunotify_register_ioctl parm;
> +       struct ummunotify_reg *reg, *treg;
> +       struct rb_node **n = &priv->reg_tree.rb_node;
> +       struct rb_node *pn;
> +       int ret = 0;
> +
> +       if (copy_from_user(&parm, arg, sizeof parm))
> +               return -EFAULT;
> +
> +       priv->used = 1;
> +
> +       reg = kmalloc(sizeof *reg, GFP_KERNEL);
> +       if (!reg)
> +               return -ENOMEM;
> +
> +       reg->user_cookie        = parm.user_cookie;
> +       reg->start              = parm.start;
> +       reg->end                = parm.end;
> +       reg->flags              = 0;
> +
> +       spin_lock_irq(&priv->lock);
> +
> +       for (pn = rb_first(&priv->reg_tree); pn; pn = rb_next(pn)) {
> +               treg = rb_entry(pn, struct ummunotify_reg, node);
> +
> +               if (treg->user_cookie == parm.user_cookie) {
> +                       kfree(reg);
> +                       ret = -EINVAL;
> +                       goto out;
> +               }
> +       }
> +
> +       pn = NULL;
> +       while (*n) {
> +               pn = *n;
> +               treg = rb_entry(pn, struct ummunotify_reg, node);
> +
> +               if (reg->start <= treg->start)
> +                       n = &pn->rb_left;
> +               else
> +                       n = &pn->rb_right;
> +       }
> +
> +       rb_link_node(&reg->node, pn, n);
> +       rb_insert_color(&reg->node, &priv->reg_tree);
> +
> +out:
> +       spin_unlock_irq(&priv->lock);
> +
> +       return ret;
> +}
> +
> +static long ummunotify_unregister_region(struct ummunotify_file *priv,
> +                                        __u64 __user *arg)
> +{
> +       u64 user_cookie;
> +       struct rb_node *n;
> +       struct ummunotify_reg *reg;
> +       int ret = -EINVAL;
> +
> +       if (copy_from_user(&user_cookie, arg, sizeof(user_cookie)))
> +               return -EFAULT;
> +
> +       spin_lock_irq(&priv->lock);
> +
> +       for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
> +               reg = rb_entry(n, struct ummunotify_reg, node);
> +
> +               if (reg->user_cookie == user_cookie) {
> +                       rb_erase(n, &priv->reg_tree);
> +                       if (test_bit(UMMUNOTIFY_FLAG_INVALID, &reg->flags))
> +                               list_del(&reg->list);
> +                       kfree(reg);
> +                       ret = 0;
> +                       break;
> +               }
> +       }
> +
> +       spin_unlock_irq(&priv->lock);
> +
> +       return ret;
> +}
> +
> +static long ummunotify_ioctl(struct file *filp, unsigned int cmd,
> +                            unsigned long arg)
> +{
> +       struct ummunotify_file *priv = filp->private_data;
> +       void __user *argp = (void __user *) arg;
> +
> +       switch (cmd) {
> +       case UMMUNOTIFY_EXCHANGE_FEATURES:
> +               return ummunotify_exchange_features(priv, argp);
> +       case UMMUNOTIFY_REGISTER_REGION:
> +               return ummunotify_register_region(priv, argp);
> +       case UMMUNOTIFY_UNREGISTER_REGION:
> +               return ummunotify_unregister_region(priv, argp);
> +       default:
> +               return -ENOIOCTLCMD;
> +       }
> +}
> +
> +static int ummunotify_fault(struct vm_area_struct *vma, struct vm_fault
> *vmf)
> +{
> +       struct ummunotify_file *priv = vma->vm_private_data;
> +
> +       if (vmf->pgoff != 0)
> +               return VM_FAULT_SIGBUS;
> +
> +       vmf->page = virt_to_page(priv->counter);
> +       get_page(vmf->page);
> +
> +       return 0;
> +
> +}
> +
> +static struct vm_operations_struct ummunotify_vm_ops = {
> +       .fault          = ummunotify_fault,
> +};
> +
> +static int ummunotify_mmap(struct file *filp, struct vm_area_struct *vma)
> +{
> +       struct ummunotify_file *priv = filp->private_data;
> +
> +       if (vma->vm_end - vma->vm_start != PAGE_SIZE || vma->vm_pgoff != 0)
> +               return -EINVAL;
> +
> +       vma->vm_ops             = &ummunotify_vm_ops;
> +       vma->vm_private_data    = priv;
> +
> +       return 0;
> +}
> +
> +static int ummunotify_fasync(int fd, struct file *filp, int on)
> +{
> +       struct ummunotify_file *priv = filp->private_data;
> +
> +       return fasync_helper(fd, filp, on, &priv->async_queue);
> +}
> +
> +static const struct file_operations ummunotify_fops = {
> +       .owner          = THIS_MODULE,
> +       .open           = ummunotify_open,
> +       .release        = ummunotify_close,
> +       .read           = ummunotify_read,
> +       .poll           = ummunotify_poll,
> +       .unlocked_ioctl = ummunotify_ioctl,
> +#ifdef CONFIG_COMPAT
> +       .compat_ioctl   = ummunotify_ioctl,
> +#endif
> +       .mmap           = ummunotify_mmap,
> +       .fasync         = ummunotify_fasync,
> +};
> +
> +static struct miscdevice ummunotify_misc = {
> +       .minor  = MISC_DYNAMIC_MINOR,
> +       .name   = "ummunotify",
> +       .fops   = &ummunotify_fops,
> +};
> +
> +static int __init ummunotify_init(void)
> +{
> +       return misc_register(&ummunotify_misc);
> +}
> +
> +static void __exit ummunotify_cleanup(void)
> +{
> +       misc_deregister(&ummunotify_misc);
> +}
> +
> +module_init(ummunotify_init);
> +module_exit(ummunotify_cleanup);
> diff --git a/include/linux/Kbuild b/include/linux/Kbuild
> index e2ea0b2..e086b39 100644
> --- a/include/linux/Kbuild
> +++ b/include/linux/Kbuild
> @@ -163,6 +163,7 @@ header-y += tipc_config.h
>  header-y += toshiba.h
>  header-y += udf_fs_i.h
>  header-y += ultrasound.h
> +header-y += ummunotify.h
>  header-y += un.h
>  header-y += utime.h
>  header-y += veth.h
> diff --git a/include/linux/ummunotify.h b/include/linux/ummunotify.h
> new file mode 100644
> index 0000000..21b0d03
> --- /dev/null
> +++ b/include/linux/ummunotify.h
> @@ -0,0 +1,121 @@
> +/*
> + * Copyright (c) 2009 Cisco Systems.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License version
> + * 2 as published by the Free Software Foundation.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +#ifndef _LINUX_UMMUNOTIFY_H
> +#define _LINUX_UMMUNOTIFY_H
> +
> +#include <linux/types.h>
> +#include <linux/ioctl.h>
> +
> +/*
> + * Ummunotify relays MMU notifier events to userspace.  A userspace
> + * process uses it by opening /dev/ummunotify, which returns a file
> + * descriptor.  Interest in address ranges is registered using ioctl()
> + * and MMU notifier events are retrieved using read(), as described in
> + * more detail below.
> + *
> + * Userspace can also mmap() a single read-only page at offset 0 on
> + * this file descriptor.  This page contains (at offest 0) a single
> + * 64-bit generation counter that the kernel increments each time an
> + * MMU notifier event occurs.  Userspace can use this to very quickly
> + * check if there are any events to retrieve without needing to do a
> + * system call.
> + */
> +
> +/*
> + * struct ummunotify_register_ioctl describes an address range from
> + * start to end (including start but not including end) to be
> + * monitored.  user_cookie is an opaque handle that userspace assigns,
> + * and which is used to unregister.  flags and reserved are currently
> + * unused and should be set to 0 for forward compatibility.
> + */
> +struct ummunotify_register_ioctl {
> +       __u64   start;
> +       __u64   end;
> +       __u64   user_cookie;
> +       __u32   flags;
> +       __u32   reserved;
> +};
> +
> +#define UMMUNOTIFY_MAGIC               'U'
> +
> +/*
> + * Forward compatibility: Userspace passes in a 32-bit feature mask
> + * with feature flags set indicating which extensions it wishes to
> + * use.  The kernel will return a feature mask with the bits of
> + * userspace's mask that the kernel implements; from that point on
> + * both userspace and the kernel should behave as described by the
> + * kernel's feature mask.
> + *
> + * If userspace does not perform a UMMUNOTIFY_EXCHANGE_FEATURES ioctl,
> + * then the kernel will use a feature mask of 0.
> + *
> + * No feature flags are currently defined, so the kernel will always
> + * return a feature mask of 0 at present.
> + */
> +#define UMMUNOTIFY_EXCHANGE_FEATURES   _IOWR(UMMUNOTIFY_MAGIC, 1, __u32)
> +
> +/*
> + * Register interest in an address range; userspace should pass in a
> + * struct ummunotify_register_ioctl describing the region.
> + */
> +#define UMMUNOTIFY_REGISTER_REGION     _IOW(UMMUNOTIFY_MAGIC, 2, \
> +                                            struct
> ummunotify_register_ioctl)
> +/*
> + * Unregister interest in an address range; userspace should pass in
> + * the user_cookie value that was used to register the address range.
> + * No events for the address range will be reported once it is
> + * unregistered.
> + */
> +#define UMMUNOTIFY_UNREGISTER_REGION   _IOW(UMMUNOTIFY_MAGIC, 3, __u64)
> +
> +/*
> + * Invalidation events are returned whenever the kernel changes the
> + * mapping for a monitored address.  These events are retrieved by
> + * read() on the ummunotify file descriptor, which will fill the
> + * read() buffer with struct ummunotify_event.
> + *
> + * If type field is INVAL, then user_cookie_counter holds the
> + * user_cookie for the region being reported; if the HINT flag is set
> + * then hint_start/hint_end hold the start and end of the mapping that
> + * was invalidated.  (If HINT is not set, then multiple events
> + * invalidated parts of the registered range and hint_start/hint_end
> + * and set to the start/end of the whole registered range)
> + *
> + * If type is LAST, then the read operation has emptied the list of
> + * invalidated regions, and user_cookie_counter holds the value of the
> + * kernel's generation counter when the empty list occurred.  The
> + * other fields are not filled in for this event.
> + */
> +enum {
> +       UMMUNOTIFY_EVENT_TYPE_INVAL     = 0,
> +       UMMUNOTIFY_EVENT_TYPE_LAST      = 1,
> +};
> +
> +enum {
> +       UMMUNOTIFY_EVENT_FLAG_HINT      = 1 << 0,
> +};
> +
> +struct ummunotify_event {
> +       __u32   type;
> +       __u32   flags;
> +       __u64   hint_start;
> +       __u64   hint_end;
> +       __u64   user_cookie_counter;
> +};
> +
> +#endif /* _LINUX_UMMUNOTIFY_H */
> --
> 1.6.3.3
>
>



--
Sayantan Sur

Research Scientist
Department of Computer Science
The Ohio State University.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/