Re: [-stable 3.8.1 performance regression] madvisePOSIX_FADV_DONTNEED

From: Mathieu Desnoyers
Date: Wed Jun 19 2013 - 15:25:22 EST


* Mel Gorman (mgorman@xxxxxxx) wrote:
> On Tue, Jun 18, 2013 at 10:29:26AM +0100, Mel Gorman wrote:
> > On Mon, Jun 17, 2013 at 02:24:59PM -0700, Andrew Morton wrote:
> > > On Mon, 17 Jun 2013 10:13:57 -0400 Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
> > >
> > > > Hi,
> > > >
> > > > CCing lkml on this,
> > > >
> > > > * Yannick Brosseau (yannick.brosseau@xxxxxxxxx) wrote:
> > > > > Hi all,
> > > > >
> > > > > We discovered a performance regression in recent kernels with LTTng
> > > > > related to the use of fadvise DONTNEED.
> > > > > A call to this syscall is present in the LTTng consumer.
> > > > >
> > > > > The following kernel commit cause the call to fadvise to be sometime
> > > > > really slower.
> > > > >
> > > > > Kernel commit info:
> > > > > mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard
> > > > > all pages
> > > > > main tree: (since 3.9-rc1)
> > > > > commit 67d46b296a1ba1477c0df8ff3bc5e0167a0b0732
> > > > > stable tree: (since 3.8.1)
> > > > > https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit?id=bb01afe62feca1e7cdca60696f8b074416b0910d
> > > > >
> > > > > On the workload test, we observe that the call to fadvise takes about
> > > > > 4-5 us before this patch is applied. After applying the patch, The
> > > > > syscall now takes values from 5 us up to 4 ms (4000 us) sometime. The
> > > > > effect on lttng is that the consumer is frozen for this long period
> > > > > which leads to dropped event in the trace.
> > >
> > > That change wasn't terribly efficient - if there are any unpopulated
> > > pages in the range (which is quite likely), fadvise() will now always
> > > call invalidate_mapping_pages() a second time.
> > >
> >
> > I did not view the path as being performance critical and did not anticipate
> > a delay that severe.
>
> Which I should have, schedule_on_each_cpu is documented to be slow.
>
> > The original test case as well was based on
> > sequential IO as part of a backup so I was also generally expecting the
> > range to be populated. I looked at the other users of lru_add_drain_all()
> > but there are fairly few. compaction uses them but only when used via sysfs
> > or proc. ksm uses it but it's not likely to be that noticable. mlock uses
> > it but it's unlikely it is being called frequently so I'm not going to
> > worry performance of lru_add_drain_all() in general. I'll look closer at
> > properly detecting when it's necessarily to call in the fadvise case.
> >
>
> This compile-only tested prototype should detect remaining pages in the rage
> that were not invalidated. This will at least detect unpopulated pages but
> whether it has any impact depends on what lttng is invalidating. If it's
> invalidating a range of per-cpu traces then I doubt this will work because
> there will always be remaining pages. Similarly I wonder if passing down
> the mapping will really help if a large number of cpus are tracing as we
> end up scheduling work on every CPU regardless.

Here is how LTTng consumerd is using POSIX_FADV_DONTNEED. Please let me
know if we are doing something stupid. ;-)

The LTTng consumerd deals with trace packets, which consists of a set of
pages. I'll just discuss how the pages are moved around, leaving out
discussion about synchronization between producer/consumer.

Whenever the Nth trace packet is ready to be written to disk:

1) splice the entire set of pages of trace packet N to disk through a
pipe,
2) sync_file_range on trace packet N-1 with the following flags:
SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER
so we wait (blocking) on the _previous_ trace packet to be completely
written to disk.
3) posix_fadvise POSIX_FADV_DONTNEED on trace packet N-1.

There are a couple of comments in my consumerd code explaining why we do
this sequence of steps:

/*
* Give hints to the kernel about how we access the file:
* POSIX_FADV_DONTNEED : we won't re-access data in a near future after
* we write it.
*
* We need to call fadvise again after the file grows because the
* kernel does not seem to apply fadvise to non-existing parts of the
* file.
*
* Call fadvise _after_ having waited for the page writeback to
* complete because the dirty page writeback semantic is not well
* defined. So it can be expected to lead to lower throughput in
* streaming.
*/

Currently, the lttng-consumerd is single-threaded, but we plan to
re-introduce multi-threading, and per-cpu affinity, in a near future.

Yannick will try your patch tomorrow and report whether it improves
performance or not,

Thanks!

Mathieu

>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 2c28271..e2bb47e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2176,6 +2176,8 @@ extern int invalidate_partition(struct gendisk *, int);
> #endif
> unsigned long invalidate_mapping_pages(struct address_space *mapping,
> pgoff_t start, pgoff_t end);
> +unsigned long invalidate_mapping_pages_check(struct address_space *mapping,
> + pgoff_t start, pgoff_t end);
>
> static inline void invalidate_remote_inode(struct inode *inode)
> {
> diff --git a/mm/fadvise.c b/mm/fadvise.c
> index 7e09268..0579e60 100644
> --- a/mm/fadvise.c
> +++ b/mm/fadvise.c
> @@ -122,7 +122,9 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
> end_index = (endbyte >> PAGE_CACHE_SHIFT);
>
> if (end_index >= start_index) {
> - unsigned long count = invalidate_mapping_pages(mapping,
> + unsigned long nr_remaining;
> +
> + nr_remaining = invalidate_mapping_pages_check(mapping,
> start_index, end_index);
>
> /*
> @@ -131,7 +133,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
> * a per-cpu pagevec for a remote CPU. Drain all
> * pagevecs and try again.
> */
> - if (count < (end_index - start_index + 1)) {
> + if (nr_remaining) {
> lru_add_drain_all();
> invalidate_mapping_pages(mapping, start_index,
> end_index);
> diff --git a/mm/truncate.c b/mm/truncate.c
> index c75b736..86cfc2e 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -312,26 +312,16 @@ void truncate_inode_pages(struct address_space *mapping, loff_t lstart)
> }
> EXPORT_SYMBOL(truncate_inode_pages);
>
> -/**
> - * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
> - * @mapping: the address_space which holds the pages to invalidate
> - * @start: the offset 'from' which to invalidate
> - * @end: the offset 'to' which to invalidate (inclusive)
> - *
> - * This function only removes the unlocked pages, if you want to
> - * remove all the pages of one inode, you must call truncate_inode_pages.
> - *
> - * invalidate_mapping_pages() will not block on IO activity. It will not
> - * invalidate pages which are dirty, locked, under writeback or mapped into
> - * pagetables.
> - */
> -unsigned long invalidate_mapping_pages(struct address_space *mapping,
> - pgoff_t start, pgoff_t end)
> +static void __invalidate_mapping_pages(struct address_space *mapping,
> + pgoff_t start, pgoff_t end,
> + unsigned long *ret_nr_invalidated,
> + unsigned long *ret_nr_remaining)
> {
> struct pagevec pvec;
> pgoff_t index = start;
> unsigned long ret;
> - unsigned long count = 0;
> + unsigned long nr_invalidated = 0;
> + unsigned long nr_remaining = 0;
> int i;
>
> /*
> @@ -354,8 +344,10 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> if (index > end)
> break;
>
> - if (!trylock_page(page))
> + if (!trylock_page(page)) {
> + nr_remaining++;
> continue;
> + }
> WARN_ON(page->index != index);
> ret = invalidate_inode_page(page);
> unlock_page(page);
> @@ -365,17 +357,73 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> */
> if (!ret)
> deactivate_page(page);
> - count += ret;
> + else
> + nr_remaining++;
> + nr_invalidated += ret;
> }
> pagevec_release(&pvec);
> mem_cgroup_uncharge_end();
> cond_resched();
> index++;
> }
> - return count;
> +
> + *ret_nr_invalidated = nr_invalidated;
> + *ret_nr_remaining = nr_remaining;
> }
> EXPORT_SYMBOL(invalidate_mapping_pages);
>
> +/**
> + * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
> + * @mapping: the address_space which holds the pages to invalidate
> + * @start: the offset 'from' which to invalidate
> + * @end: the offset 'to' which to invalidate (inclusive)
> + *
> + * This function only removes the unlocked pages, if you want to
> + * remove all the pages of one inode, you must call truncate_inode_pages.
> + *
> + * invalidate_mapping_pages() will not block on IO activity. It will not
> + * invalidate pages which are dirty, locked, under writeback or mapped into
> + * pagetables.
> + *
> + * Returns the number of pages invalidated
> + */
> +unsigned long invalidate_mapping_pages(struct address_space *mapping,
> + pgoff_t start, pgoff_t end)
> +{
> + unsigned long nr_invalidated, nr_remaining;
> +
> + __invalidate_mapping_pages(mapping, start, end,
> + &nr_invalidated, &nr_remaining);
> +
> + return nr_invalidated;
> +}
> +
> +/**
> + * invalidate_mapping_pages_check - Invalidate all the unlocked pages of one inode and check for remaining pages.
> + * @mapping: the address_space which holds the pages to invalidate
> + * @start: the offset 'from' which to invalidate
> + * @end: the offset 'to' which to invalidate (inclusive)
> + *
> + * This function only removes the unlocked pages, if you want to
> + * remove all the pages of one inode, you must call truncate_inode_pages.
> + *
> + * invalidate_mapping_pages() will not block on IO activity. It will not
> + * invalidate pages which are dirty, locked, under writeback or mapped into
> + * pagetables.
> + *
> + * Returns the number of pages remaining in the invalidated range
> + */
> +unsigned long invalidate_mapping_pages_check(struct address_space *mapping,
> + pgoff_t start, pgoff_t end)
> +{
> + unsigned long nr_invalidated, nr_remaining;
> +
> + __invalidate_mapping_pages(mapping, start, end,
> + &nr_invalidated, &nr_remaining);
> +
> + return nr_remaining;
> +}
> +
> /*
> * This is like invalidate_complete_page(), except it ignores the page's
> * refcount. We do this because invalidate_inode_pages2() needs stronger

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/