Re: [RFC][PATCH v3] readahead: introduce O_RANDOM for POSIX_FADV_RANDOM

From: Minchan Kim
Date: Mon Jan 04 2010 - 00:20:57 EST


Hi, Wu.

On Mon, Jan 4, 2010 at 1:50 PM, Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
> This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.
>
> POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor
> performance: a 16K read will be carried out in 4 _sync_ 1-page reads.
>
> In other places, ra_pages==0 means
> - it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
> - some IO error happened
> where multi-page read IO won't help or should be avoided.
>
> POSIX_FADV_RANDOM actually wants different semantics: to disable the
> *heuristic* readahead algorithm, and to use a dumb one which faithfully
> submits read IO for whatever the application requests.
>
> So introduce a flag O_RANDOM for POSIX_FADV_RANDOM.
> It will be visible to fcntl(F_GETFL).
>
> Note that the random hint is not likely to help random read performance
> noticeably. And it may be too permissive on huge request sizes (its IO
> size is not limited by read_ahead_kb).
>
> In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall
> (NFS read) performance of the application increased by 313%!
>
> v3: use O_RANDOM to indicate both read/write access pattern as in
>     posix_fadvise(), although it only takes effect for read() now
>     (proposed by Quentin)
> v2: use O_RANDOM_READ to avoid race conditions (pointed out by Andi)
>
> CC: Nick Piggin <npiggin@xxxxxxx>
> CC: Andi Kleen <andi@xxxxxxxxxxxxxx>
> CC: Steven Whitehouse <swhiteho@xxxxxxxxxx>
> CC: David Howells <dhowells@xxxxxxxxxx>
> CC: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
> CC: Jonathan Corbet <corbet@xxxxxxx>
> CC: Christoph Hellwig <hch@xxxxxxxxxxxxx>
> Tested-by: Quentin Barnes <qbarnes+nfs@xxxxxxxxxxxxx>
> Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
> ---
>  include/asm-generic/fcntl.h |    4 ++++
>  mm/fadvise.c                |   10 +++++++++-
>  mm/readahead.c              |    6 ++++++
>  3 files changed, 19 insertions(+), 1 deletion(-)
>
> --- linux.orig/include/asm-generic/fcntl.h  2010-01-04 12:39:29.000000000 +0800
> +++ linux/include/asm-generic/fcntl.h  2010-01-04 12:40:11.000000000 +0800
> @@ -80,6 +80,10 @@
>  #define O_NDELAY       O_NONBLOCK
>  #endif
>
> +#ifndef O_RANDOM
> +#define O_RANDOM       010000000       /* random access pattern hint */
> +#endif
> +
>  #define F_DUPFD        0       /* dup */
>  #define F_GETFD        1       /* get close_on_exec */
>  #define F_SETFD        2       /* set/clear close_on_exec */
> --- linux.orig/mm/fadvise.c  2010-01-04 12:39:29.000000000 +0800
> +++ linux/mm/fadvise.c  2010-01-04 12:39:30.000000000 +0800
> @@ -77,12 +77,20 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, lof
>          switch (advice) {
>          case POSIX_FADV_NORMAL:
>                  file->f_ra.ra_pages = bdi->ra_pages;
> +                spin_lock(&file->f_lock);
> +                file->f_flags &= ~O_RANDOM;
> +                spin_unlock(&file->f_lock);
>                  break;
>          case POSIX_FADV_RANDOM:
> -                file->f_ra.ra_pages = 0;
> +                spin_lock(&file->f_lock);
> +                file->f_flags |= O_RANDOM;
> +                spin_unlock(&file->f_lock);
>                  break;
>          case POSIX_FADV_SEQUENTIAL:
>                  file->f_ra.ra_pages = bdi->ra_pages * 2;
> +                spin_lock(&file->f_lock);
> +                file->f_flags &= ~O_RANDOM;
> +                spin_unlock(&file->f_lock);
>                  break;
>          case POSIX_FADV_WILLNEED:
>                  if (!mapping->a_ops->readpage) {
> --- linux.orig/mm/readahead.c  2010-01-04 12:39:29.000000000 +0800
> +++ linux/mm/readahead.c  2010-01-04 12:39:30.000000000 +0800
> @@ -501,6 +501,12 @@ void page_cache_sync_readahead(struct ad
>          if (!ra->ra_pages)
>                  return;
>
> +        /* be dumb */
> +        if (filp->f_flags & O_RANDOM) {
> +                force_page_cache_readahead(mapping, filp, offset, req_size);
> +                return;
> +        }
> +

Let me ask a dumb question. :)

How about testing O_RANDOM before the ra_pages test?

My point is that even with readahead turned off, it would be better to
read a contiguous block all at once than to have the readpage() callback
do I/O one page at a time.

Would that break some semantics or cause a problem in ondemand readahead?
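
To make the question concrete, here is a small userspace model of the two
orderings. It is only a sketch, not kernel code: struct ra_model, the
enum, O_RANDOM_HINT and the function names are made up for illustration,
and the only things taken from the patch are the flag value and the order
of the two tests. RA_NONE stands for "return without readahead", after
which the caller ends up in readpage() one page at a time.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the O_RANDOM value proposed in the patch. */
#define O_RANDOM_HINT 010000000

/* What the sync readahead path decides to do (illustrative names). */
enum ra_action {
	RA_NONE,	/* return early; caller reads page by page */
	RA_FORCED,	/* submit the whole request at once */
	RA_ONDEMAND	/* run the heuristic readahead */
};

/* Minimal model of the two fields the decision depends on. */
struct ra_model {
	unsigned int f_flags;
	unsigned long ra_pages;
};

/* Ordering as posted: ra_pages is tested first. */
enum ra_action sync_readahead_as_posted(const struct ra_model *f)
{
	if (!f->ra_pages)
		return RA_NONE;
	if (f->f_flags & O_RANDOM_HINT)
		return RA_FORCED;
	return RA_ONDEMAND;
}

/* Ordering asked about: the random hint is tested first. */
enum ra_action sync_readahead_flag_first(const struct ra_model *f)
{
	if (f->f_flags & O_RANDOM_HINT)
		return RA_FORCED;
	if (!f->ra_pages)
		return RA_NONE;
	return RA_ONDEMAND;
}

/* True when the two orderings disagree for the given state. */
bool orderings_differ(const struct ra_model *f)
{
	return sync_readahead_as_posted(f) != sync_readahead_flag_first(f);
}

/* Walk the four combinations; only hint set + ra_pages == 0 differs. */
void demo_orderings(void)
{
	struct ra_model hint_ra_on   = { O_RANDOM_HINT, 32 };
	struct ra_model hint_ra_off  = { O_RANDOM_HINT, 0 };
	struct ra_model plain_ra_on  = { 0, 32 };
	struct ra_model plain_ra_off = { 0, 0 };

	assert(!orderings_differ(&hint_ra_on));
	assert(orderings_differ(&hint_ra_off));
	assert(!orderings_differ(&plain_ra_on));
	assert(!orderings_differ(&plain_ra_off));
}
```

In this model the two orderings differ in exactly one case: the hint is
set and ra_pages is 0. Whether a forced multi-page read is safe there
(ra_pages is also 0 for ramfs/tmpfs and after IO errors, as the
changelog says) is the part I am unsure about.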

>          /* do read-ahead */
>          ondemand_readahead(mapping, ra, filp, false, offset, req_size);
>  }
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>



--
Kind regards,
Minchan Kim