clustering page-ins

Chuck Lever (cel@monkey.org)
Mon, 19 Jul 1999 16:24:21 -0400 (EDT)


On Mon, 19 Jul 1999, Jamie Lokier wrote:
> Chuck Lever wrote:
> > the whole reason to do read-ahead only when waiting for a page is that
> > there is nothing else to do. the overhead of scheduling the extra reads
> > is swallowed by the wait for the page we really want.
>
> If you ever get a situation where you've "nothing else to do", so you're
> waiting for a page, your read-ahead strategy has already failed.

i can think of several cases where that's not true. for instance, how
about the first read of a file? somehow the OS has to find out that an
application wants file data in the first place. should i add read-ahead to
open() and mmap()? the OS doesn't even know how the file will be used at
that point. what about truly random file accesses? read-ahead strategies
are speculative and heuristic; they will never be perfect, and we should
not expect them to be.

> Adding a check to the fast path slows down all page faults, but
> eliminates waiting in the perfectly-predictable cases.

so essentially, your argument is that it's OK to slow down programs that
can keep their entire working set in memory just so I/O-bound programs can
go a little faster? as memory becomes more plentiful, how can we justify
that trade-off?

soft page faults are often filled by sharing a page, too. it doesn't seem
worth it to check for read-ahead when the file is shared.

> You're doing tests & I'm not, so may I ask you: how much time does your
> program spend waiting in the no_cached_page branch? Admittedly I would
> not expect it to be long with a modern disk subsystem, but the ideal is
> zero.

there is no wait in the no_cached_page branch, really. all the pages in
the given cluster are requested from the disk, and in that process they
are added to the page cache if they aren't already there. the wait occurs
only if the page we want still isn't up to date after all the read-ahead
has been issued, and that happens in the mainline code.
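
to make that concrete, here is a rough userspace sketch of the shape of the
thing -- POSIX AIO standing in for the kernel's async page I/O, and the
constants, buffer, and function name all invented for illustration (error
checking omitted): start I/O for every page in the cluster, but only sleep
on the page the caller actually asked for.

	#include <aio.h>
	#include <string.h>

	#define PAGE_SZ		4096
	#define CLUSTER_PAGES	8

	static char cluster_buf[CLUSTER_PAGES][PAGE_SZ];

	void read_page_clustered(int fd, long index)
	{
		struct aiocb cbs[CLUSTER_PAGES];
		long start = index - (index % CLUSTER_PAGES);
		int i, wanted = (int)(index - start);

		memset(cbs, 0, sizeof(cbs));
		for (i = 0; i < CLUSTER_PAGES; i++) {
			cbs[i].aio_fildes = fd;
			cbs[i].aio_buf    = cluster_buf[i];
			cbs[i].aio_nbytes = PAGE_SZ;
			cbs[i].aio_offset = (start + i) * PAGE_SZ;
			aio_read(&cbs[i]);	/* queue the read, don't wait */
		}

		/* the only wait is for the page we really need */
		const struct aiocb *wait_for[1] = { &cbs[wanted] };
		aio_suspend(wait_for, 1, NULL);
	}

the kernel version is the same idea: queue the whole cluster up front, and
let the mainline code do the one wait.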

> Hmm. I would think the minimal test in found_page could be inline and
> this simple:
>
> if (offset + area->v_rawin >= area->v_raend && offset < area->v_raend)
> 	goto soft_readahead;
>
> This would only branch once per cluster for perfect sequential accesses
> and hardly ever for random accesses. soft_readahead issues some I/O
> requests, moves the window, then comes back here.

i think there is an assumption built into the design of read-ahead in
Linux: that files accessed via read() are more likely to be sequentially
accessed than files accessed via mmap(). thus the generic read path has
the more sophisticated read-ahead logic baked into it, and the filemap
code is built more for random access. IMHO, this design is the right
decision. what you are suggesting is copying some of the read-ahead logic
from generic_file_readahead into filemap_nopage, right? it might be
reasonable for filemap_nopage to check more carefully whether a file is
actually being accessed sequentially.
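
something along these lines could keep a little per-mapping state and
decide when the fault pattern is sequential enough to bother reading
ahead. the struct, the field names, and the threshold below are invented
just to show the shape of the check -- they are not what's in filemap.c:

	struct fault_ra_state {
		unsigned long	last_index;	/* index of the last page faulted in */
		unsigned int	seq_count;	/* consecutive in-order faults */
	};

	/* non-zero when the recent faults look sequential */
	static inline int fault_looks_sequential(struct fault_ra_state *ra,
						 unsigned long index)
	{
		if (index == ra->last_index + 1)
			ra->seq_count++;	/* next page in order */
		else
			ra->seq_count = 0;	/* random access, start over */
		ra->last_index = index;

		return ra->seq_count >= 2;	/* arbitrary threshold */
	}

only when a check like that fires would filemap_nopage move a read-ahead
window the way generic_file_readahead does; random faults would stay on
the cheap path.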

> > i believe that ideally, i/o count for a read-ahead enabled system should
> > be as close to a non read-ahead enabled system as possible. this reduces
> > cache pollution and disk bandwidth requirements.
>
> I don't understand you here. You say the number of i/os should be
> reduced, but the i/o count for non read-ahead is surely larger because
> each page is read individually.

quite the opposite. the goal of read-ahead is to prevent waiting for
i/o, not to reduce the overall number of blocks read.

ideally, an OS without read-ahead would read exactly the pages needed by
an application. this would be slow, but it would read the smallest number
of pages. read-ahead, because it is speculative, will almost certainly
read extra pages that the application never ends up requiring; thus any
read-ahead algorithm will generate more i/o than not reading ahead at all.
however, as the number of extra pages read during read-ahead grows, the
overall efficiency of the system drops, since reading pages nobody asked
for lowers the maximum rate at which the OS can satisfy the disk requests
applications actually make.
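
to put some made-up numbers on it: if an application touches 100 pages
scattered randomly through a file, and the fault path pulls in an 8-page
cluster around each one, the disk ends up moving roughly 800 pages to
satisfy 100 pages of real demand. the seeks are no fewer, and the other
700 pages just eat bandwidth and evict cache that something else could
have used.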

- Chuck Lever

--
corporate:	<chuckl@netscape.com>
personal:	<chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project: http://www.citi.umich.edu/projects/linux-scalability/
