Re: [RFC] - Some notions that I would like comments on

Jamie Lokier (lkd@tantalophile.demon.co.uk)
Mon, 19 Jul 1999 19:41:56 +0200


Chuck Lever wrote:
> the whole reason to do read-ahead only when waiting for a page is that
> there is nothing else to do. the overhead of scheduling the extra reads
> is swallowed by the wait for the page we really want.

If you ever get into a situation where you have "nothing else to do"
because you're waiting for a page, your read-ahead strategy has already
failed.

Since ideally all pages are in cache when you want them, only the fast
path is ever executed. As read-ahead logic is required to achieve this,
it must go in the fast path. QED :)
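
To make the shape of that argument concrete, here is roughly what I
mean. A sketch only; find_cached_page() and read_page_in() are made-up
stand-ins, not the real functions:

	/* Illustration: the two paths through a nopage handler. */
	struct page *nopage_sketch(struct file *file, unsigned long offset)
	{
		struct page *page = find_cached_page(file, offset);

		if (page) {
			/* found_page: the fast path.  When read-ahead
			 * works perfectly this is the only path that
			 * ever runs, so the read-ahead check must
			 * live here. */
			return page;
		}

		/* no_cached_page: the slow path.  Getting here at
		 * all means read-ahead has already failed once. */
		return read_page_in(file, offset);
	}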

The rest is tuning.

> i think the biggest argument against this approach is that today, the page
> fault fast path in filemap_nopage is very straight and uncomplicated.
> adding read-ahead logic to that path will slow everyone down all the time.

There is a trade-off between perfectly predicted reads, which by
definition _must_ be predicted by logic in the fast path (see above),
and less predictable reads.

Adding a check to the fast path slows down all page faults, but
eliminates waiting in the perfectly predictable cases. The balance
depends on how many reads are predictable, how low the I/O latency is,
and how lean you can make the fast-path check.
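
To put rough numbers on that balance (purely illustrative figures, not
measurements):

	cost per fault = t_check
	win  per fault = P * t_wait    (P = chance the check turns a
	                                blocking read into an async one)

	worth it when   P > t_check / t_wait

	e.g. t_check = 100ns, t_wait = 10ms
	     => break-even at P = 1 in 100,000 faults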

> can you explain why you believe that you will have better read-ahead
> information in the found_page case than we already get in
> no_cached_page?

Because no_cached_page is never, ever run when read-ahead is working
perfectly. But perfect read-ahead cannot occur unless you actually run
the read-ahead logic. :)

> with my patch, the second run of a program whose WS almost exceeds
> physical RAM size will only read pages that have been made up to date.
> that's pretty ideal, i think.

You're doing tests & I'm not, so may I ask you: how much time does your
program spend waiting in the no_cached_page branch? Admittedly I would
not expect it to be long with a modern disk subsystem, but the ideal is
zero.

> my patch extracts the cluster page-in logic into a separate function, so
> it's easy to move and copy. i tried putting this function right at the
> top of filemap_nopage, and in found_page. both cases were slower than the
> stock kernel, i believe, because they slowed down the case that is most
> prevalent by far: where the page already resides in the page cache.

Hmm. I would think the minimal test in found_page could be inline and
this simple:

	/* inside the current window, but within v_rawin pages of
	 * its end: time to extend the window asynchronously */
	if (offset + area->v_rawin >= area->v_raend && offset < area->v_raend)
		goto soft_readahead;

This would only branch once per cluster for perfect sequential accesses
and hardly ever for random accesses. soft_readahead issues some I/O
requests, moves the window, then comes back here.
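
For concreteness, soft_readahead might look something like this. Again
a sketch only: start_async_read() stands in for whatever primitive
queues a page-in without sleeping, and v_ralen is another hypothetical
window field like v_rawin/v_raend above:

	soft_readahead:
		/* Queue asynchronous reads for the next chunk of
		 * the window and advance the window end.  Never
		 * sleep here; the whole point is to stay on the
		 * fast path. */
		{
			unsigned long index = area->v_raend;
			unsigned long end = index + area->v_ralen;

			while (index < end)
				start_async_read(file, index++);
			area->v_raend = end;
		}
		goto found_page;	/* back to the fast path */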

If you want to experiment you can use file->f_rawin etc.

> > The net result is once the readahead window opens up enough,
> > filemap_nopage never waits on I/O, not even once per cluster. With the
> > code in place it might even be worth using readahead that's smaller than
> > the random access readaround cluster size. More I/Os, less memory used.
>
> more i/o's means less disk scalability, IMO. if each program is reading
> ahead with more i/o's, that means there's less disk bandwidth and page
> cache space to share with other programs on the system. this may not
> affect a desktop system much, but it matters a great deal on servers, and
> definitely will make the knee between "system-wide WS fits in memory" to
> "system-wide WS no longer fits in memory" a much steeper one.

I said "readahead window opens up enough" which means, by definition,
the number of I/Os is reduced to the point where the disk bandwidth is
adequate :-)

This is made self-tuning by increasing the window until the
no_cached_page path is never triggered -- that's when you know your
disk bandwidth has caught up with the application.

You have to clamp this because some applications will always outpace the
disk, and the obvious limit is the standard max readahead size.
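
In code, the self-tuning could be a couple of lines in the slow path
(again a sketch; MAX_READAHEAD stands in for whatever the clamp turns
out to be):

	no_cached_page:
		/* We had to wait, so the window was too small:
		 * grow it for next time, clamped at the limit. */
		if (area->v_ralen < MAX_READAHEAD)
			area->v_ralen <<= 1;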

> i believe that ideally, i/o count for a read-ahead enabled system should
> be as close to a non read-ahead enabled system as possible. this reduces
> cache pollution and disk bandwidth requirements.

I don't understand you here. You say the number of I/Os should be
reduced, but the I/O count without read-ahead is surely larger, because
each page is read individually.
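
For example, faulting through 64 sequential pages one at a time costs
64 separate reads; with a 16-page read-ahead window the same pages
arrive in 4 larger requests. The same data moves, but the request
count drops by a factor of 16.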

-- Jamie
