Re: [PATCH/RFC] Simplified Readahead
From: Steven Pratt
Date: Mon Sep 27 2004 - 10:32:11 EST
Ram Pai wrote:
On Fri, 24 Sep 2004, Ram Pai wrote:
On Thu, 23 Sep 2004, Steven Pratt wrote:
The readahead code has undergone many changes in the 2.6 kernel and the
current implementation is, in my opinion, obtuse and hard to maintain. We
would like to offer up an alternative simplified design which will not
only make the code easier to maintain but, as performance tests have
shown, also result in better performance in many cases.
We are very interested in having others review and try out the code and
run whatever performance tests they see fit.
Quick overview of the new design:
The key design point of the new design is to make the readahead code
aware of the size of the I/O request. This change eliminates the need
to treat large random I/O as sequential, along with all of the averaging
code that exists just to support this. In addition, the new design
ramps up quicker and shuts off faster. This, combined with the
request-size awareness, eliminates the so-called "slow read path"
that we try so hard to avoid in the current code. For complete details
on the design of the new readahead logic, please refer to
http://www-124.ibm.com/developerworks/opensource/linuxperf/readahead/read-ahead-design.pdf
There are a few exception cases which still concern me.
1. pages already in cache
2. I/O queue congestion.
3. page stealing
The first of these is a file already residing in the page cache. If we
do not code for this case, we will end up doing multiple page lookups
for each page. The current code tries to handle this using the
check_ra_success function, but this code does not work.
check_ra_success will subtract 1 page each time an I/O is completely
contained in the page cache; however, on the main path we will increment
the window size by 2 for each page in the request (up to max_readahead),
thus negating the reduction. My question is: what is the right behavior?
Not exactly true. If you look at the parameters of check_ra_success(),
it takes orig_next_size as its third parameter, and orig_next_size is
initialized to next_size right at the beginning of the function.
So if the pages are already in the page cache, next_size is effectively
decremented by 1.
Ah.. I misread what you said. Right, the first time next_size is decremented,
but since those pages will reside in the current window, from then on,
since the pages are already in the page cache, next_size keeps incrementing.
Hence check_ra_success() effectively becomes useless.
I looked at your code to see how you handled page cache hits, and if my
reading is correct, you don't handle it at all?
Correct, in the first version I sent I did not handle this. In the
second version I do.
Reducing the size of the ahead window doesn't help; you must turn off
readahead entirely to have any effect. Once we believe all pages to be
in the page cache, we should just immediately turn off readahead. But
what is the trigger point? 4 I/Os in a row? 400? My concern is that
the savings from skipping the double lookup appear to be on the order
of 0.5% CPU in my tests, while the penalty for small I/O in the
sequential read case can be substantial. Currently the new code does
not handle this case, but it could be enhanced to do so.
OK, so you don't handle that case either.
Right, the code currently ignores queue congestion, but based on Andrew
and Nick's comments I think we will change that slightly.
Currently the code does handle this automatically. If next_size is 'n',
then if the next 'n' window reads are already in the page cache,
readahead turns off. The problem with shutting down readahead the first
time all the pages are found in the cache is that you will then end up
in the slow read path, and it becomes miserable: from then on only one
page gets read at a time, even though the read request is for a larger
number of pages. This behavior needs to be fixed.
The second case is queue congestion. The current code does not submit
the I/O if the queue is congested. This will result in each page being
read serially on the cache miss path. Does submitting 32 4k I/Os
instead of 1 128k I/O help queue congestion? Here is one place where
the current code gets really confusing. We reduce the window by 1 page
for queue congestion (treated the same as a page cache hit), but we
leave the information about the current and ahead windows alone even
though we did not issue the I/O to populate them, and so we will never
issue that I/O from the readahead code. Eventually we will start taking
cache misses since no one read the pages. That code decrements the
window size by 3, but as in the cache hit case, since we are still
reading sequentially, we keep incrementing by 2 for each page; the net
effect is -1 for each page not in cache. Again, the new code ignores
the congestion case and still tries to do readahead, thus
minimizing/optimizing the I/O requests sent to the device. Is this
right? If not, what should we do?
Yes. The code is currently not differentiating
queue congestion from 'pages already in the page cache'.
I think if the queue is congested and we are trying to populate
the current window, we had better wait by calling schedule();
if we are trying to populate the readahead window, we can return
immediately, resetting ahead_size and ahead_start?
The third exception case is page stealing, where the page into which
readahead was done is reclaimed before the data is copied to user
space. This would seem to be a somewhat rare case, only happening under
severe memory pressure, but my tests have shown that it can occur quite
frequently with as few as 4 threads doing 1M readahead, or 16 threads
doing 128k readahead, on a machine with 1GB of memory of which 950MB is
page cache. Here it would seem the right thing to do is shrink the
window size to reduce the chance of page stealing; this, however, kills
I/O performance if done too aggressively. Again, the current code may
not perform as expected. As in the 2 previous cases, the -3 is offset
by a +2, and so unless > 2/3 of the pages in a given window are stolen,
the net effect is to ignore the page stealing. The new code will slowly
shrink the window as long as stealing occurs, but will quickly regrow
it once stealing stops.
I have seen pages not up2date on most of my stress tests, but have not
observed page frames being stolen before the pages are accessed.
Both excessive readahead thrashing and page-cache hits imply that
readahead needs to be temporarily halted. How about decrementing a
counter for every page that is thrashed or is hit in the page cache,
and incrementing the counter for every page miss or non-thrashed page?
If the counter reaches 0, stop readahead. If the counter reaches
max-readahead, resume readahead. Something along these lines...
What I do now for page cache hits is count how many pages in a row are
found in the page cache when trying to do readahead. If any page is
missing, I reset the count to 0 and start over, since we want to avoid
fragmented/small I/Os. The limit on how many pages in a row disable
readahead is arbitrarily set at 2560 (10MB). I don't know if this is a
good number or not.
To summarize you noticed 3 problems:
1. page cache hits not handled properly.
2. readahead thrashing not accounted.
3. read congestion not accounted.
Yes.
Currently, neither patch handles all of the above cases.
No, thrashing was handled in the first patch, and both thrashing and
page cache hits are handled in the second. Also, it seems to be the
consensus that on normal I/O ignoring queue congestion is the right
behavior.
So if your patch performs much better than the current one, then
it is the winner anyway. But past experience has shown that some
benchmark gets hit by any small change, and this happens to be too big
a change.
I agree, we need more people to test this.
The code looks familiar ;)
RP
Steve