Re: [PATCH/RFC] Simplified Readahead

From: Ram Pai
Date: Wed Sep 29 2004 - 20:13:52 EST


On Wed, 29 Sep 2004, Steven Pratt wrote:

> Ram Pai wrote:
>
> snip...
>
> >
> >
> > I have enclosed 5 patches that address each of the issues.
> >
> >1 . Code is obtuse and hard to maintain
> >
> > The best I could do is update the comments to reflect the
> > current code. Hopefully that should help.
> >
> > attached patch 1_comment.patch takes care of that part to
> > some extent.
> >
> >
> I was more concerend with the multiple ways in which the readhead
> changed, and the fact that we were going down the sequential read path
> on random I/O requests. More comments are always good, but that was not
> really my concern.
>
> >
> >2. page cache hits not handled properly.
> >
> > I fixed this by decrementing the size of the next readahead window
> > by the number of pages hit in the page cache. Now it slowly
> > accomodates the page cache hits.
> >
> > attached patch 2_cachehits.patch takes care of this issue.
> >
> >
> I think this will be way too agressive. This means turn off readhead if
> 1 max_readhead I/O is completely in cache. You then will need to do
> multiple 4k I/Os to get it turned back on.

Having a max_readahead pages completely in cache is pretty good reason
to switch-off readahead. Yes it is aggressive, and the downside of
this is multiple 4k I/o. That is the reason we need the slow-read
patch[patch number 5] to speed up the slow-read path.


>
> >3. queue congestion not handled.
> >
> > The fix is: call force_page_cache_readahead() if we are
> > populating pages in the current window.
> > And call do_page_cache_readahead() if we are populating
> > pages in the ahead window. However if do_page_cache_readahead()
> > return with congestion, the readahead window is collapsed back
> > to size zero. This will ensure that the next time ahead window
> > is attempted to populate.
> >
> > attached patch 3_queuecongestion.patch handles this issue.
> >
> >
> Yeah, I had thought about something along these lines. Just not sure if
> it is worth it.

Well does not seem to hurt!

>
> >4. page thrash handled ineffectively.
> >
> > The fix is: on page thrash detection shutdown readahead.
> >
> > attached patch 4_pagethrash.patch handles this issue.
> >
> >
> Same comments as on 2. Way too agressive.

yes. But think about it. Why did a page get thrashed? there
is memory pressure. Hence better stop reading ahead. Again
yes it is agressive. And to compensate the agression we need the
slowread patch that fixes the slow read path.


>
> >5. slow read path is too slow.
> >
> > I could not figure out a way to atleast-read-the-requested-
> > number-of-pages if readahead is shutdown, without incorporating
> > the readsize parameter to page_cache_readahead(). So had
> > to pick some of your code in filemap.c to do that. Thanks!
> >
> > attached patch 5_fixedslowread.patch handles this issue.
> >
> >
> >
> Step in the right direction. Now if I could just get you to pick up the
> rest we would be done :-)

:-) I wish it was my call.


>
> >Apart from this you have noticed other issues
> >
> >6. cache lookup done unneccessrily twice for pagecache_hits.
> >
> > I have not handled this issue currently. But should be doable
> > if I introducing a flag, which notes when readahead is
> > shutdown by pagecahche hits. And hence attempts to lookup
> > the page only once.
> >
> >
> >
> Umm, actually you do (if the code works). When you get too many cache
> hits you turn off readahead. This will disable the multiple lookups
> until you re-enable readhead which will only happen in handle_ra_miss
> which means you are reading pages not in page cache so that is ok.


yes but with the slowread patch, we attempt to read pages
and then do_generic_mapping_read() anyway attempts to check if the
page frame is there. This is a small optimization. You did mention that you
saw .5% decrease in CPU time. Did'nt you?

>
> >And you have other features in your patch which will be the real
> >differentiating factors.
> >
> >7. exponential expand and shrink of window sizes.
> >
> >8. overlapped read of current window and ahead window.
> >
> > ( I think both are desirable feature )
> >
> >
> Glad we agree.
>
We never disagreed. Did we? :) The concern Andrew had was, have you
tried to fix the problem with the current code, and then check
if your patch does better with the fixed-code.

I am trying to help you get a solid case to get Andrew's acceptance.

So with these set of patches, your algorithm and the current algorithm
are on even playing ground to do some fair comparison.


> >I did run some premilinary tests using your patch and the above patches
> >and found
> >
> >your patch was doing slightly better on iozone and sysbench.
> >however the above patch were doing slightly better with DSS workload.
> >
> >
> Care to expand on slightly better? Also these tests don't cover many
> cases. You are only running iozone single threaded which won't show
> much, you don't seem to vary IO requests sizes to see the effect (all
> this based on your readahead web page). Also it would be really helpful
> if you had some sort of machine description, disk, memory etc.

2proc 700Mhz , 2GBmemory. Ultra160 SCSI disk.

> Also for
> the DSS workload I need ot know what type of IO pattern you generate. I
> know it is mostly random IO, but the size of the IOs and the number of
> prefetcher threads and the number of disks make a huge difference.

256k iosize with 50 prefetchers.

>
> Also what is the units in the iozone results? I thought it was in
> kbytes which would make your throughputs in the 350-400MB/sec range
> which is way more than you could do on a single adapter. So unless I am
> really off base here, your IOzone results appear to be mostly cache
> reads and thus not really testing readahead. I must have missed something.

correct observation. My filesize is 1.2GB and memory size is 2GB so it
will end up being entirely cache based. Good observation. Rerunning
with a larger file.

>
> >But my setup is rather tiny compared to your setup, so my comparison
> >is rather incomplete.
> >
> >
> Not really. I made sure that I ran this on machine from single cpu ide
> up through really big machines. It is important to run well across a
> variety of platforms which I tried to ensure my code does.
>
>
> I'll try to run this through my test suite tomorrow

sure. that will help. And once results are in place, its Andrew's call.
No, these patches are not in competition with yours.
Its just helping you to convince others.

> , but is there
> something you don't like about the new code? You seem to be moving in
> that direction. Is there any reason to not make the complete jump and
> help out on the (hopefully) simpler code base?

Thought I was helping you out, No?

BTW: the last patch 5_fixedslowread.patch accidently did not contain all
the changes. Enclosed the remaining changes.

RP
>
> Steve
>
--- linux-2.6.8/mm/filemap.c 2004-09-29 17:30:17.089248464 -0700
+++ linux-2.6.8.comment.pagecachehit.queuecongestion.pagethrash/mm/filemap.c 2004-09-29 18:13:02.968175576 -0700
@@ -717,14 +717,14 @@ void do_generic_mapping_read(struct addr
read_actor_t actor)
{
struct inode *inode = mapping->host;
- unsigned long index, end_index, offset;
+ unsigned long index, end_index, offset, req_size, next_index;
loff_t isize;
struct page *cached_page;
int error;
struct file_ra_state ra = *_ra;

cached_page = NULL;
- index = *ppos >> PAGE_CACHE_SHIFT;
+ next_index = index = *ppos >> PAGE_CACHE_SHIFT;
offset = *ppos & ~PAGE_CACHE_MASK;

isize = i_size_read(inode);
@@ -732,6 +732,8 @@ void do_generic_mapping_read(struct addr
goto out;

end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
+ req_size = end_index - index + 1;
+
for (;;) {
struct page *page;
unsigned long nr, ret;
@@ -749,12 +751,22 @@ void do_generic_mapping_read(struct addr
nr = nr - offset;

cond_resched();
- page_cache_readahead(mapping, &ra, filp, index);
+ /*
+ * Let page_cache_readahead() know that we are accessing
+ * 'req_size' number of pages.
+ */
+ if(index == next_index && req_size) {
+ unsigned long ret_size;
+ ret_size = page_cache_readahead(mapping, &ra,
+ filp, index, req_size);
+ next_index += ret_size;
+ req_size -= ret_size;
+ }

find_page:
page = find_get_page(mapping, index);
if (unlikely(page == NULL)) {
- handle_ra_miss(mapping, &ra, index);
+ handle_ra_miss(mapping, &ra, index, 1);
goto no_cached_page;
}
if (!PageUptodate(page))
@@ -1195,7 +1207,7 @@ retry_all:
* For sequential accesses, we use the generic readahead logic.
*/
if (VM_SequentialReadHint(area))
- page_cache_readahead(mapping, ra, file, pgoff);
+ page_cache_readahead(mapping, ra, file, pgoff, 1);

/*
* Do we have something in the page cache already?
@@ -1206,7 +1218,7 @@ retry_find:
unsigned long ra_pages;

if (VM_SequentialReadHint(area)) {
- handle_ra_miss(mapping, ra, pgoff);
+ handle_ra_miss(mapping, ra, pgoff, 1);
goto no_cached_page;
}
ra->mmap_miss++;
--- linux-2.6.8/include/linux/mm.h 2004-09-29 17:30:15.857435728 -0700
+++ linux-2.6.8.comment.pagecachehit.queuecongestion.pagethrash/include/linux/mm.h 2004-09-29 18:06:40.500319528 -0700
@@ -721,12 +721,14 @@ int do_page_cache_readahead(struct addre
unsigned long offset, unsigned long nr_to_read);
int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
unsigned long offset, unsigned long nr_to_read);
-void page_cache_readahead(struct address_space *mapping,
+unsigned long page_cache_readahead(struct address_space *mapping,
struct file_ra_state *ra,
struct file *filp,
- unsigned long offset);
+ unsigned long offset,
+ unsigned long size);
void handle_ra_miss(struct address_space *mapping,
- struct file_ra_state *ra, pgoff_t offset);
+ struct file_ra_state *ra, pgoff_t offset,
+ unsigned long size);
unsigned long max_sane_readahead(unsigned long nr);

/* Do stack extension */