Fwd: Re: ext3_readdir() readahead problem

From: Wu Fengguang
Date: Thu Nov 10 2005 - 03:55:12 EST


Just in case this message benefits someone on the mailing list ;)

--- Begin Message ---
Wu Fengguang <wfg@xxxxxxxxxxxxxxxx> wrote:
>
> On Wed, Nov 09, 2005 at 07:34:58PM -0800, Andrew Morton wrote:
> > Wu Fengguang <wfg@xxxxxxxxxxxxxxxx> wrote:
> > Ah, OK. Rather than showing a stream of numbers it really helps if you can
> > tell people what the numbers _mean_.
> Got it. Thanks!
> > > >

Minor point: your emails are much more readable if you put a blank line
before and after your paragraphs, like this ;)

> >
> > Part of the page. If PAGE_CACHE_SIZE=4k and it's a 1k blocksize
> > filesystem, we'll only read 1k from disk.
> So it's one block, or one buffer_head, am I right?

buffer_head is misnamed. It used to be both a caching concept and an IO
container. It's still an IO container sometimes, but it really should be
renamed `struct block'. It is the kernel's core abstraction for a disk
block. Usually of size <= PAGE_CACHE_SIZE. There are a few places where
bh->b_size is >PAGE_CACHE_SIZE, in the get_blocks() callback. But that's
an exception.

A buffer_head is metadata against a struct page, telling us the state of a
subsection of a page, and also telling us the disk mapping (ie:
partition-relative block number) for that page subsection.
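
As a rough userspace illustration of the description above (per-block state
plus a disk mapping for one subsection of a page), the sketch below models a
4k page on a 1k-blocksize filesystem. The names `struct block`,
`blocks_per_page()` and `block_for()` are hypothetical, and the contiguous
on-disk layout is an assumption made only to keep the arithmetic simple; this
is not the kernel's actual struct buffer_head.

```c
#include <assert.h>

#define PAGE_CACHE_SIZE 4096u   /* assumed 4k page, as in the example above */

/* Hypothetical stand-in for a buffer_head: the state and disk mapping of
 * one block-sized subsection of a page. Not the kernel's real layout. */
struct block {
    unsigned long long blocknr; /* partition-relative block number */
    unsigned int size;          /* block size in bytes */
    unsigned int offset;        /* byte offset of this block within its page */
    int uptodate;               /* nonzero once the data has been read from disk */
};

/* Number of blocks needed to cover one page for a given block size. */
static unsigned int blocks_per_page(unsigned int blocksize)
{
    assert(blocksize != 0 && PAGE_CACHE_SIZE % blocksize == 0);
    return PAGE_CACHE_SIZE / blocksize;
}

/* Describe the block that covers byte `pos` of page `page_index`.
 * Assumption for illustration only: the file is laid out contiguously on
 * disk starting at `first_block`. A real filesystem has to look the
 * mapping up instead. */
static struct block block_for(unsigned long page_index, unsigned int pos,
                              unsigned int blocksize,
                              unsigned long long first_block)
{
    struct block b;
    unsigned int idx;

    assert(pos < PAGE_CACHE_SIZE);
    idx = pos / blocksize;      /* which subsection of the page */

    b.blocknr = first_block
              + (unsigned long long)page_index * blocks_per_page(blocksize)
              + idx;
    b.size = blocksize;
    b.offset = idx * blocksize;
    b.uptodate = 0;             /* nothing has been read yet */
    return b;
}
```

With these assumptions, blocks_per_page(1024) is 4, and
block_for(2, 2500, 1024, 1000) describes block 1010, covering bytes
2048..3071 of that page. Reading it would fetch only that 1k from disk, not
the whole 4k page.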

> > > suboptimal to let both ext3_readdir() and page_cache_readahead() do some part
> > > of I/O. The best scheme should be to test page existence and call read-ahead in
> > > the very beginning (maybe before ext3_getblk()).
> >
> > ext3_getblk() doesn't actually read the block from disk. All it will do is
> > to determine the location of the block on disk. Plus if it's a write and
> > if we newly created the block, ext3 will perform journalling of the buffer.
>
> But it inserts the page into the radix tree, which effectively prevents
> __do_page_cache_readahead() from reading that page, and makes
> (actual <= req_size-1) in the following trace.

erk. That's pretty screwed up, isn't it?

I need to think again. Thanks.
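
The interaction discussed above can be modelled in a few lines of userspace
C. This is only a toy sketch: `struct toy_cache`, `toy_insert()` and
`toy_readahead()` are hypothetical names, and a flat array stands in for the
radix tree. It assumes, as the thread describes, that read-ahead skips any
page index that is already present in the cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PAGES 64             /* arbitrary size for the toy cache */

/* Toy page cache: present[i] is true if page index i is already cached.
 * A flat array stands in for the per-file radix tree. */
struct toy_cache {
    bool present[NR_PAGES];
};

/* Models a getblk-style path that puts a page into the cache without
 * reading its data from disk. */
static void toy_insert(struct toy_cache *c, size_t index)
{
    assert(index < NR_PAGES);
    c->present[index] = true;
}

/* Models read-ahead over [start, start + req_size): pages already in the
 * cache are skipped. Returns how many pages were actually submitted. */
static size_t toy_readahead(struct toy_cache *c, size_t start, size_t req_size)
{
    size_t actual = 0;
    size_t i;

    for (i = start; i < start + req_size && i < NR_PAGES; i++) {
        if (c->present[i])
            continue;           /* already cached: read-ahead skips it */
        c->present[i] = true;   /* page allocated and queued for I/O */
        actual++;
    }
    return actual;
}
```

If page 8 is inserted first and read-ahead is then asked for 4 pages
starting at index 8, toy_readahead() returns 3: the pre-inserted page is
skipped, so actual == req_size - 1.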


--- End Message ---