Re: [CFT] delayed allocation and multipage I/O patches for 2.5.6.

From: Daniel Phillips (phillips@bonn-fries.net)
Date: Thu Mar 14 2002 - 06:59:27 EST


On March 13, 2002 08:50 pm, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > That's the thrust of my current work - massaging things into a form where
> > struct page can be substituted for buffer_head as the block data handle
> > for the mass of filesystems that use it.
> > ...
> >
> > For me, the missing piece of the puzzle is how to recover the semantics of
> > ->b_flushtime. The crude solution is just to put that in struct page for
> > now. At least that's a wash in terms of size because ->buffers goes out.
>
> I'm currently doing that in struct address_space(!). Maybe struct inode
> would make more sense...

I don't know, when do we ever use an address_space that's not attached to an
inode? Hmm, swapper_space. Though nobody does it now, it's also possible to
have more than one address_space per inode. This confuses the issue because
you don't know whether to flush them together, separately, or what. By the
time things get this murky, it's time for the VM to step back and let the
filesystem itself establish the flushing policy. No, we don't have any model
for how to express that, and we need one. What you're developing here is
a generic_flush, mainly for use by dumb filesystems, and by lucky accident,
also suitable for Ext3.

So, hrm, the sky won't fall in either place you put it.
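
To make that concrete, here's a toy user-space sketch of the
address_space variant.  The structures and the dirtied_when /
mark_mapping_dirty names are invented for illustration, not what's
actually in 2.5:

#include <time.h>

/* Toy stand-ins for the structures under discussion; only the
 * fields relevant here are shown, and dirtied_when is the
 * hypothetical replacement for the old per-buffer ->b_flushtime. */
struct address_space {
	unsigned long	nrpages;
	time_t		dirtied_when;	/* first-dirty time, 0 if clean */
};

struct inode {
	struct address_space	i_data;		/* the usual mapping... */
	struct address_space	*i_mapping;	/* ...which may point elsewhere */
};

/* Stamp the mapping the first time it goes dirty; later dirtyings
 * leave the stamp alone, so it records the *first* dirtying. */
static void mark_mapping_dirty(struct address_space *mapping)
{
	if (!mapping->dirtied_when)
		mapping->dirtied_when = time(NULL);
}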

> So the mapping records the time at which it was first dirtied. So the
> `kupdate' function simply writes back all files which had their
> first-dirtying time between 30 and 35 seconds ago.

I guess you have to be careful to set the first-dirtying time again after
submitting all the IO, in case the inode got dirtied again while you were
busy submitting.
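
In code, one way to get the effect I mean looks something like this,
reusing the toy dirtied_when stamp from the sketch above; the
writeback callback is just a placeholder:

/* Avoid losing a re-dirty that happens while we're busy submitting:
 * clear the stamp *before* the submission pass, so any page dirtied
 * underneath us restamps the mapping and restarts the 30 second
 * clock, instead of looking like it was already flushed. */
static void flush_mapping(struct address_space *mapping,
			  void (*writeback)(struct address_space *))
{
	mapping->dirtied_when = 0;
	writeback(mapping);		/* submit IO for the dirty pages */
}

/* kupdate-style test: first dirtied between 30 and 35 seconds ago. */
static int mapping_needs_flush(struct address_space *mapping, time_t now)
{
	time_t age = now - mapping->dirtied_when;

	return mapping->dirtied_when && age >= 30 && age <= 35;
}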

> That works OK, but it also needs to cope with the case of a single
> huge dirty file. For that case, periodic writeback also terminates
> when it has written back 1/6th of all the dirty pages in the machine.

You need a way of cycling through all the inodes on the system reliably;
otherwise you'll get nasty situations where repeated dirtying starves
some inodes of updates.  This also needs file-offset resolution, or more
flush-starvation corner cases will start crawling out of the woodwork.

The 1/6th rule is an oversimplification; it should at least be based on
how much flush IO is already in flight, and on other, more difficult
measures we haven't even started to address yet, such as how much and
what kind of other IO is competing for the same bandwidth, and how much
bandwidth is available.
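
For reference, the termination rule under discussion amounts to roughly
this (a sketch only; the mapping array, counters and write_some()
callback are placeholders, and the comment spells out what the 1/6th
figure ignores):

/* Periodic pass that stops once roughly 1/6th of the dirty pages in
 * the machine have been written, so one huge dirty file can't hog the
 * whole pass.  What it doesn't look at: flush IO already in flight,
 * competing IO, or the bandwidth actually available. */
static unsigned long periodic_writeback(struct address_space *mappings[],
					int count,
					unsigned long total_dirty_pages,
					unsigned long (*write_some)(struct address_space *))
{
	unsigned long quota = total_dirty_pages / 6;
	unsigned long written = 0;
	int i;

	for (i = 0; i < count && written < quota; i++)
		written += write_some(mappings[i]);

	return written;
}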

> This is all fairly arbitrary, and is basically designed to map onto
> the time-honoured behaviour. I haven't observed any surprises from
> it, nor any reason to change it.

It's surprisingly resistant to flaming.  The starvation problem is going
to get ugly; it's just as hard as the elevator starvation question, and
the crude, inode-level resolution of the flushtime makes it tricky.  But
I think it can be beaten into some kind of shape where the corner cases
are seen to be bounded.

I don't know, I need to think about it more. It's both convenient and
strange not to maintain per-block flushtime.

> We also need to discuss writeback of metadata. For delayed allocate
> files, indirect blocks are not a problem, because I start I/O against
> them *immediately*, as soon as they're dirtied. This is because we
> know that the indirect's associated data blocks are also under I/O.
>
> Which leaves bitmaps and inode blocks. These I am leaving on the
> dirty buffer LRU, so nothing has changed there.

Easy, give them an address_space.
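
Something along these lines, that is.  Only a sketch: the super_block
stand-in and the meta_mapping name are invented, not existing fields:

/* Hang a mapping off the superblock for bitmap and inode-table
 * blocks, so they pick up the same dirtied_when stamping and
 * periodic flushing as file data, instead of sitting on the dirty
 * buffer LRU. */
struct super_block {
	struct address_space	meta_mapping;	/* bitmaps, inode tables */
};

static void dirty_metadata_block(struct super_block *sb)
{
	mark_mapping_dirty(&sb->meta_mapping);	/* same path as file data */
}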

[snip fascinating/timesucking sortie into online defrag]

-- 
Daniel