Re: [patch] delayed disk block allocation

From: Andrew Morton (akpm@zip.com.au)
Date: Mon Mar 04 2002 - 02:20:13 EST


Daniel Phillips wrote:
>
> ...
> Why did you write [patch] instead of [PATCH]? ;-)

It's a start ;)

>...
> > Global accounting of locked and dirty pages has been introduced.
>
> Alan seems to be working on this as well. Besides locked and dirty we also
> have 'pinned', i.e., pages that somebody has taken a use count on, beyond the
> number of pte+pcache references.

Well, one little problem with tree owners writing code is that it's
rather hard for mortals to see what they've done, because the diffs
come with so much other stuff. Which is my lame excuse for not having
reviewed Alan's work. But if he has global (as opposed to per-mm,
per-vma, etc.) data then yes, it can go into the page_states[] array.
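
For concreteness, the accounting has roughly this shape - a sketch
only, with illustrative field names, kept per-CPU and summed on
demand to avoid cacheline ping-pong:

        /*
         * Sketch, not the patch's exact code: per-CPU counters for
         * the page states we care about, summed when needed.
         */
        struct page_state {
                unsigned long nr_dirty;         /* dirty pagecache pages */
                unsigned long nr_locked;        /* pages under I/O */
        } ____cacheline_aligned page_states[NR_CPUS];

        static unsigned long total_dirty(void)
        {
                unsigned long total = 0;
                int i;

                for (i = 0; i < smp_num_cpus; i++)
                        total += page_states[i].nr_dirty;
                return total;
        }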

> I'm just going to poke at a couple of inconsequential things for now, to show
> I've read the post. In general this is really important work because it
> starts to move away from the vfs's current dumb filesystem orientation.
>
> I doubt all the subproblems you've addressed are tackled in the simplest
> possible way, and because of that it's a cinch Linus isn't just going to
> apply this. But hopefully the benchmarking team descend upon this and find
> out if it does/doesn't suck, and hopefully you plan to maintain it through 2.5.

The problem with little, incremental patches is that they require a
high degree of planning, trust and design - a belief that the end
outcome will be right. That's hard, and it generates a lot of talk,
and the end outcome may *not* be right.

So in the best of worlds, we have the end outcome in-hand, and testable.
If it works, then we go back to the incremental patches.
 
> > Testing is showing considerable improvements in system tractability
> > under heavy load, while approximately doubling heavy dbench throughput.
> > Other benchmarks are pretty much unchanged, apart from those which are
> > affected by file fragmentation, which show improvement.
>
> What is system tractability?

Sorry. General usability when the system is under load. With these
patches it's better, but still bad.

Look. A process calls the page allocator to, duh, allocate some pages.
Processes do *not* call the page allocator because they suddenly feel
like spending fifteen seconds asleep on the damned request queue.

We need to throttle the writers, and only the writers. We need other
tasks to be able to obtain the full benefit of the rate at which the
disks can clean memory.

You know where this is headed, don't you:

- writeout is performed by the writers, and by the gang-of-flush-threads.
- kswapd is 100% non-blocking. It never does I/O.
- kswapd is the only process which runs page_launder/shrink_caches.
- Memory requesters do not perform I/O. They sleep until memory
  is available. kswapd gives them pages as they become available, and
  wakes them up (sketched below).
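
As a sketch only - alloc_page_blocking(), memory_wait and
wakeup_kswapd() are made-up names, and the real thing would plug
into __alloc_pages() - the requester side might look like:

        /*
         * Illustrative sketch.  The requester never scans or does
         * I/O: it kicks kswapd and sleeps until kswapd hands back
         * free memory.
         */
        static DECLARE_WAIT_QUEUE_HEAD(memory_wait);

        struct page *alloc_page_blocking(unsigned int gfp_mask)
        {
                struct page *page;
                DECLARE_WAITQUEUE(wait, current);

                add_wait_queue(&memory_wait, &wait);
                for (;;) {
                        set_current_state(TASK_UNINTERRUPTIBLE);
                        page = alloc_pages(gfp_mask, 0);
                        if (page)
                                break;
                        wakeup_kswapd();        /* assumed helper */
                        schedule();             /* kswapd wakes us */
                }
                set_current_state(TASK_RUNNING);
                remove_wait_queue(&memory_wait, &wait);
                return page;
        }

kswapd's side is then just a wake_up(&memory_wait) after it has
freed some pages.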

So that's the grand plan. It may be fatally flawed - I remember Linus
had a serious-sounding objection to it some time back, but I forget
what that was. We come badly unstuck if it's a disk-writer who
goes to sleep on the i-want-some-memory queue, but I don't think
it was that.

Still, this is just a VM rant. It's not the objective of this work.

 
> > With this patch, writepage() is still using the buffer layer, so lock
> > contention will still be high.
>
> Right, and buffers are going away one way or another.

This is a problem. I'm adding new stuff which does old things in
a new way, with no believable plan in place for getting rid of the
old stuff.

I don't think it's humanly possible to do away with struct buffer_head.
It is *the* way of representing a disk block. And unless we plan
to live with 4k pages and 4k blocks for ever, the problem is about
to get worse. Think 64k pages with 4k blocks.

Possibly we could handle sub-page segments of memory via a per-page
up-to-date bitmask. And then a `dirty' bitmask. And then a `locked'
bitmask, etc. I suspect eventually we'll end up with, say, a vector of
structures attached to each page which represents the state of each of
the page's sub-segments. Whoops - we've just reinvented the
buffer_head.
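
To see why, sketch what one element of that vector would have to
carry (names illustrative):

        /*
         * Per-subsegment state for a page.  Note how it converges
         * on struct buffer_head: per-block state bits plus the
         * identity of the backing disk block.
         */
        struct subpage_seg {
                unsigned long flags;            /* uptodate, dirty, locked */
                unsigned long blocknr;          /* backing disk block */
                struct block_device *bdev;      /* ...on this device */
                wait_queue_head_t wait;         /* for the `locked' bit */
        };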

So as a tool for representing disk blocks - for subdividing individual
pages of the block device's pagecache entries - buffer_heads make sense,
and I doubt if they're going away.

But as a tool for getting bulk file data on and off disk, buffer_heads
really must die. Note how submit_bh() now adds an extra memory
allocation (a bio) for each buffer as it goes past. Look at some 2.5
kernel profiles...
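
The direction for bulk data is to go straight to BIOs instead:
gather the pages and submit one big request. Roughly, against the
early-2.5 bio API (error handling and the completion handler
elided; writeback_end_io is a made-up name):

        struct bio *bio = bio_alloc(GFP_NOIO, nr_pages);
        int i;

        /* One bio covering nr_pages disk-contiguous pages, versus
         * one buffer_head (plus, now, one bio) per block. */
        bio->bi_sector = first_sector;
        bio->bi_bdev = bdev;
        for (i = 0; i < nr_pages; i++) {
                bio->bi_io_vec[i].bv_page = pages[i];
                bio->bi_io_vec[i].bv_len = PAGE_SIZE;
                bio->bi_io_vec[i].bv_offset = 0;
        }
        bio->bi_vcnt = nr_pages;
        bio->bi_size = nr_pages * PAGE_SIZE;
        bio->bi_end_io = writeback_end_io;
        submit_bio(WRITE, bio);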

> ...
> > Within the VM, the concept of ->writepage() has been replaced with the
> > concept of "write back a mapping". This means that rather than writing
> > back a single page, we write back *all* dirty pages against the mapping
> > to which the LRU page belongs.
>
> This is a good and natural step, but don't we want to go even more global
> than that and look at all the dirty data on a superblock, so the decision on
> what to write out is optimized across files for better locality.

Possibly, yes.

The way I'm performing writeback now is quite different from the
2.4 way. Instead of:

        for (buffer = oldest; buffer != newest; buffer++)
                write(buffer);

it's

        for (superblock = first; superblock != last; superblock++)
                for (dirty_inode = first; dirty_inode != last; dirty_inode++)
                        filemap_fdatasync(dirty_inode->i_mapping);

Again, by luck and by design, it turns out that this almost always
works. Care is taken to ensure that the ordering of the various
lists is preserved, and that we end up writing data in program-creation
order. Which works OK, because filesystems allocate inodes and blocks
in the way we expect (and desire).

What you're proposing is that, within the VM, we opportunistically
flush out more inodes - those which neighbour the one which owns
the page which the VM wants to free.

That would work. It doesn't particularly help us in the case where the VM
is trying to get free pages against a specific zone, but it would perhaps
provide some overall bandwidth benefits.

However, I'm kind of already doing this. Note how the VM's wakeup_bdflush()
call also wakes pdflush. pdflush will wake up, walk through all the
superblocks, find one which doesn't currently have a pdflush instance
working on it, and will start writing back that superblock's dirty pages.

(And the next wakeup_bdflush call will wake another pdflush thread,
which will go off and find a different superblock to sync, which is
in theory tons better than using a single bdflush thread for all dirty
data in the machine. But I haven't demonstrated practical benefit
from this yet).
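
The flavour of that walk, as a sketch (S_FLUSHING, s_pdflush_flags,
first_super(), next_super() and sync_dirty_inodes() are all made-up
names):

        /*
         * Each pdflush instance claims a superblock which no other
         * instance is flushing, syncs its dirty inodes, moves on.
         */
        static void background_sync(unsigned long unused)
        {
                struct super_block *sb;

                for (sb = first_super(); sb; sb = next_super(sb)) {
                        if (test_and_set_bit(S_FLUSHING, &sb->s_pdflush_flags))
                                continue;       /* another pdflush owns it */
                        sync_dirty_inodes(sb);  /* filemap_fdatasync() each */
                        clear_bit(S_FLUSHING, &sb->s_pdflush_flags);
                }
        }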

> ...
>
> > But it may come unstuck when applied to swapcache.
>
> You're not even trying to apply this to swap cache right now are you?

No.
 
> > Things which must still be done include:
> >
> > [...]
> >
> > - Remove bdflush and kupdate - use the pdflush pool to provide these
> > functions.
>
> The main disconnect there is sub-page sized writes, you will bundle together
> young and old 1K buffers. Since it's getting harder to find a 1K blocksize
> filesystem, we might not care. There is also my nefarious plan to make
> struct pages refer to variable-binary-sized objects, including smaller than
> 4K PAGE_SIZE.

I was merely suggesting a tidy-up here. pdflush provides a dynamically-sized
pool of threads for writing data back to disk. So we can remove the
dedicated kupdate and bdflush kernel threads and replace them with:

wakeup_bdflush()
{
        pdflush_operation(sync_old_buffers, NULL);
}
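
And kupdate's periodic flavour can come from a self-rearming timer
rather than a dedicated thread. A sketch (the 5-second interval and
the names are assumed; the timer would be armed at init time with
init_timer()/mod_timer()):

        static struct timer_list kupdate_timer;

        static void kupdate_fire(unsigned long data)
        {
                /* do the periodic flush in a pdflush thread... */
                pdflush_operation(sync_old_buffers, NULL);
                /* ...and re-arm for the next interval */
                mod_timer(&kupdate_timer, jiffies + 5*HZ);
        }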

Additionally, we do need to provide ways of turning the kupdate,
bdflush and pdflush functions off and on. For laptops, swsusp, etc.
But these are really strange interfaces which have sort of crept
up on us over time. In this case we need to go back, work out
what we're really trying to do here, and provide a proper set of
interfaces rather than `kill -STOP $(pidof kupdate)' or whatever
the heck people are using.

> ...
> > - Use pdflush for try_to_sync_unused_inodes(), to stop the keventd
> > abuse.
>
> Could you explain please?

keventd is a "process context bottom half handler". It should provide
the caller with reasonably-good response times. Recently, schedule_task()
has been used for writing ginormous gobs of discontiguous data out to
disk because the VM happened to get itself into a sticky corner.

So it's another little tidy-up. Use the pdflush pool for this operation,
and restore keventd's righteousness.
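
In other words, replace the 2.4-style

        schedule_task(&unused_inodes_flush_task);

with something like (sketch - a small wrapper may be needed to
match pdflush's function signature):

        pdflush_operation(try_to_sync_unused_inodes, NULL);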

> ...
> I guess the thing to do is start thinking about parts that can be broken out
> because of obvious correctness. The dirty/locked accounting would be one
> candidate, the multiple flush threads another, and I'm sure there are more
> because you don't seem to have treated much as sacred ;-)

Yes, that's a reasonable ordering. pdflush is simple and powerful enough
to be useful even if the rest founders - it rationalises kupdate, bdflush,
the keventd abuse, etc. ratcache (the radix-tree pagecache) is ready,
IMO. The global page-accounting is not provably needed yet.

Here's another patch for you:

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.6-pre2/dallocbase-10-readahead.patch

It's against 2.5.6-pre2 base. It's a partial redesign and a big
tidy-up of the readahead code. It's largely described in the
comments (naturally).

- Unifies the current three readahead functions (mmap reads, read(2)
  and sys_readahead) into a single implementation.

- More aggressive in building up the readahead windows.

- More conservative in tearing them down.

- Special start-of-file heuristics.

- Preallocates the readahead pages, to avoid the (never demonstrated,
  but potentially catastrophic) scenario where allocation of readahead
  pages causes the allocator to perform VM writeout.

- (hidden agenda): Gets all the readahead pages gathered together in
  one spot, so they can be marshalled into big BIOs.

- reinstates the readahead tunables which Mr Dalecki cheerfully chainsawed.
  So hdparm(8) and blockdev(8) are working again. The readahead settings
  are now per-request-queue, and the drivers never have to know about them.

- Identifies readahead thrashing.

  Note "identifies". This is 100% reliable - it detects readahead
  thrashing beautifully (the shape of the detection is sketched after
  this list). It just doesn't do anything useful about it :(

  Currently, I just perform a massive shrink on the readahead window
  when thrashing occurs. This greatly reduces the amount of pointless
  I/O which we perform, and will reduce the CPU load. But big deal. It
  doesn't do anything to reduce the seek load, and it's the seek load
  which is the killer here. I have a little test case which goes from
  40 seconds with 40 files to eight minutes with 50 files, because the
  50 file case invokes thrashing. Still thinking about this one.

- Provides almost unmeasurable throughput speedups!
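
The detection, in sketch form (field names assumed from the
readahead state; MIN_READAHEAD is made up): the ra state remembers
the window it last submitted, and if the page a reader asks for
lies inside that window but has already been reclaimed, the pages
were evicted before they could be used - that's thrashing.

        /*
         * Called when the page at `index' should have been in
         * pagecache (we submitted it) but wasn't found there.
         */
        static void ra_thrash_check(struct file_ra_state *ra,
                                    unsigned long index)
        {
                if (index >= ra->start && index < ra->start + ra->size)
                        ra->next_size = MIN_READAHEAD; /* massive shrink */
        }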
