Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr

From: Curt Wohlgemuth
Date: Fri Jul 01 2011 - 18:55:42 EST


Hi Jan:

On Wed, Jun 29, 2011 at 12:15 PM, Jan Kara <jack@xxxxxxx> wrote:
> On Wed 29-06-11 13:55:34, Christoph Hellwig wrote:
>> On Wed, Jun 29, 2011 at 06:57:14PM +0200, Jan Kara wrote:
>> > > For sys_sync I'm pretty sure we could simply remove the
>> > > writeback_inodes_sb call and get just as good if not better performance,
>> >   Actually, it won't with current code. Because WB_SYNC_ALL writeback
>> > currently has the peculiarity that it looks like:
>> >   for all inodes {
>> >     write all inode data
>> >     wait for inode data
>> >   }
>> > while to achieve good performance we actually need something like
>> >   for all inodes
>> >     write all inode data
>> >   for all inodes
>> >     wait for inode data
>> > It makes a difference in an order of magnitude when there are lots of
>> > smallish files - SLES had a bug like this so I know from user reports ;)
>>
>> I don't think that's true.  The WB_SYNC_ALL writeback is done using
>> sync_inodes_sb, which operates as:
>>
>>   for all dirty inodes in bdi:
>>      if inode belongs to sb
>>         write all inode data
>>
>>   for all inodes in sb:
>>      wait for inode data
>>
>> we still do that in a big for each sb loop, though.
>  True but writeback_single_inode() has in it:
>        if (wbc->sync_mode == WB_SYNC_ALL) {
>                int err = filemap_fdatawait(mapping);
>                if (ret == 0)
>                        ret = err;
>        }
>  So we end up waiting much earlier. Probably we should remove this wait
> but that will need some audit I guess.

So today for WB_SYNC_ALL from sync_inodes_sb(), we do:
- queue a work item; this will
- write all dirty inodes in a sb
- write one inode's pages
- wait on all inode's pages
- wait for the work item
- wait on all inodes in the sb (wait_sb_inodes)

I guess the point of wait_sb_inodes() is to wait on all inodes that
were written in a previous writeback pass.

One other issue I have with sync as it's structured is that we don't
do a WB_SYNC_ALL pass on any inode that's only associated with a block
device, and not on a mounted filesystem. Blockdev mounts are
pseudo-mounts, and are explicitly skipped in __sync_filesystem(). So
if you've written directly to a block device and do a sync, the only
pass over the pages for this inode are via the
wakeup_flusher_threads() -- which operates on a BDI, regardless of the
superblock, and uses WB_SYNC_NONE.

All the sync_filesystem() calls are per-sb, not per-BDI, and they'll
exclude pseudo-superblocks.

I've seen cases in our modified kernels here at Google in which
lilo/shutdown failed because of a lack of WB_SYNC_ALL writeback for
/dev/sda (though I haven't been able to come up with a consistent test
case, nor reproduce this on an upstream kernel).

Thanks,
Curt

>
>> >   You mean that sync(1) would actually write the data itself? It would
>> > certainly make some things simpler but it has its problems as well - for
>> > example sync racing with flusher thread writing back inodes can create
>> > rather bad IO pattern...
>>
>> Only the second pass.  The idea is that we first try to use the flusher
>> threads for good I/O patterns, but if we can't get that to work only
>> block the caller and not everyone.  But that's just an idea so far,
>> it would need serious benchmark.  And despite what I claimed before
>> we actually do the wait in the caller context already anyway, which
>> already gives you the easy part of the above effect.
>  Modulo the writeback_single_inode() wait. But if that is dealt with I
> agree.
>
>                                                                Honza
> --
> Jan Kara <jack@xxxxxxx>
> SUSE Labs, CR
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/