Re: [PATCH] Memory management livelock

From: Nick Piggin
Date: Thu Oct 02 2008 - 22:54:49 EST


On Friday 03 October 2008 12:32, Nick Piggin wrote:
> On Wednesday 24 September 2008 08:49, Andrew Morton wrote:
> > On Tue, 23 Sep 2008 18:34:20 -0400 (EDT)
> >
> > Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
> > > > On Mon, 22 Sep 2008 17:10:04 -0400 (EDT)
> > > >
> > > > Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
> > > > > The bug happens when one process is doing sequential buffered
> > > > > writes to a block device (or file) and another process is
> > > > > attempting to execute sync(), fsync() or direct-IO on that device
> > > > > (or file). This syncing process will wait indefinitely until the
> > > > > first writing process finishes.
> > > > >
> > > > > For example, run these two commands:
> > > > > dd if=/dev/zero of=/dev/sda1 bs=65536 &
> > > > > dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
> > > > >
> > > > > The bug is caused by sequential walking of address space in
> > > > > write_cache_pages and wait_on_page_writeback_range: if some other
> > > > > process is constantly creating dirty and writeback pages while
> > > > > these functions run, the functions will wait on every new page,
> > > > > resulting in an indefinite wait.
>
> I think the problem has been misidentified, or else I have misread the
> code. See below. I hope I'm right, because I think the patches are pretty
> heavy on complexity in these already complex paths...
>
> It would help if you explicitly identify the exact livelock, i.e. give a
> sequence of behaviour that leads to our progress rate falling to zero.
>
> > > > Shouldn't happen. All the data-syncing functions should have an upper
> > > > bound on the number of pages which they attempt to write. In the
> > > > example above, we end up in here:
> > > >
> > > > int __filemap_fdatawrite_range(struct address_space *mapping,
> > > > 		loff_t start, loff_t end, int sync_mode)
> > > > {
> > > > 	int ret;
> > > > 	struct writeback_control wbc = {
> > > > 		.sync_mode = sync_mode,
> > > > 		.nr_to_write = mapping->nrpages * 2,	<<--
> > > > 		.range_start = start,
> > > > 		.range_end = end,
> > > > 	};
> > > >
> > > > so generic_file_direct_write()'s filemap_write_and_wait() will
> > > > attempt to write at most 2* the number of pages which are in cache
> > > > for that inode.
> > >
> > > See write_cache_pages:
> > >
> > > 		if (wbc->sync_mode != WB_SYNC_NONE)
> > > 			wait_on_page_writeback(page);		(1)
> > > 		if (PageWriteback(page) ||
> > > 		    !clear_page_dirty_for_io(page)) {
> > > 			unlock_page(page);			(2)
> > > 			continue;
> > > 		}
> > > 		ret = (*writepage)(page, wbc, data);
> > > 		if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
> > > 			unlock_page(page);
> > > 			ret = 0;
> > > 		}
> > > 		if (ret || (--(wbc->nr_to_write) <= 0))
> > > 			done = 1;
> > >
> > > --- so if it goes through points (1) and (2), the counter is not
> > > decremented, yet the function waits for the page. If there is a
> > > constant stream of writeback pages being generated, it waits on each
> > > of them --- that is, forever.
>
> *What* is, forever? Data integrity syncs should have pages operated on
> in-order, until we get to the end of the range. Circular writeback could
> go through again, possibly, but no more than once.

OK, I have been able to reproduce it somewhat. It is not a livelock,
but what is happening is that direct IO read basically does an fsync
on the file before performing the IO. The fsync gets stuck behind the
dd that is dirtying the pages, and ends up following behind it and
doing all its IO for it.
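
To see why, this is roughly the shape of the wait loop (simplified from
wait_on_page_writeback_range() in mm/filemap.c; error handling and the
pagevec sizing arithmetic elided). Each tagged lookup picks up whatever
pages the writer has put under writeback since the previous iteration,
so the syncer trails along behind the writer rather than stopping at a
snapshot of the file:

	pgoff_t index = start;
	struct pagevec pvec;

	pagevec_init(&pvec, 0);
	while (index <= end &&
	       pagevec_lookup_tag(&pvec, mapping, &index,
				PAGECACHE_TAG_WRITEBACK, PAGEVEC_SIZE)) {
		unsigned i;

		for (i = 0; i < pvec.nr; i++) {
			struct page *page = pvec.pages[i];

			if (page->index > end)
				continue;
			/* blocks until this page's writeback completes */
			wait_on_page_writeback(page);
		}
		pagevec_release(&pvec);
		cond_resched();
	}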

The following patch avoids the issue for direct IO by using range syncs
rather than trying to sync the whole file.

The underlying problem, I guess, is unchanged. Is it really a problem,
though? The way I'd love to solve it is actually by adding another bit
or two to the pagecache radix tree that can be used to transiently tag
the tree for future operations. That way we could record the dirty and
writeback pages up front, and then only operate on those.

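Something like this, as a rough sketch only (PAGECACHE_TAG_TOWRITE and
tag_pages_for_writeback() are made-up names, RADIX_TREE_MAX_TAGS would
need to grow to accommodate a third tag, and page refcounting is elided
because everything stays under tree_lock): before a data integrity sync
starts, copy the DIRTY tag into a transient TOWRITE tag, then have the
sync walk only TOWRITE pages. Pages dirtied after the tagging pass never
get the tag, so the walk cannot chase the writer.

	#define PAGECACHE_TAG_TOWRITE	2	/* hypothetical third tag */

	static void tag_pages_for_writeback(struct address_space *mapping,
					    pgoff_t start, pgoff_t end)
	{
		struct page *pages[PAGEVEC_SIZE];
		unsigned i, nr;

		do {
			spin_lock_irq(&mapping->tree_lock);
			nr = radix_tree_gang_lookup_tag(&mapping->page_tree,
					(void **)pages, start, PAGEVEC_SIZE,
					PAGECACHE_TAG_DIRTY);
			for (i = 0; i < nr; i++) {
				if (pages[i]->index > end) {
					nr = 0;	/* walked past the range */
					break;
				}
				radix_tree_tag_set(&mapping->page_tree,
						pages[i]->index,
						PAGECACHE_TAG_TOWRITE);
				start = pages[i]->index + 1;
			}
			spin_unlock_irq(&mapping->tree_lock);
		} while (nr == PAGEVEC_SIZE);
	}
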
That's *if* it really is a problem. I don't have much pity for someone
doing buffered IO and direct IO to the same pages of the same file :)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2008-10-03 11:21:31.000000000 +1000
+++ linux-2.6/mm/filemap.c	2008-10-03 12:00:17.000000000 +1000
@@ -1304,11 +1304,8 @@ generic_file_aio_read(struct kiocb
 			goto out; /* skip atime */
 		size = i_size_read(inode);
 		if (pos < size) {
-			retval = filemap_write_and_wait(mapping);
-			if (!retval) {
-				retval = mapping->a_ops->direct_IO(READ, iocb,
+			retval = mapping->a_ops->direct_IO(READ, iocb,
 						iov, pos, nr_segs);
-			}
 			if (retval > 0)
 				*ppos = pos + retval;
 			if (retval) {
@@ -2110,18 +2107,10 @@ generic_file_direct_write(struct kiocb *
 	if (count != ocount)
 		*nr_segs = iov_shorten((struct iovec *)iov, *nr_segs, count);
 
-	/*
-	 * Unmap all mmappings of the file up-front.
-	 *
-	 * This will cause any pte dirty bits to be propagated into the
-	 * pageframes for the subsequent filemap_write_and_wait().
-	 */
 	write_len = iov_length(iov, *nr_segs);
 	end = (pos + write_len - 1) >> PAGE_CACHE_SHIFT;
-	if (mapping_mapped(mapping))
-		unmap_mapping_range(mapping, pos, write_len, 0);
 
-	written = filemap_write_and_wait(mapping);
+	written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1);
 	if (written)
 		goto out;
 
@@ -2507,7 +2496,8 @@ generic_file_buffered_write(struct kiocb
 	 * the file data here, to try to honour O_DIRECT expectations.
 	 */
 	if (unlikely(file->f_flags & O_DIRECT) && written)
-		status = filemap_write_and_wait(mapping);
+		status = filemap_write_and_wait_range(mapping,
+					pos, pos + written - 1);
 
 	return written ? written : status;
 }