Re: [PATCH 04/11] mm: Add support for a filesystem to activate swapfiles and use direct_IO for writing swap pages

From: Mel Gorman
Date: Thu May 03 2012 - 10:14:18 EST


On Tue, May 01, 2012 at 03:53:08PM -0700, Andrew Morton wrote:
> > It is perfectly possible that direct_IO be used to read the swap
> > pages but it is an unnecessary complication. Similarly, it is possible
> > that ->writepage be used instead of direct_io to write the pages but
> > filesystem developers have stated that calling writepage from the VM
> > is undesirable for a variety of reasons and using direct_IO opens up
> > the possibility of writing back batches of swap pages in the future.
>
> This all seems a bit odd. And abusive.
>
> Yes, it would be more pleasing if direct-io was used for reading as
> well. How much more complication would it add?
>

Quite a bit.

Superficially it's easy because swap_readpage() just sets up a kiocb,
fills in the necessary details and call ->direct_IO. The complexity is
around page locking and writing back pending writes in NFS.

read_swap_cache_async() calls swap_readpage with the page locked and
is expected to return with the page unlocked on successful completion of
the IO.

For swap-over-nfs, the readpage handler behaves exactly as
read_swap_cache_async() expects. For everything else, submit_bio() is used
with end_swap_bio_read() unlocking the page. Both of these handlers behave
the same with respect to locking. The direct_IO handler does not expect the
page to be locked and does not unlock it itself. Even if it works for NFS,
there might be other complications in the future around page locking in
direct_IO handlers.

The second complexity may be specific to NFS. The NFS readpage handler
flushes any pending writes with nfs_wb_page() before doing the read which it
can do because it holds the page lock. It was completely unclear how the same
could be achieved from swap_readpage() in a filesystem-independent manner.

As ->readpage() already knew how to do the right thing in all cases, I
used it.

> If I understand correctly, on the read path we're taking a fresh page
> which is destined for swapcache and then pretending that it is a
> pagecache page for the purpose of the I/O?
>
> If there already existed a
> pagecache page for that file offset then we let it just sit there and
> bypass it?
>

On the read path read_swap_cache_async() checks if a page is already in
swapcache and if not not, allocates a new page, adds it to the swapcache
and calls swap_readpage. Hence I do not think we are tripping the
problem you are thinking of.

> I'm surprised that this works at all - I guess nothing under
> ->readpage() goes poking around in the address_space. For NFS, at
> least!
>
> >
> > ...
> >
> > @@ -93,11 +94,38 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> > {
> > struct bio *bio;
> > int ret = 0, rw = WRITE;
> > + struct swap_info_struct *sis = page_swap_info(page);
> >
> > if (try_to_free_swap(page)) {
> > unlock_page(page);
> > goto out;
> > }
> > +
> > + if (sis->flags & SWP_FILE) {
> > + struct kiocb kiocb;
> > + struct file *swap_file = sis->swap_file;
> > + struct address_space *mapping = swap_file->f_mapping;
> > + struct iovec iov = {
> > + .iov_base = page_address(page),
>
> Didn't we need to kmap the page?
>

.... Yep, that would be important all right. I'll look at this closely
and do a round of testing on x86-32.

> > + .iov_len = PAGE_SIZE,
> > + };
> > +
> > + init_sync_kiocb(&kiocb, swap_file);
> > + kiocb.ki_pos = page_file_offset(page);
> > + kiocb.ki_left = PAGE_SIZE;
> > + kiocb.ki_nbytes = PAGE_SIZE;
> > +
> > + unlock_page(page);
> > + ret = mapping->a_ops->direct_IO(KERNEL_WRITE,
> > + &kiocb, &iov,
> > + kiocb.ki_pos, 1);
>
> I wonder if there's any point in setting PG_writeback around the IO. I
> can't think of a reason.
>

One does not spring to mind.

> > + if (ret == PAGE_SIZE) {
> > + count_vm_event(PSWPOUT);
> > + ret = 0;
> > + }
> > + return ret;
> > + }
> > +
> > bio = get_swap_bio(GFP_NOIO, page, end_swap_bio_write);
> > if (bio == NULL) {
> > set_page_dirty(page);
> > @@ -119,9 +147,21 @@ int swap_readpage(struct page *page)
> > {
> > struct bio *bio;
> > int ret = 0;
> > + struct swap_info_struct *sis = page_swap_info(page);
> >
> > VM_BUG_ON(!PageLocked(page));
> > VM_BUG_ON(PageUptodate(page));
> > +
> > + if (sis->flags & SWP_FILE) {
> > + struct file *swap_file = sis->swap_file;
> > + struct address_space *mapping = swap_file->f_mapping;
> > +
> > + ret = mapping->a_ops->readpage(swap_file, page);
> > + if (!ret)
> > + count_vm_event(PSWPIN);
> > + return ret;
> > + }
>
> Confused. Where did we set up page->index with the file offset?
>

We don't use page->index in this case.

__add_to_swap_cache() records the swap entry in page->private.
nfs_readpage() looks up the page index with page_index() which for
SwapCache pages calls __page_file_index(). It in turn gets the swap
entry and looks up the index with swp_offset().

> > bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read);
> > if (bio == NULL) {
> > unlock_page(page);
> > @@ -133,3 +173,15 @@ int swap_readpage(struct page *page)
> > out:
> > return ret;
> > }
> > +
> > +int swap_set_page_dirty(struct page *page)
> > +{
> > + struct swap_info_struct *sis = page_swap_info(page);
> > +
> > + if (sis->flags & SWP_FILE) {
> > + struct address_space *mapping = sis->swap_file->f_mapping;
> > + return mapping->a_ops->set_page_dirty(page);
> > + } else {
> > + return __set_page_dirty_nobuffers(page);
> > + }
> > +}
>
> More confused. This is a swapcache page, not a pagecache page? Why
> are we running set_page_dirty() against it?
>

I don't really get the question. swap-over-NFS is not doing anything
different here than what we do today. PageSwapCache pages still have to
be marked dirty so they get written to disk before being discarded.

> And what are we doing on the !SWP_FILE path?

Maintaining existing behaviour. This is what the swap ops looks like
without the patchset

static const struct address_space_operations swap_aops = {
.writepage = swap_writepage,
.set_page_dirty = __set_page_dirty_nobuffers,
.migratepage = migrate_page,
};

> Newly setting PG_dirty
> against block-device-backed swapcache pages? Why? Where does it get
> cleared again?
>

clear_page_dirty_for_io() in vmscan.c#pageout() ? I might be missing
something in your question again :(

> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 9d3dd37..c25b9cf 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -26,7 +26,7 @@
> > */
> > static const struct address_space_operations swap_aops = {
> > .writepage = swap_writepage,
> > - .set_page_dirty = __set_page_dirty_nobuffers,
> > + .set_page_dirty = swap_set_page_dirty,
> > .migratepage = migrate_page,
> > };
> >
> > ...
> >

--
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/