Re: [PATCH v6 18/34] dio: Pin pages rather than ref'ing if appropriate

From: Al Viro
Date: Thu Jan 19 2023 - 00:09:22 EST


On Mon, Jan 16, 2023 at 11:10:11PM +0000, David Howells wrote:
> Convert the generic direct-I/O code to use iov_iter_extract_pages() instead
> of iov_iter_get_pages(). This will pin pages or leave them unaltered
> rather than getting a ref on them as appropriate to the iterator.
>
> The pages need to be pinned for DIO-read rather than having refs taken on
> them to prevent VM copy-on-write from malfunctioning during a concurrent
> fork() (the result of the I/O would otherwise end up only visible to the
> child process and not the parent).

Several observations:

1) fs/direct-io.c is ancient, grotty and has few remaining users.
The case of block devices got split off first; these days it's in
block/fops.c. Then iomap-using filesystems went to fs/iomap/direct-io.c,
leaving this sucker used only by affs, ext2, fat, exfat, hfs, hfsplus, jfs,
nilfs2, ntfs3, reiserfs, udf and ocfs2. And frankly, the sooner it dies
the better off we are. IOW, you've picked an uninteresting part and left
the important ones untouched.

2) if you look at the "should_dirty" logics in either of those (including
fs/direct-io.c itself) you'll see a funny thing. First of all,
dio->should_dirty (or its counterparts) gets set iff we have a user-backed
iter and operation is a read. I.e. precisely the case when you get bio
marked with BIO_PAGE_PINNED. And that's not an accident - look at the
places where we check that predicate: dio_bio_submit() calls
bio_set_pages_dirty() if that predicate is true before submitting the
sucker and dio_bio_complete() uses it to choose between bio_check_pages_dirty()
and bio_release_pages() + bio_put().

Look at bio_check_pages_dirty() - it checks if any of the pages we were
reading into had managed to lose the dirty bit; if none had it does
bio_release_pages(bio, false) + bio_put(bio) and returns. If some had,
it shoves bio into bio_dirty_list and arranges for bio_release_pages(bio, true)
+ bio_put(bio) called from upper half (via schedule_work()). The effect
of the second argument of bio_release_pages() is to (re)dirty the pages;
it can't be done from interrupt, so we have to defer it to process context.

Now, do we need to redirty anything there? Recall that page pinning had
been introduced precisely to prevent writeback while the page is getting
DMA into it. Who is going to mark it clean before we unpin it?

Unless I misunderstand something fundamental about the whole thing,
this crap should become useless with that conversion. And it's not just
the ->should_dirty and its equivalents - bio_check_pages_dirty() and
the stuff around it should also be gone once block/fops.c and
fs/iomap/direct-io.c are switched to your iov_iter_extract_pages.
Moreover, that way the only places legitimately passing true to
bio_release_pages() are blk_rq_unmap_user() (on completion of
bypass read request mapping a user page) and __blkdev_direct_IO_simple()
(on completion of short sync O_DIRECT read from block device).
Both could just as well call bio_set_pages_dirty(bio) +
bio_release_pages(bio, false), killing the "dirty on release" logics
and losing the 'mark_dirty' argument.

BTW, where do we dirty the pages on IO_URING_OP_READ_FIXED with
O_DIRECT file? AFAICS, bio_set_pages_dirty() won't be called
(ITER_BVEC iter) and neither will bio_release_pages() do anything
(BIO_NO_PAGE_REF set on the bio by bio_iov_bvec_set() called
due to the same ITER_BVEC iter). Am I missing something trivial
here? Jens?