Re: [RFC PATCH v2 2/5] iomap: Add initial support for buffered RWF_WRITETHROUGH

From: Jan Kara

Date: Wed Apr 22 2026 - 06:01:27 EST

On Tue 21-04-26 23:37:01, Ojaswin Mujoo wrote:
> On Mon, Apr 20, 2026 at 01:28:18PM +0200, Jan Kara wrote:
> > On Sat 18-04-26 01:12:22, Ojaswin Mujoo wrote:
> > > On Thu, Apr 16, 2026 at 02:34:15PM +0200, Jan Kara wrote:
> > > > > @@ -1096,6 +1097,276 @@ static bool iomap_write_end(struct iomap_iter *iter, size_t len, size_t copied,
> > > > > +static int iomap_writethrough_iter(struct iomap_writethrough_ctx *wt_ctx,
> > > > > + struct iomap_iter *iter, struct iov_iter *i,
> > > > > + const struct iomap_writethrough_ops *wt_ops)
> > > > > +
> > > > > +{
>
> <...>
>
> > > your comment but) after this email, I started diggin a bit more into why
> > > it is needed. As per my understanding, it tackles 2 things:
> > >
> > > Problem 1. mkclean's the old EOF folio so that the FS can fault again. This
> > > allows us to allocate new blocks which previously might not be allocated
> > > if bs < ps.
> > >
> > > Problem 2. Since mmap writes can dirty data beyond EOF, we zero the range from
> > > old EOF to end of that folio so that readers dont read junk data after
> > > isize extension.
> >
> > Correct.
> >
> > > Another thing I noticed is that most users of
> > > iomap_file_buffered_write() do their own eof zeroing in the FS layer
> > > (eg, xfs_file_write_zero_eof(), ext4's new changes,
> > > ntfs_extend_initialized_size() etc).
> > > I think this FS level zerooing should take care of mkcleaning the eof
> > > folio (problem 1), as they call iomap_zero_range() which would flush the
> > > eof range anyways. So am I right in assuming that for FSes that do their
> > > own zeroing, 1. is already taken care of?
> >
> > Well, I don't see anything that would writeprotect the old tail page in
> > iomap_zero_range(). I think iomap_zero_range() calls are there mostly to
> > address 2. Not only due to mmap but also possibly to clear whatever junk
> > there can be in the blocks after EOF.
>
> Well I was thinking more like if the EOF page was mmap'd it would be
> dirty and blocks beyond EOF would be unmapped, so iomap_zero_range() will
> write it back which shall mkclean() the folio.
>
> But I think the same race we discussed for problem 2 can also occur
> here.
>
> Thread 1 (extending write) Thread 2 (mmap writer)
>
> iomap_zero_range()
> filemap_write_and_wait_range()
> // mmaps & writes EOF range
> iomap_write_iter()
> isize = new_size
> // pagecache_isize_extended() is
> needed to mkclean() old EOF page.

Yes, this race exists and unlike in the case of zeroing where it is mostly
harmless not guranteeing calling page_mkwrite() with updated i_size can
lead to filesystem tripping on assertions, data loss or similar.

> > > As for 2, I think after the EOF zeroing of the FS, there might be a
> > > window before iomap_write_iter() where an mmap writer can still dirty
> > > EOF blocks, hence the pagecache_isize_extended() would be needed here.
> > > But doesn't that then make the eof zeroing in the FS layer redundant? Am
> > > I missing something here?
> >
> > Hmm, I agree the zeroing looks duplicit (for some users of
> > pagecache_isize_extended()). And yes, doing the zeroing from
> > xfs_file_write_zero_eof() is somewhat racy (mmap writer can still come and
> > write non-zeros before we update i_size) but I'd have hard time to argue it
> > really practically matters - you are racing mmap writes with buffered
> > writes so any kind of write atomicity guarantees are not there.
>
> Yeah, seems like it is not enough to take care of either 1 or 2 and
> pagecache_isize_extended() should maybe be enough. I was just wondering
> if we could optimize it away even for normal extend path (no racing mmap),
> we can avoid the expensive folio_zero_range() calls.
>
> Regardless, Ive not looked at this more closely and its a separate issue
> so we can revisit it later. For now I wanted some clarity around
> pagecache_isize_extended() so thanks for that.

Well, but pagecache_isize_extended() doesn't guarantee on disk blocks are
zeroed out as well as it doesn't dirty the page. Also
xfs_file_write_zero_eof() potentially handles zeroing of more than a tail
page. So you cannot simply drop one of these.

> > > Regardless, for our case I think we will also need to do the
> > > pagecache_isize_extended(), mainly to take care of problem 2, but where
> > > exactly should we do it now? We currently change the isize in endio()
> > > but for aio, it can run outside inode or folio lock. I think this
> > > function needs to be called under inode lock(). Hmm.. its a bit late here so
> > > I'll revisit this tomorrow with a fresh mind :)
> >
> > I think mainly to take care of problem 1... You are correct about
> > inode_lock but since we are updating i_size, we should be better holding
> > it, shouldn't we?
>
> Yes you are correct. In the aio writethrough codepath, the inode update
> is happening without the inode lock which is wrong. I overlooked the
> fact that even aio dio uses IOMAP_DIO_FORCE_WAIT to force isize update
> under inode lock, and we should do something similar as well.

Yes.

> So in v3, I make the change that for extending writes we shall always
> finish them in "sync" fashion so ->endio() runs under inode lock. Then,
> after ->endio() in iomap_dio_complete(), I will call
> pagecache_isize_extended() to take care of this. Just like isize update
> right now, the isize_extension only runs when the IO was successful
> otherwise we return an error to the user. This gives us semantics like
> dio while handling extension properly.
>
> Does that sound okay?

Yep, sounds fine.

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR