Re: [PATCH v9 1/2] fs: New zonefs file system

From: Damien Le Moal
Date: Wed Jan 29 2020 - 22:00:43 EST


On Thu, 2020-01-30 at 08:33 +1100, Dave Chinner wrote:
> On Wed, Jan 29, 2020 at 01:06:29PM +0000, Damien Le Moal wrote:
> > On Tue, 2020-01-28 at 09:46 -0800, Darrick J. Wong wrote:
> > > > +static int zonefs_io_err_cb(struct blk_zone *zone, unsigned int idx, void *data)
> > > > +{
> > > > + struct zonefs_ioerr_data *ioerr = data;
> > > > + struct inode *inode = ioerr->inode;
> > > > + struct zonefs_inode_info *zi = ZONEFS_I(inode);
> > > > + struct super_block *sb = inode->i_sb;
> > > > + loff_t isize, wp_ofst;
> > > > +
> > > > + /*
> > > > + * The condition of the zone may have change. Fix the file access
> > > > + * permissions if necessary.
> > > > + */
> > > > + zonefs_update_file_perm(inode, zone);
> > > > +
> > > > + /*
> > > > + * There is no write pointer on conventional zones and read operations
> > > > + * do not change a zone write pointer. So there is nothing more to do
> > > > + * for these two cases.
> > > > + */
> > > > + if (zi->i_ztype == ZONEFS_ZTYPE_CNV || !ioerr->write)
> > > > + return 0;
> > > > +
> > > > + /*
> > > > + * For sequential zones write, make sure that the zone write pointer
> > > > + * position is as expected, that is, in sync with the inode size.
> > > > + */
> > > > + wp_ofst = (zone->wp - zone->start) << SECTOR_SHIFT;
> > > > + zi->i_wpoffset = wp_ofst;
> > > > + isize = i_size_read(inode);
> > > > +
> > > > + if (isize == wp_ofst)
> > > /> + return 0;
> > > > +
> > > > + /*
> > > > + * The inode size and the zone write pointer are not in sync.
> > > > + * If the inode size is below the zone write pointer, then data was
> > >
> > > I'm a little confused about what events these states reflect.
> > >
> > > "inode size is below the zone wp" -- let's say we have a partially
> > > written sequential zone:
> > >
> > > isize
> > > ----v---------------
> > > DDDDD
> > > ----^---------------
> > > WP
> > >
> > > Then we tried to write to the end of the sequential zone:
> > >
> > > isize
> > > ----v---------------
> > > DDDDDWWWW
> > > ----^---------------
> > > WP
> > >
> > > Then an error happens so we didn't update the isize, and now we see that
> > > the write pointer is beyond isize (pretend the write failed to the '?'
> > > area):
> > >
> > > isize
> > > ----v---------------
> > > DDDDDD?DD
> > > --------^-----------
> > > WP
> >
> > If the write failed at the "?" location, then the zone write pointer
> > points to that location since nothing after that location can be
> > written unless that location itself is first written.
> >
> > So with your example, the drive will give back:
> >
> > isize
> > ----v---------------
> > DDDDDD?XX
> > ------^-------------
> > WP
> >
> > With XX denoting the unwritten part of the issued write.
> >
> > > So if we increase isize to match the WP, what happens when userspace
> > > tries to read the question-mark area? Do they get read errors? Stale
> > > contents?
> >
> > Nope, see above: the write pointer always point to the sector following
> > the last sector correctly written. So increasing isize to the write
> > pointer location only exposes the data that actually was written and is
> > readable. No stale data.
> > > Or am I misunderstanding SMR firmware, and the drive only advances the
> > > write pointer once it has written a block? i.e. if a write fails in
> > > the middle, the drive ends up in this state, not the one I drew above:
> > >
> > > isize
> > > ----v---------------
> > > DDDDDD?
> > > -----^--------------
> > > WP
> > >
> > > In which case it would be fine to push isize up to the write pointer?
> >
> > Exactly. This is how the ZBC & ZAC (and upcoming ZNS) specifications
> > define the write pointer behavior. That makes error recovery a lot
> > easier and does not result in stale data accesses. Just notice the one-
> > off difference for the WP position from your example as WP will be
> > pointing at the error location, not the last written location. Indexing
> > from 0, we get (wp - zone start) always being isize with all written
> > and readable data in the sector range between zone start and zone write
> > pointer.
>
> Ok, I'm going throw a curve ball here: volatile device caches.
>
> How does the write pointer updates interact with device write
> caches? i.e. the first write could be sitting in the device write
> cache, and the OS write pointer has been advanced. Then another write
> occurs, the device decides to write both to physical media, and it
> gets a write error in the area of the first write that only hit the
> volatile cache.
>
> So does this mean that, from the POV of the OS, the device zone
> write pointer has gone backwards?

You are absolutely correct. Forgot to consider this case.
Nice pitching :)

> Unless there's some other magic that ensures device cached writes
> that have been signalled as successfully completed to the OS
> can never fail or that sequential zone writes are never cached in
> volatile memory in drives, I can't see how the above guarantees
> can be provided.

There not, at least from the standards point of view. Such guarantees
would be device implementation dependent and so we cannot rely on
anything in this regard. The write pointer ending up below the position
of the last issue direct IO is thus a possibility and not necessarily
indicative of an external action (and we actually cannot distinguish
which case it really is).

And looking at the code again, I need to add error processing in fsync
to catch this case.

> > It is hard to decide on the best action to take here considering the
> > simple nature of zonefs (i.e. another better interface to do raw block
> > device file accesses). Including your comments on mount options, I cam
> > up with these actions that the user can choose with mount options:
> > * repair: Truncate the inode size only, nothing else
> > * remount-ro (default): Truncate the inode size and remount read-only
> > * zone-ro: Truncate the inode size and set the inode read-only
> > * zone-offline: Truncate the inode size to 0 and assume that its zone
> > is offline (no reads nor writes possible).
> >
> > This gives I think a good range of possible behaviors that the user may
> > want, from almost nothing (repair) to extreme to avoid accessing bad
> > data (zone-offline).
>
> I would suggest that this is something that can be added later as it
> is not critical to supporting the underlying functionality. Right
> now I'd just pick the safest option: shutdown to protect what data
> is on the storage right now and then let the user take action to
> recover/fix the issue.

By shutdown, do you mean remounting read-only ? Or do you mean
something more aggressive like preventing all accesses and changes to
files, i.e. assuming all zones are offline ? The former is already
there and is the default.

>
> > > > + * BIO allocations for the same device. The former case may end up in
> > > > + * a deadlock on the inode truncate mutex, while the latter may prevent
> > > > + * forward progress with BIO allocations as we are potentially still
> > > > + * holding the failed BIO. Executing the report zones under GFP_NOIO
> > > > + * avoids both problems.
> > > > + */
> > > > + noio_flag = memalloc_noio_save();
> > >
> > > Don't you still need memalloc_nofs_ here too?
> >
> > noio implies nofs, doesn't it ? Or rather, noio is more restrictive
> > than nofs here. Which is safer since we need a struct request to be
> > able to execute blkdev_report_zones().
>
> Correct, noio implies nofs.
>
> Cheers,
>
> Dave.

Thanks !


--
Damien Le Moal
Western Digital Research