Re: [PATCH v7 4/6] dax: add DAX_RECOVERY flag and .recovery_write dev_pgmap_ops

From: Jane Chu
Date: Wed Apr 06 2022 - 15:46:14 EST


On 4/5/2022 10:19 PM, Christoph Hellwig wrote:
> On Tue, Apr 05, 2022 at 01:47:45PM -0600, Jane Chu wrote:
>> Introduce DAX_RECOVERY flag to dax_direct_access(). The flag is
>> not set by default in dax_direct_access() such that the helper
>> does not translate a pmem range to kernel virtual address if the
>> range contains uncorrectable errors. When the flag is set,
>> the helper ignores the UEs and return kernel virtual adderss so
>> that the caller may get on with data recovery via write.
>>
>> Also introduce a new dev_pagemap_ops .recovery_write function.
>> The function is applicable to FSDAX device only. The device
>> page backend driver provides .recovery_write function if the
>> device has underlying mechanism to clear the uncorrectable
>> errors on the fly.
>
> I know Dan suggested it, but I still think dev_pagemap_ops is the very
> wrong choice here. It is about VM callbacks to ZONE_DEVICE owners
> independent of what pagemap type they are. .recovery_write on the
> other hand is completely specific to the DAX write path and has no
> MM interactions at all.

Yes, I believe Dan was motivated by avoiding the dm dance as a result of
adding .recovery_write to dax_operations.

I understand your point about .recovery_write is device specific and
thus not something appropriate for device agnostic ops.

I can see 2 options so far -

1) add .recovery_write to dax_operations and do the dm dance to hunt
down to the base device that actually provides the recovery action

2) an ugly but expedient approach based on the observation that
dax_direct_access() has already gone through the dm dance and thus could
scoop up the .recovery_write function pointer if DAX_RECOVERY flag is
set. Like bundle action-flag with action, and if should there need more
device specific actions, just add another action with associated flag.

I'm thinking about something like this

long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
long nr_pages, struct daxdev_specific *action,
int flags, void **kaddr, pfn_t *pfn)

where
struct daxdev_specific {
int flags; /* DAX_RECOVERY, etc */
size_t (*recovery_write) (pfn_t pfn, pgoff_t pgoff, void *addr,
size_t bytes, void *iter);
}

__pmem_direct_access() provides the .recovery_write function pointer;
dax_iomap_iter() ends up directly invoke the function in pmem.c
which finds pgmap from pfn_t, and (struct pmem *) from
pgmap->owner;

In this way, we get rid of dax_recovery_write() interface as well as the
dm dance.

What do you think?

Dan, could you also chime in ?

>
>> /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
>> __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
>> - long nr_pages, void **kaddr, pfn_t *pfn)
>> + long nr_pages, int flags, void **kaddr, pfn_t *pfn)
>> {
>> resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
>> + sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT;
>> + unsigned int num = PFN_PHYS(nr_pages) >> SECTOR_SHIFT;
>> + struct badblocks *bb = &pmem->bb;
>> + sector_t first_bad;
>> + int num_bad;
>> + bool bad_in_range;
>> + long actual_nr;
>> +
>> + if (!bb->count)
>> + bad_in_range = false;
>> + else
>> + bad_in_range = !!badblocks_check(bb, sector, num, &first_bad, &num_bad);
>>
>> - if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512,
>> - PFN_PHYS(nr_pages))))
>> + if (bad_in_range && !(flags & DAX_RECOVERY))
>> return -EIO;
>
> The use of bad_in_range here seems a litle convoluted. See the attached
> patch on how I would structure the function to avoid the variable and
> have the reocvery code in a self-contained chunk.

Much better, will use your version, thanks!

>
>> - map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
>> - &kaddr, NULL);
>> + nrpg = PHYS_PFN(size);
>> + map_len = dax_direct_access(dax_dev, pgoff, nrpg, 0, &kaddr, NULL);
>
> Overly long line here.

Okay, will run the checkpatch.pl test again.

thanks!
-jane