Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition

From: Dan Williams
Date: Mon Oct 24 2022 - 14:32:50 EST


Dave Chinner wrote:
> On Fri, Oct 21, 2022 at 07:11:02PM -0700, Darrick J. Wong wrote:
> > On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote:
> > > In addition, I don't like your idea about the test change because it will
> > > make generic/470 become the special test for XFS. Do you know if we can fix
> > > the issue by changing the test in another way? blkdiscard -z can fix the
> > > issue because it does zero-fill rather than discard on the block device.
> > > However, blkdiscard -z will take a lot of time when the block device is
> > > large.
> >
> > Well we /could/ just do that too, but that will suck if you have 2TB of
> > pmem. ;)
> >
> > Maybe as an alternative path we could just create a very small
> > filesystem on the pmem and then blkdiscard -z it?
> >
> > That said -- does persistent memory actually have a future? Intel
> > scuttled the entire Optane product, cxl.mem sounds like expansion
> > chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel
> > asserts everywhere) and 6.1 (every time I run fstests now I see massive
> > data corruption).
>
> Yup, I see the same thing. fsdax was a train wreck in 6.0 - broken
> on both ext4 and XFS. Now that I run a quick check on 6.1-rc1, I
> don't think that has changed at all - I still see lots of kernel
> warnings, data corruption and "XFS_IOC_CLONE_RANGE: Invalid
> argument" errors.
>
> If I turn off reflink, then instead of data corruption I get kernel
> warnings like this from fsx and fsstress workloads:
>
> [415478.558426] ------------[ cut here ]------------
> [415478.560548] WARNING: CPU: 12 PID: 1515260 at fs/dax.c:380 dax_insert_entry+0x2a5/0x320
> [415478.564028] Modules linked in:
> [415478.565488] CPU: 12 PID: 1515260 Comm: fsx Tainted: G W 6.1.0-rc1-dgc+ #1615
> [415478.569221] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> [415478.572876] RIP: 0010:dax_insert_entry+0x2a5/0x320
> [415478.574980] Code: 08 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 58 20 48 8d 53 01 e9 65 ff ff ff 48 8b 58 20 48 8d 53 01 e9 50 ff ff ff <0f> 0b e9 70 ff ff ff 31 f6 4c 89 e7 e8 da ee a7 00 eb a4 48 81 e6
> [415478.582740] RSP: 0000:ffffc90002867b70 EFLAGS: 00010002
> [415478.584730] RAX: ffffea000f0d0800 RBX: 0000000000000001 RCX: 0000000000000001
> [415478.587487] RDX: ffffea0000000000 RSI: 000000000000003a RDI: ffffea000f0d0840
> [415478.590122] RBP: 0000000000000011 R08: 0000000000000000 R09: 0000000000000000
> [415478.592380] R10: ffff888800dc9c18 R11: 0000000000000001 R12: ffffc90002867c58
> [415478.594865] R13: ffff888800dc9c18 R14: ffffc90002867e18 R15: 0000000000000000
> [415478.596983] FS: 00007fd719fa2b80(0000) GS:ffff88883ec00000(0000) knlGS:0000000000000000
> [415478.599364] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [415478.600905] CR2: 00007fd71a1ad640 CR3: 00000005cf241006 CR4: 0000000000060ee0
> [415478.602883] Call Trace:
> [415478.603598] <TASK>
> [415478.604229] dax_fault_iter+0x240/0x600
> [415478.605410] dax_iomap_pte_fault+0x19c/0x3d0
> [415478.606706] __xfs_filemap_fault+0x1dd/0x2b0
> [415478.607744] __do_fault+0x2e/0x1d0
> [415478.608587] __handle_mm_fault+0xcec/0x17b0
> [415478.609593] handle_mm_fault+0xd0/0x2a0
> [415478.610517] exc_page_fault+0x1d9/0x810
> [415478.611398] asm_exc_page_fault+0x22/0x30
> [415478.612311] RIP: 0033:0x7fd71a04b9ba
> [415478.613168] Code: 4d 29 c1 4c 29 c2 48 3b 15 db 95 11 00 0f 87 af 00 00 00 0f 10 01 0f 10 49 f0 0f 10 51 e0 0f 10 59 d0 48 83 e9 40 48 83 ea 40 <41> 0f 29 01 41 0f 29 49 f0 41 0f 29 51 e0 41 0f 29 59 d0 49 83 e9
> [415478.617083] RSP: 002b:00007ffcf277be18 EFLAGS: 00010206
> [415478.618213] RAX: 00007fd71a1a3fc5 RBX: 0000000000000fc5 RCX: 00007fd719f5a610
> [415478.619854] RDX: 000000000000964b RSI: 00007fd719f50fd5 RDI: 00007fd71a1a3fc5
> [415478.621286] RBP: 0000000000030fc5 R08: 000000000000000e R09: 00007fd71a1ad640
> [415478.622730] R10: 0000000000000001 R11: 00007fd71a1ad64e R12: 0000000000009699
> [415478.624164] R13: 000000000000a65e R14: 00007fd71a1a3000 R15: 0000000000000001
> [415478.625600] </TASK>
> [415478.626087] ---[ end trace 0000000000000000 ]---
>
> Even generic/247 is generating a warning like this from xfs_io,
> which is a mmap vs DIO racer. Given that DIO doesn't exist for
> fsdax, this test turns into just a normal write() vs mmap() racer.
>
> Given these are the same fsdax infrastructure failures that I
> reported for 6.0, it is also likely that ext4 is still throwing
> them. IOWs, whatever got broke in the 6.0 cycle wasn't fixed in the
> 6.1 cycle.
>
> > Frankly at this point I'm tempted just to turn of fsdax support for XFS
> > for the 6.1 LTS because I don't have time to fix it.
>
> /me shrugs
>
> Backporting fixes (whenever they come along) is a problem for the
> LTS kernel maintainer to deal with, not the upstream maintainer.
>
> IMO, the issue right now is that the DAX maintainers seem to have
> little interest in ensuring that the FSDAX infrastructure actually
> works correctly. If anything, they seem to want to make things
> harder for block based filesystems to use pmem devices and hence
> FSDAX. e.g. the direction of the DAX core away from block interfaces
> that filesystems need for their userspace tools to manage the
> storage.
>
> At what point do we simply say "the experiment failed, FSDAX is
> dead" and remove it from XFS altogether?

A fair question, given the regressions made it all the way into
v6.0-final. In retrospect I made the wrong priority call to focus on dax
page reference counting these past weeks.

When I fired up the dax unit tests on v6.0-rc1 I found basic problems
with the notify failure patches that concerned me that they had never
been tested after the final version was merged [1]. Then the rest of the
development cycle was spent fixing dax reference counting [2]. That was
a longstanding wishlist item from gup and folio developers, but, as I
said, that seems the wrong priority given the lingering regressions. I
will take a look the current dax-xfstests regression backlog. That may
find a need to consider reverting the problematic commits depending on
what is still broken if the fixes are trending towards being invasive.

[1]: https://lore.kernel.org/all/166153426798.2758201.15108211981034512993.stgit@xxxxxxxxxxxxxxxxxxxxxxxxx/

[2]: https://lore.kernel.org/all/166579181584.2236710.17813547487183983273.stgit@xxxxxxxxxxxxxxxxxxxxxxxxx/