Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition

From: Dan Williams
Date: Thu Oct 27 2022 - 21:37:53 EST


Darrick J. Wong wrote:
> [add tytso to cc since he asked about "How do you actually /get/ fsdax
> mode these days?" this morning]
>
> On Tue, Oct 25, 2022 at 10:56:19AM -0700, Darrick J. Wong wrote:
> > On Tue, Oct 25, 2022 at 02:26:50PM +0000, ruansy.fnst@xxxxxxxxxxx wrote:
> > >
> > >
> > > > On 2022/10/24 13:31, Dave Chinner wrote:
> > > > On Mon, Oct 24, 2022 at 03:17:52AM +0000, ruansy.fnst@xxxxxxxxxxx wrote:
> > > >> On 2022/10/24 6:00, Dave Chinner wrote:
> > > >>> On Fri, Oct 21, 2022 at 07:11:02PM -0700, Darrick J. Wong wrote:
> > > >>>> On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote:
> > > >>>>> In addition, I don't like your idea about the test change because it
> > > >>>>> will make generic/470 an XFS-specific test. Do you know if we can fix
> > > >>>>> the issue by changing the test in another way? blkdiscard -z can fix
> > > >>>>> the issue because it does a zero-fill rather than a discard on the
> > > >>>>> block device. However, blkdiscard -z will take a lot of time when the
> > > >>>>> block device is large.
> > > >>>>
> > > >>>> Well we /could/ just do that too, but that will suck if you have 2TB of
> > > >>>> pmem. ;)
> > > >>>>
> > > >>>> Maybe as an alternative path we could just create a very small
> > > >>>> filesystem on the pmem and then blkdiscard -z it?
> > > >>>>
> > > >>>> That said -- does persistent memory actually have a future? Intel
> > > >>>> scuttled the entire Optane product, cxl.mem sounds like expansion
> > > >>>> chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel
> > > >>>> asserts everywhere) and 6.1 (every time I run fstests now I see massive
> > > >>>> data corruption).
> > > >>>
> > > >>> Yup, I see the same thing. fsdax was a train wreck in 6.0 - broken
> > > >>> on both ext4 and XFS. Now that I run a quick check on 6.1-rc1, I
> > > >>> don't think that has changed at all - I still see lots of kernel
> > > >>> warnings, data corruption and "XFS_IOC_CLONE_RANGE: Invalid
> > > >>> argument" errors.
> > > >>
> > > >> Firstly, I think the "XFS_IOC_CLONE_RANGE: Invalid argument" error is
> > > >> caused by the restrictions which prevent reflink from working together
> > > >> with DAX:
> > > >>
> > > >> a. fs/xfs/xfs_ioctl.c:1141
> > > >> /* Don't allow us to set DAX mode for a reflinked file for now. */
> > > >> if ((fa->fsx_xflags & FS_XFLAG_DAX) && xfs_is_reflink_inode(ip))
> > > >> return -EINVAL;
> > > >>
> > > >> b. fs/xfs/xfs_iops.c:1174
> > > >> /* Only supported on non-reflinked files. */
> > > >> if (xfs_is_reflink_inode(ip))
> > > >> return false;
> > > >>
> > > >> These restrictions were removed in the "drop experimental warning"
> > > >> patch[1]. I think they should be separated from that patch.
> > > >>
> > > >> [1]
> > > >> https://lore.kernel.org/linux-xfs/1663234002-17-1-git-send-email-ruansy.fnst@xxxxxxxxxxx/
> > > >>
> > > >>
> > > >> Secondly, how did the data corruption happen?
> > > >
> > > > No idea - I'm just reporting that lots of fsx tests failed with data
> > > > corruptions. I haven't had time to look at why, I'm still trying to
> > > > sort out the fix for a different data corruption...
> > > >
> > > >> Or which case failed?
> > > >
> > > > *lots* of them failed with kernel warnings with reflink turned off:
> > > >
> > > > SECTION -- xfs_dax_noreflink
> > > > =========================
> > > > Failures: generic/051 generic/068 generic/075 generic/083
> > > > generic/112 generic/127 generic/198 generic/231 generic/247
> > > > generic/269 generic/270 generic/340 generic/344 generic/388
> > > > generic/461 generic/471 generic/476 generic/519 generic/561 xfs/011
> > > > xfs/013 xfs/073 xfs/297 xfs/305 xfs/517 xfs/538
> > > > Failed 26 of 1079 tests
> > > >
> > > > All of those except xfs/073 and generic/471 are failures due to
> > > > warnings found in dmesg.
> > > >
> > > > With reflink enabled, I terminated the run after g/075, g/091, g/112
> > > > and generic/127 reported fsx data corruptions and g/051, g/068,
> > > > g/075 and g/083 had reported kernel warnings in dmesg.
> > > >
> > > >> Could
> > > >> you give me more info (such as mkfs options, xfstests configs)?
> > > >
> > > > They are exactly the same as last time I reported these problems.
> > > >
> > > > For the "no reflink" test issues:
> > > >
> > > > mkfs options are "-m reflink=0,rmapbt=1", mount options "-o
> > > > dax=always" for both filesystems. Config output at start of test
> > > > run:
> > > >
> > > > SECTION -- xfs_dax_noreflink
> > > > FSTYP -- xfs (debug)
> > > > PLATFORM -- Linux/x86_64 test3 6.1.0-rc1-dgc+ #1615 SMP PREEMPT_DYNAMIC Wed Oct 19 12:24:16 AEDT 2022
> > > > MKFS_OPTIONS -- -f -m reflink=0,rmapbt=1 /dev/pmem1
> > > > MOUNT_OPTIONS -- -o dax=always -o context=system_u:object_r:root_t:s0 /dev/pmem1 /mnt/scratch
> > > >
> > > > pmem devices are a pair of fake 8GB pmem regions set up by kernel
> > > > CLI via "memmap=8G!15G,8G!24G". I don't have anything special set up
> > > > - the kernel config is kept minimal for these VMs - and the only
> > > > kernel debug option I have turned on for these specific test runs is
> > > > CONFIG_XFS_DEBUG=y.
> > >
> > > Thanks for the detailed info. But in my environment (and my
> > > colleagues', and our real server with DCPMM), the failure cases you
> > > mentioned above (in dax+non_reflink mode, with the same test options)
> > > do not reproduce.
> > >
> > > Here's our test environment info:
> > > - Ruan's env: Fedora 36 (v6.0-rc1) on kvm, pmem 2x4G: file-backed
> > > - Yang's env: Fedora 35 (v6.1-rc1) on kvm, pmem 2x1G: memmap=1G!1G,1G!2G
> > > - Server's : Ubuntu 20.04 (v6.0-rc1) real machine, pmem 2x4G: real DCPMM
> > >
> > > (To quickly confirm the difference, I just ran the 26 failed cases you
> > > mentioned above.) Except for generic/471 and generic/519, which failed
> > > even when dax is off, the rest passed.
> > >
> > >
> > > We don't want fsdax to be turned off. Right now, I think the most
> > > important thing is solving the failed cases in dax+non_reflink mode.
> > > So, firstly, I have to reproduce those failures. Is there anything
> > > wrong with my test environments? I know you are using 'memmap=XXG!YYG'
> > > to simulate pmem. So, (to Darrick) could you show me the config of your
> > > dev environment and of the 'testcloud' (I am guessing it's a server
> > > with real nvdimm just like ours)?
> >
> > Nope. Since the announcement of pmem as a product, I have had 15
> > minutes of access to one preproduction prototype server with actual
> > Optane DIMMs in it.
> >
> > I have /never/ had access to real hardware to test any of this, so it's
> > all configured via libvirt to simulate pmem in qemu:
> > https://lore.kernel.org/linux-xfs/YzXsavOWMSuwTBEC@magnolia/
> >
> > /run/mtrdisk/[gh].mem are both regular files on a tmpfs filesystem:
> >
> > $ grep mtrdisk /proc/mounts
> > none /run/mtrdisk tmpfs rw,relatime,size=82894848k,inode64 0 0
> >
> > $ ls -la /run/mtrdisk/[gh].mem
> > -rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 18:09 /run/mtrdisk/g.mem
> > -rw-r--r-- 1 libvirt-qemu kvm 10739515392 Oct 24 19:28 /run/mtrdisk/h.mem
>
> Also forgot to mention that the VM with the fake pmem attached has a
> script to do:
>
> ndctl create-namespace --mode fsdax --map dev -e namespace0.0 -f
> ndctl create-namespace --mode fsdax --map dev -e namespace1.0 -f
>
> Every time the pmem device gets recreated, because apparently that's the
> only way to get S_DAX mode nowadays?

If you have noticed a change here, it is due to VM configuration, not
anything in the driver.

If you are interested, there are two ways to get pmem declared: the
legacy way that predates any of the DAX work, which the kernel calls
E820_PRAM, and the modern way via platform firmware tables like ACPI
NFIT. The assumption with E820_PRAM is that it is dealing with
battery-backed NVDIMMs of small capacity. In that case the /dev/pmem
device can support DAX operation by default because the memory needed
for the 'struct page' array covering that capacity is likely small.
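
For illustration (a sketch only - the sizes, offsets, device name and
mount point are placeholders lifted from Dave's report above), the
legacy route needs nothing beyond the kernel command line before
mounting with DAX:

    memmap=8G!15G,8G!24G                      (kernel command line)
    mount -o dax=always /dev/pmem1 /mnt/scratch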

Platform-firmware-defined PMEM can be terabytes. So the driver does not
enable DAX by default, because the user needs to make a policy choice
between burning gigabytes of DRAM for that metadata or placing it in
PMEM, which is abundant but slower. What I suspect might have happened
is that your configuration changed from something that auto-allocated
the 'struct page' array to something that needs those commands you list
above to explicitly opt in to reserving some PMEM capacity for the page
metadata.
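
For example (a sketch - the namespace name is a placeholder), the
explicit opt-in is exactly the shape of the commands you list: --map dev
reserves capacity from the pmem itself for the page metadata, while
--map mem burns DRAM for it instead (roughly 64 bytes of 'struct page'
per 4KiB page on x86_64, i.e. on the order of 16GB of DRAM per 1TB of
pmem):

    ndctl create-namespace --mode fsdax --map dev -e namespace0.0 -f
    ndctl create-namespace --mode fsdax --map mem -e namespace0.0 -f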