Re: [PATCH v3 0/8] Support for transparent PUD pages for DAX files

From: Darrick J. Wong
Date: Fri Jan 15 2016 - 14:42:11 EST


On Fri, Jan 08, 2016 at 02:49:44PM -0500, Matthew Wilcox wrote:
> From: Matthew Wilcox <willy@xxxxxxxxxxxxxxx>
>
> Andrew, I think this is ready for a spin in -mm.
>
> v3: Rebased against current mmtom
> v2: Reduced churn in filesystems by switching to ->huge_fault interface
> Addressed concerns from Kirill
>
> We have customer demand to use 1GB pages to map DAX files. Unlike the 2MB
> page support, the Linux MM does not currently support PUD pages, so I have
> attempted to add support for the necessary pieces for DAX huge PUD pages.
>
> Filesystems still need work to allocate 1GB pages. With ext4, I can
> only get 16MB of contiguous space, although it is aligned. With XFS,
> I can get 80MB less than 1GB, and it's not aligned. The XFS problem
> may be due to the small amount of RAM in my test machine.

"It's not aligned"... I don't know the details of what you're trying to do, but
are you trying to create a file where each GB of logical address space maps to
a contiguous GB of physical space, and both logical and physical offsets align
to a 1GB boundary?

If the XFS is formatted with stripe unit/width of 1G, an extent size hint of 1G
is put on the file, and the whole file is allocated in 1G chunks, I think
you're supposed to be able to make the above happen:

# mkfs.xfs /dev/mapper/moo -f -d su=1g,sw=1
meta-data=/dev/mapper/moo isize=512 agcount=34, agsize=8126464 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=0, rmapbt=0, reflink=0
data = bsize=4096 blocks=268435456, imaxpct=5
= sunit=262144 swidth=262144 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=131072, version=2
= sectsz=512 sunit=8 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
# mount /dev/mapper/moo /mnt
# xfs_io -f -c 'extsize 1g' -c 'falloc 0 200g' /mnt/urk
# filefrag -v /mnt/urk
Filesystem type is: 58465342
File size of /mnt/urk is 214748364800 (52428800 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 7340031: 524288.. 7864319: 7340032: unwritten
1: 7340032..14680063: 8388608.. 15728639: 7340032: 7864320: unwritten
2: 14680064..22020095: 16515072.. 23855103: 7340032: 15728640: unwritten
3: 22020096..29360127: 24641536.. 31981567: 7340032: 23855104: unwritten
4: 29360128..36700159: 32768000.. 40108031: 7340032: 31981568: unwritten
5: 36700160..40370175: 40894464.. 44564479: 3670016: 40108032: unwritten
6: 40370176..44040191: 44826624.. 48496639: 3670016: 44564480: unwritten
7: 44040192..51380223: 49020928.. 56360959: 7340032: 48496640: unwritten
8: 51380224..52428799: 57147392.. 58195967: 1048576: 56360960: last,unwritten,eof
/mnt/urk: 9 extents found

AFAICT each extent's logical and physical offsets are aligned to a 1G boundary.

<shrug> Just a shot in the dark.

(This VM has 2G of memory and 1T of fake disk.)

--D

>
> This patch set is against something approximately current -mm. I'd like
> to thank Dave Chinner & Kirill Shutemov for their reviews of v1.
> The conversion of pmd_fault & pud_fault to huge_fault is thanks to
> Dave's poking, and Kirill spotted a couple of problems in the MM code.
> Version 2 of the patch set is about 200 lines smaller (1016 insertions,
> 23 deletions in v1).
>
> I've done some light testing using a program to mmap a block device
> with DAX enabled, calling mincore() and examining /proc/smaps and
> /proc/pagemap.
>
> Matthew Wilcox (8):
> mm: Convert an open-coded VM_BUG_ON_VMA
> mm,fs,dax: Change ->pmd_fault to ->huge_fault
> mm: Add support for PUD-sized transparent hugepages
> mincore: Add support for PUDs
> procfs: Add support for PUDs to smaps, clear_refs and pagemap
> x86: Add support for PUD-sized transparent hugepages
> dax: Support for transparent PUD pages
> ext4: Support for PUD-sized transparent huge pages
>
> Documentation/filesystems/dax.txt | 12 +-
> arch/Kconfig | 3 +
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/paravirt.h | 11 ++
> arch/x86/include/asm/paravirt_types.h | 2 +
> arch/x86/include/asm/pgtable.h | 94 ++++++++++++
> arch/x86/include/asm/pgtable_64.h | 13 ++
> arch/x86/kernel/paravirt.c | 1 +
> arch/x86/mm/pgtable.c | 31 ++++
> fs/block_dev.c | 10 +-
> fs/dax.c | 272 +++++++++++++++++++++++++---------
> fs/ext2/file.c | 27 +---
> fs/ext4/file.c | 60 +++-----
> fs/proc/task_mmu.c | 109 ++++++++++++++
> fs/xfs/xfs_file.c | 25 ++--
> fs/xfs/xfs_trace.h | 2 +-
> include/asm-generic/pgtable.h | 62 +++++++-
> include/asm-generic/tlb.h | 14 ++
> include/linux/dax.h | 17 ---
> include/linux/huge_mm.h | 50 +++++++
> include/linux/mm.h | 43 +++++-
> include/linux/mmu_notifier.h | 13 ++
> include/linux/pfn_t.h | 8 +
> mm/huge_memory.c | 151 +++++++++++++++++++
> mm/memory.c | 101 +++++++++++--
> mm/mincore.c | 13 ++
> mm/pagewalk.c | 19 ++-
> mm/pgtable-generic.c | 14 ++
> 28 files changed, 980 insertions(+), 198 deletions(-)
>
> --
> 2.6.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html