Re: mm: WARNING in __delete_from_page_cache

From: Ross Zwisler
Date: Thu Jan 28 2016 - 14:52:05 EST


On Wed, Jan 27, 2016 at 03:04:21PM -0700, Ross Zwisler wrote:
> On Tue, Jan 26, 2016 at 04:59:27PM +0000, Williams, Dan J wrote:
> > 8<---- (git am --scissors)
> > Subject: fs, block: force direct-I/O for dax-enabled block devices
> >
> > From: Dan Williams <dan.j.williams@xxxxxxxxx>
> >
> > Similar to the file I/O path, re-direct all I/O to the DAX path for I/O
> > to a block-device special file.  Both regular files and device special
> > files can use the common filp->f_mapping->host lookup to determing is
> > DAX is enabled.
> >
> > Otherwise, we confuse the DAX code that does not expect to find live
> > data in the page cache:
> >
> >     ------------[ cut here ]------------
> >     WARNING: CPU: 0 PID: 7676 at mm/filemap.c:217
> >     __delete_from_page_cache+0x9f6/0xb60()
> >     Modules linked in:
> >     CPU: 0 PID: 7676 Comm: a.out Not tainted 4.4.0+ #276
> >     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> >      00000000ffffffff ffff88006d3f7738 ffffffff82999e2d 0000000000000000
> >      ffff8800620a0000 ffffffff86473d20 ffff88006d3f7778 ffffffff81352089
> >      ffffffff81658d36 ffffffff86473d20 00000000000000d9 ffffea0000009d60
> >     Call Trace:
> >      [<     inline     >] __dump_stack lib/dump_stack.c:15
> >      [<ffffffff82999e2d>] dump_stack+0x6f/0xa2 lib/dump_stack.c:50
> >      [<ffffffff81352089>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:482
> >      [<ffffffff813522b9>] warn_slowpath_null+0x29/0x30 kernel/panic.c:515
> >      [<ffffffff81658d36>] __delete_from_page_cache+0x9f6/0xb60 mm/filemap.c:217
> >      [<ffffffff81658fb2>] delete_from_page_cache+0x112/0x200 mm/filemap.c:244
> >      [<ffffffff818af369>] __dax_fault+0x859/0x1800 fs/dax.c:487
> >      [<ffffffff8186f4f6>] blkdev_dax_fault+0x26/0x30 fs/block_dev.c:1730
> >      [<     inline     >] wp_pfn_shared mm/memory.c:2208
> >      [<ffffffff816e9145>] do_wp_page+0xc85/0x14f0 mm/memory.c:2307
> >      [<     inline     >] handle_pte_fault mm/memory.c:3323
> >      [<     inline     >] __handle_mm_fault mm/memory.c:3417
> >      [<ffffffff816ecec3>] handle_mm_fault+0x2483/0x4640 mm/memory.c:3446
> >      [<ffffffff8127eff6>] __do_page_fault+0x376/0x960 arch/x86/mm/fault.c:1238
> >      [<ffffffff8127f738>] trace_do_page_fault+0xe8/0x420 arch/x86/mm/fault.c:1331
> >      [<ffffffff812705c4>] do_async_page_fault+0x14/0xd0 arch/x86/kernel/kvm.c:264
> >      [<ffffffff86338f78>] async_page_fault+0x28/0x30 arch/x86/entry/entry_64.S:986
> >      [<ffffffff86336c36>] entry_SYSCALL_64_fastpath+0x16/0x7a
> >     arch/x86/entry/entry_64.S:185
> >     ---[ end trace dae21e0f85f1f98c ]---
> >
> > Cc: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>
> > Fixes: 5a023cdba50c ("block: enable dax for raw block devices")
> > Reported-by: Dmitry Vyukov <dvyukov@xxxxxxxxxx>
> > Reported-by: Kirill A. Shutemov <kirill@xxxxxxxxxxxxx>
> > Suggested-by: Jan Kara <jack@xxxxxxx>
> > Reviewed-by: Jan Kara <jack@xxxxxxx>
> > Suggested-by: Matthew Wilcox <willy@xxxxxxxxxxxxxxx>
> > Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
>
> I had a test case where I was hitting a warning while inserting into the page
> cache when the inode was supposed to be DAX, and this clears up my issue as
> well.
>
> Tested-by: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>
>
> > ---
> >  include/linux/fs.h |    2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 1a2046275cdf..b10002d4a5f5 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2907,7 +2907,7 @@ extern void replace_mount_options(struct super_block *sb, char *options);
> >  
> >  static inline bool io_is_direct(struct file *filp)
> >  {
> > - return (filp->f_flags & O_DIRECT) || IS_DAX(file_inode(filp));
> > + return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
> >  }
> >  
> >  static inline int iocb_flags(struct file *file)

My testing has turned up another case where we can end up doing both page
cache I/O and DAX I/O to the same raw block device. Here is the stack trace
of the error, passed through kasan_symbolize.py:

------------[ cut here ]------------
WARNING: CPU: 1 PID: 598 at mm/filemap.c:590 __add_to_page_cache_locked+0x30d/0x3c0()
Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
CPU: 1 PID: 598 Comm: systemd-udevd Tainted: G W 4.5.0-rc1+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
0000000000000000 0000000001e02977 ffff88040ff9fa88 ffffffff81576db2
0000000000000000 ffff88040ff9fac0 ffffffff810a65a2 ffff8804103d9020
0000000000000000 ffff8804103d9008 00000000ffffffea ffffea0002ebf080
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[<ffffffff81576db2>] dump_stack+0x44/0x62 lib/dump_stack.c:50
[<ffffffff810a65a2>] warn_slowpath_common+0x82/0xc0 kernel/panic.c:482
[<ffffffff810a66ea>] warn_slowpath_null+0x1a/0x20 kernel/panic.c:515
[< inline >] page_cache_tree_insert mm/filemap.c:590
[<ffffffff811ca32d>] __add_to_page_cache_locked+0x30d/0x3c0 mm/filemap.c:649
[<ffffffff811ca449>] add_to_page_cache_lru+0x49/0xd0 mm/filemap.c:697
[< inline >] __read_cache_page mm/filemap.c:2300
[<ffffffff811ca87b>] do_read_cache_page+0x6b/0x300 mm/filemap.c:2330
[<ffffffff811cab2c>] read_cache_page+0x1c/0x20 mm/filemap.c:2377
[< inline >] read_mapping_page include/linux/pagemap.h:391
[<ffffffff81558d74>] read_dev_sector+0x34/0xf0 block/partition-generic.c:558
[< inline >] read_part_sector block/partitions/check.h:37
[<ffffffff81560c4e>] read_lba+0x18e/0x290 block/partitions/efi.c:264
[< inline >] find_valid_gpt block/partitions/efi.c:610
[<ffffffff81561542>] efi_partition+0xf2/0x7d0 block/partitions/efi.c:692
[<ffffffff8155b47e>] check_partition+0x13e/0x220 block/partitions/check.c:166
[<ffffffff81559670>] rescan_partitions+0xc0/0x2b0 block/partition-generic.c:434
[<ffffffff81553aeb>] __blkdev_reread_part+0x6b/0xa0 block/ioctl.c:171
[<ffffffff81553b45>] blkdev_reread_part+0x25/0x40 block/ioctl.c:191
[<ffffffff815547b9>] blkdev_ioctl+0x5a9/0xab0 block/ioctl.c:624
[<ffffffff812a2303>] block_ioctl+0x43/0x50 fs/block_dev.c:1624
[< inline >] vfs_ioctl fs/ioctl.c:43
[<ffffffff81274452>] do_vfs_ioctl+0xa2/0x6a0 fs/ioctl.c:674
[< inline >] SYSC_ioctl fs/ioctl.c:689
[<ffffffff81274ac9>] SyS_ioctl+0x79/0x90 fs/ioctl.c:680
[<ffffffff81a6a732>] entry_SYSCALL_64_fastpath+0x12/0x76 arch/x86/entry/entry_64.S:185
---[ end trace 96171b39eee8b31b ]---

Here is my minimal reproducer:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>

#define PAGE(a) ((a)*0x1000)

int main(int argc, char *argv[])
{
int i, fd;
char *data_array = (char*) 0x10200000; /* request a 2MiB aligned address with mmap() */
int a;

fd = open("/dev/pmem0", O_RDWR);
if (fd < 0) {
perror("fd");
return 1;
}

data_array = mmap(data_array, PAGE(0x300), PROT_READ|PROT_WRITE,
MAP_SHARED, fd, PAGE(0));

data_array[PAGE(0x0)] = 1;
close(fd);
return 0;
}

I believe what is happening is that I am doing a DAX PMD fault, and that fault
is happening at the same time that an IOCTL is doing a scan for partition
tables. The above stack trace is for a scan of an efi_partition(), but I get a
similar stack trace going through sgi_partition(), ldm_partition(),
msdos_partition(), etc. This looping is happening in check_partition(), as it
loops through the various partition types.

I think the issue essentially is that the partition scanning code doesn't have
a direct I/O code path, at least from what I have found. This means that it's
not calling into DAX, and instead is just using the page cache.