dcache_readdir NULL inode oops

From: Jan Glauber
Date: Fri Nov 09 2018 - 09:38:37 EST


Hi Al,

I'm seeing the following oops, reproducible with an upstream kernel on arm64 (ThunderX2):

[ 5428.795719] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040
[ 5428.813838] Mem abort info:
[ 5428.820721] ESR = 0x96000006
[ 5428.828476] Exception class = DABT (current EL), IL = 32 bits
[ 5428.841590] SET = 0, FnV = 0
[ 5428.848939] EA = 0, S1PTW = 0
[ 5428.855941] Data abort info:
[ 5428.862422] ISV = 0, ISS = 0x00000006
[ 5428.870787] CM = 0, WnR = 0
[ 5428.877359] user pgtable: 4k pages, 48-bit VAs, pgdp = 0000000052f9e034
[ 5428.891098] [0000000000000040] pgd=0000007ebb0d6003, pud=0000007ed3073003, pmd=0000000000000000
[ 5428.909251] Internal error: Oops: 96000006 [#1] SMP
[ 5428.919122] Modules linked in: xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ip6table_filter ip6_tables iptable_filter ipmi_ssif ip_tables x_tables ipv6 crc32_ce bnx2x crct10dif_ce igb nvme nvme_core i2c_algo_bit mdio gpio_xlp i2c_xlp9xx
[ 5428.972724] CPU: 45 PID: 220018 Comm: stress-ng-dev Not tainted 4.19.0-jang+ #45
[ 5428.987664] Hardware name: To be filled by O.E.M. Saber/To be filled by O.E.M., BIOS 0ACKL018 03/30/2018
[ 5429.006819] pstate: 60400009 (nZCv daif +PAN -UAO)
[ 5429.016567] pc : dcache_readdir+0xfc/0x1a8
[ 5429.024903] lr : dcache_readdir+0x134/0x1a8
[ 5429.033376] sp : ffff00002d553d70
[ 5429.040101] x29: ffff00002d553d70 x28: ffff807db4988000
[ 5429.050892] x27: 0000000000000000 x26: 0000000000000000
[ 5429.061679] x25: 0000000056000000 x24: ffff8024577106c0
[ 5429.072457] x23: 0000000000000000 x22: ffff80267b92a480
[ 5429.083248] x21: ffff80267b92a520 x20: ffff8024575e5e00
[ 5429.094029] x19: ffff00002d553e40 x18: 0000000000000000
[ 5429.104805] x17: 0000000000000000 x16: 0000000000000000
[ 5429.115553] x15: 0000000000000000 x14: 0000000000000000
[ 5429.126332] x13: 0000000000000000 x12: 0000000000000000
[ 5429.137096] x11: 0000000000000000 x10: ffff80266b398228
[ 5429.147849] x9 : ffff80266b398000 x8 : 0000000000007e4e
[ 5429.158580] x7 : 0000000000000000 x6 : ffff00000830d190
[ 5429.169362] x5 : 0000000000000000 x4 : ffff00000d7506a8
[ 5429.180123] x3 : 0000000000000002 x2 : 0000000000000002
[ 5429.190890] x1 : ffff8024575e5e38 x0 : ffff00002d553e40
[ 5429.201715] Process stress-ng-dev (pid: 220018, stack limit = 0x000000009437ac28)
[ 5429.216828] Call trace:
[ 5429.221855] dcache_readdir+0xfc/0x1a8
[ 5429.229459] iterate_dir+0x8c/0x1a0
[ 5429.236561] ksys_getdents64+0xa4/0x188
[ 5429.244357] __arm64_sys_getdents64+0x28/0x38
[ 5429.253201] el0_svc_handler+0x7c/0x100
[ 5429.260989] el0_svc+0x8/0xc
[ 5429.266878] Code: a9429681 aa1303e0 b9402682 a9400e66 (f94020a4)
[ 5429.279192] ---[ end trace 5c1e28c07cf016c5 ]---
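
As a side note on reading the abort info above: the ESR value can be split into its fields by hand. The sketch below assumes the ESR_ELx layout from the Arm architecture manual (EC in bits [31:26], IL in bit 25, ISS in bits [24:0], and for data aborts DFSC in ISS[5:0] and WnR in ISS[6]). The helper names are made up for illustration and are not kernel functions.

```c
#include <stdint.h>

/* Exception class: bits [31:26] of ESR_ELx. */
static unsigned esr_ec(uint32_t esr)   { return (esr >> 26) & 0x3f; }

/* Instruction length bit: 1 means a 32-bit instruction trapped. */
static unsigned esr_il(uint32_t esr)   { return (esr >> 25) & 0x1; }

/* Instruction specific syndrome: bits [24:0]. */
static unsigned esr_iss(uint32_t esr)  { return esr & 0x1ffffff; }

/* Data abort only: data fault status code, ISS[5:0]. */
static unsigned esr_dfsc(uint32_t esr) { return esr & 0x3f; }

/* Data abort only: write-not-read, ISS[6]; 0 means the access was a read. */
static unsigned esr_wnr(uint32_t esr)  { return (esr >> 6) & 0x1; }
```

For 0x96000006 this gives EC = 0x25 (data abort from the current EL), IL = 1, ISS = 0x6, WnR = 0 (a read) and DFSC = 0x06, a level 2 translation fault, which fits the pmd=0 in the page table walk printed above.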

It happens after 1-3 hours of running 'stress-ng --dev 128'. This testcase does a scandir of /dev
and then issues assorted calls (ioctl, lseek, open/close, etc.) on the entries. I assume no files are
deleted under /dev during the testcase.
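
To give an idea of the access pattern, here is a rough sketch of one pass over a directory. This is not the stress-ng code: poke_dir_entries() is a hypothetical helper, it leaves out the ioctl calls (those are device specific), and it does a single pass where stress-ng runs many workers concurrently.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* scandir() filter: skip "." and ".." so only real entries are visited. */
static int skip_dots(const struct dirent *d)
{
	return strcmp(d->d_name, ".") != 0 && strcmp(d->d_name, "..") != 0;
}

/*
 * Scan 'dir' once and do open/lseek/close on every entry.
 * Returns the number of entries visited, or -1 if scandir() fails.
 */
static int poke_dir_entries(const char *dir)
{
	struct dirent **list;
	int n = scandir(dir, &list, skip_dots, alphasort);

	if (n < 0)
		return -1;

	for (int i = 0; i < n; i++) {
		char path[4096];
		int fd;

		snprintf(path, sizeof(path), "%s/%s", dir, list[i]->d_name);

		/* O_NONBLOCK so device nodes that would block on open do not stall the pass. */
		fd = open(path, O_RDONLY | O_NONBLOCK);
		if (fd >= 0) {
			(void)lseek(fd, 0, SEEK_SET);	/* many devices fail this, which is fine */
			close(fd);
		}
		free(list[i]);
	}
	free(list);
	return n;
}
```

Calling poke_dir_entries("/dev") in a loop from many processes roughly approximates the load, with the scandir() part being what ends up in getdents64 and dcache_readdir.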

The NULL pointer is the inode pointer of the 'next' dentry in dcache_readdir(). When this
happens, next->d_flags is DCACHE_RCUACCESS.
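
As a cross-check, the faulting instruction (the word in parentheses on the Code: line) can be decoded by hand. The sketch below only handles the A64 encoding of a 64-bit LDR with an unsigned immediate offset, which is what I assume this instruction is. The struct and function names are made up for illustration.

```c
#include <stdint.h>

struct ldr_uimm {
	unsigned rt;		/* destination register number */
	unsigned rn;		/* base register number */
	unsigned offset;	/* byte offset added to the base */
};

/*
 * Decode a 64-bit "LDR Xt, [Xn, #imm]" (unsigned offset form).
 * Returns 1 and fills 'out' on a match, 0 for any other instruction.
 */
static int decode_ldr64_uimm(uint32_t insn, struct ldr_uimm *out)
{
	/* The top ten bits identify the instruction class and size. */
	if ((insn & 0xffc00000u) != 0xf9400000u)
		return 0;

	out->rt = insn & 0x1f;
	out->rn = (insn >> 5) & 0x1f;
	/* imm12 is scaled by the access size, 8 bytes for a 64-bit load. */
	out->offset = ((insn >> 10) & 0xfff) * 8;
	return 1;
}
```

For f94020a4 this yields rt = 4, rn = 5, offset = 64, i.e. "ldr x4, [x5, #64]". x5 is 0 in the register dump, so the load hits address 0x40, which matches the reported fault address.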

Any hints on how to further debug this?

--Jan