Re: power9 NUMA crash while reading debugfs imc_cmd

From: Qian Cai
Date: Thu Jun 27 2019 - 23:34:17 EST

> On Jun 27, 2019, at 11:12 PM, Michael Ellerman <mpe@xxxxxxxxxxxxxx> wrote:
>
> Qian Cai <cai@xxxxxx> writes:
>> Reading the debugfs imc_cmd file for a memory-less node triggers the crash
>> below on this power9 machine, which has the following NUMA layout.
>
> What type of machine is it?

description: PowerNV
product: 8335-GTH (ibm,witherspoon)
vendor: IBM
width: 64 bits
capabilities: smp powernv opal

>
> cheers
>
>> I don't understand why I only saw it recently on linux-next, where it
>> was tested every day. I can reproduce it back to 4.20, whereas 4.18 seems
>> to work fine.
>>
>> # cat /sys/kernel/debug/powerpc/imc/imc_cmd_252 (On a 4.18-based kernel)
>> 0x0000000000000000
>>
>> # numactl -H
>> available: 6 nodes (0,8,252-255)
>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
>> node 0 size: 130210 MB
>> node 0 free: 128406 MB
>> node 8 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
>> node 8 size: 130784 MB
>> node 8 free: 130051 MB
>> node 252 cpus:
>> node 252 size: 0 MB
>> node 252 free: 0 MB
>> node 253 cpus:
>> node 253 size: 0 MB
>> node 253 free: 0 MB
>> node 254 cpus:
>> node 254 size: 0 MB
>> node 254 free: 0 MB
>> node 255 cpus:
>> node 255 size: 0 MB
>> node 255 free: 0 MB
>> node distances:
>> node    0    8  252  253  254  255
>>   0:   10   40   80   80   80   80
>>   8:   40   10   80   80   80   80
>> 252:   80   80   10   80   80   80
>> 253:   80   80   80   10   80   80
>> 254:   80   80   80   80   10   80
>> 255:   80   80   80   80   80   10
>>
>> # cat /sys/kernel/debug/powerpc/imc/imc_cmd_252
>>
>> [ 1139.415461][ T5301] Faulting instruction address: 0xc0000000000d0d58
>> [ 1139.415492][ T5301] Oops: Kernel access of bad area, sig: 11 [#1]
>> [ 1139.415509][ T5301] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=256 DEBUG_PAGEALLOC NUMA PowerNV
>> [ 1139.415542][ T5301] Modules linked in: i2c_opal i2c_core ip_tables x_tables xfs sd_mod bnx2x mdio ahci libahci tg3 libphy libata firmware_class dm_mirror dm_region_hash dm_log dm_mod
>> [ 1139.415595][ T5301] CPU: 67 PID: 5301 Comm: cat Not tainted 5.2.0-rc6-next-20190627+ #19
>> [ 1139.415634][ T5301] NIP: c0000000000d0d58 LR: c00000000049aa18 CTR: c0000000000d0d50
>> [ 1139.415675][ T5301] REGS: c00020194548f9e0 TRAP: 0300 Not tainted (5.2.0-rc6-next-20190627+)
>> [ 1139.415705][ T5301] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28022822 XER: 00000000
>> [ 1139.415777][ T5301] CFAR: c00000000049aa14 DAR: 000000000003fc08 DSISR: 40000000 IRQMASK: 0
>> [ 1139.415777][ T5301] GPR00: c00000000049aa18 c00020194548fc70 c0000000016f8b00 000000000003fc08
>> [ 1139.415777][ T5301] GPR04: c00020194548fcd0 0000000000000000 0000000014884e73 ffffffff00011eaa
>> [ 1139.415777][ T5301] GPR08: 000000007eea5a52 c0000000000d0d50 0000000000000000 0000000000000000
>> [ 1139.415777][ T5301] GPR12: c0000000000d0d50 c000201fff7f8c00 0000000000000000 0000000000000000
>> [ 1139.415777][ T5301] GPR16: 000000000000000d 00007fffeb0c3368 ffffffffffffffff 0000000000000000
>> [ 1139.415777][ T5301] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000020000
>> [ 1139.415777][ T5301] GPR24: 0000000000000000 0000000000000000 0000000000020000 000000010ec90000
>> [ 1139.415777][ T5301] GPR28: c00020194548fdf0 c00020049a584ef8 0000000000000000 c00020049a584ea8
>> [ 1139.416116][ T5301] NIP [c0000000000d0d58] imc_mem_get+0x8/0x20
>> [ 1139.416143][ T5301] LR [c00000000049aa18] simple_attr_read+0x118/0x170
>> [ 1139.416158][ T5301] Call Trace:
>> [ 1139.416182][ T5301] [c00020194548fc70] [c00000000049a970] simple_attr_read+0x70/0x170 (unreliable)
>> [ 1139.416255][ T5301] [c00020194548fd10] [c00000000054385c] debugfs_attr_read+0x6c/0xb0
>> [ 1139.416305][ T5301] [c00020194548fd60] [c000000000454c1c] __vfs_read+0x3c/0x70
>> [ 1139.416363][ T5301] [c00020194548fd80] [c000000000454d0c] vfs_read+0xbc/0x1a0
>> [ 1139.416392][ T5301] [c00020194548fdd0] [c00000000045519c] ksys_read+0x7c/0x140
>> [ 1139.416434][ T5301] [c00020194548fe20] [c00000000000b108] system_call+0x5c/0x70
>> [ 1139.416473][ T5301] Instruction dump:
>> [ 1139.416511][ T5301] 4e800020 60000000 7c0802a6 60000000 7c801d28 38600000 4e800020 60000000
>> [ 1139.416572][ T5301] 60000000 60000000 7c0802a6 60000000 <7d201c28> 38600000 f9240000 4e800020
>> [ 1139.416636][ T5301] ---[ end trace c44d1fb4ace04784 ]---
>> [ 1139.520686][ T5301]
>> [ 1140.520820][ T5301] Kernel panic - not syncing: Fatal exception
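
For anyone checking whether a machine has the same kind of layout, the memory-less nodes can be picked out of `numactl -H` output with a few lines of Python. This is only a rough sketch: the helper name `memoryless_nodes` is made up for illustration, and it assumes the "node N size: M MB" line format shown in the report above.

```python
import re


def memoryless_nodes(numactl_output):
    """Return the node IDs that `numactl -H` reports with a size of 0 MB.

    Hypothetical helper: it only understands "node N size: M MB" lines,
    which is the format seen in the report above.
    """
    nodes = []
    for line in numactl_output.splitlines():
        # Tolerate mail quote prefixes such as ">> " in front of each line.
        m = re.match(r"^[>\s]*node (\d+) size: (\d+) MB\s*$", line)
        if m and int(m.group(2)) == 0:
            nodes.append(int(m.group(1)))
    return nodes


# Size lines copied from the numactl -H output in the report.
sample = """\
node 0 size: 130210 MB
node 8 size: 130784 MB
node 252 size: 0 MB
node 253 size: 0 MB
node 254 size: 0 MB
node 255 size: 0 MB
"""

print(memoryless_nodes(sample))  # [252, 253, 254, 255]
```

On this machine that gives nodes 252-255, which includes the node behind imc_cmd_252, the file whose read triggered the oops above.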