Re: POWER9 crash due to STRICT_KERNEL_RWX (WAS: Re: Linux-next POWER9 NULL pointer NIP...)

From: Qian Cai
Date: Thu Apr 16 2020 - 23:16:48 EST




> On Apr 16, 2020, at 10:46 PM, Russell Currey <ruscur@xxxxxxxxxx> wrote:
>
> On Thu, 2020-04-16 at 22:40 -0400, Qian Cai wrote:
>>> On Apr 16, 2020, at 10:27 PM, Russell Currey <ruscur@xxxxxxxxxx>
>>> wrote:
>>>
>>> Reverting the patch with the given config will have the same effect
>>> as
>>> STRICT_KERNEL_RWX=n. Not discounting that it could be a bug on the
>>> powerpc side (i.e. relocatable kernels with strict RWX on haven't
>>> been
>>> exhaustively tested yet), but we should definitely figure out
>>> what's
>>> going on with this bad access first.
>>
>> BTW, this bad access only happened once. The overwhelming rest of
>> crashes are with NULL pointer NIP like below. How can you explain
>> that STRICT_KERNEL_RWX=n would also make those NULL NIP disappear if
>> STRICT_KERNEL_RWX is just a messenger?
>
> What happens if you test with STRICT_KERNEL_RWX=y and RELOCATABLE=n,
> reverting my patch? This would give us an idea of whether it's
> something broken recently or if there's something else going on.

I don't know what you meant by reverting your patch, because that combination
can be tested as-is. Anyway, it could take a long time to reproduce, so I'll keep it
running for up to 12 hours to confirm that it really does not crash.
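
FWIW, one way to double-check that a build really has the
STRICT_KERNEL_RWX=y / RELOCATABLE=n combination is to parse its .config.
Below is a minimal Python sketch, assuming the usual .config line formats
("CONFIG_FOO=y" and "# CONFIG_FOO is not set"). The function names and the
sample text are hypothetical, made up for illustration only.

```python
import re


def parse_kconfig(text):
    """Map CONFIG_* names to values; options marked 'is not set' become 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"^(CONFIG_\w+)=(.*)$", line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        m = re.match(r"^# (CONFIG_\w+) is not set$", line)
        if m:
            options[m.group(1)] = "n"
    return options


def matches_test_combo(options):
    """True if STRICT_KERNEL_RWX is enabled and RELOCATABLE is disabled."""
    # An option missing from the file is treated as disabled (an assumption).
    return (options.get("CONFIG_STRICT_KERNEL_RWX", "n") == "y"
            and options.get("CONFIG_RELOCATABLE", "n") == "n")


# Hypothetical sample; in practice read the real .config of the build.
sample = """\
CONFIG_STRICT_KERNEL_RWX=y
# CONFIG_RELOCATABLE is not set
"""
print(matches_test_combo(parse_kconfig(sample)))
```

To run it against a real build, pass the contents of that build's .config
to parse_kconfig() instead of the sample string.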

>
>>
>> [ 215.281666][T16896] LTP: starting chown04_16
>> [ 215.424203][T18297] BUG: Unable to handle kernel instruction fetch
>> (NULL pointer?)
>> [ 215.424289][T18297] Faulting instruction address: 0x00000000
>> [ 215.424313][T18297] Oops: Kernel access of bad area, sig: 11 [#1]
>> [ 215.424341][T18297] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256
>> DEBUG_PAGEALLOC NUMA PowerNV
>> [ 215.424383][T18297] Modules linked in: loop kvm_hv kvm ip_tables
>> x_tables xfs sd_mod bnx2x mdio tg3 ahci libahci libphy libata
>> firmware_class dm_mirror dm_region_hash dm_log dm_mod
>> [ 215.424459][T18297] CPU: 85 PID: 18297 Comm: chown04_16 Tainted:
>> G W 5.6.0-next-20200405+ #3
>> [ 215.424489][T18297] NIP: 0000000000000000 LR: c00800000fbc0408
>> CTR: 0000000000000000
>> [ 215.424530][T18297] REGS: c000200b8606f990 TRAP: 0400 Tainted:
>> G W (5.6.0-next-20200405+)
>> [ 215.424570][T18297] MSR: 9000000040009033
>> <SF,HV,EE,ME,IR,DR,RI,LE> CR: 84000248 XER: 20040000
>> [ 215.424619][T18297] CFAR: c00800000fbc64f4 IRQMASK: 0
>> [ 215.424619][T18297] GPR00: c0000000006c2238 c000200b8606fc20
>> c00000000165ce00 0000000000000000
>> [ 215.424619][T18297] GPR04: c000201a58106400 c000200b8606fcc0
>> 000000005f037e7d ffffffff00013bfb
>> [ 215.424619][T18297] GPR08: c000201a58106400 0000000000000000
>> 0000000000000000 c000000001652ee0
>> [ 215.424619][T18297] GPR12: 0000000000000000 c000201fff69a600
>> 0000000000000000 0000000000000000
>> [ 215.424619][T18297] GPR16: 0000000000000000 0000000000000000
>> 0000000000000000 0000000000000000
>> [ 215.424619][T18297] GPR20: 0000000000000000 0000000000000000
>> 0000000000000000 0000000000000007
>> [ 215.424619][T18297] GPR24: 0000000000000000 0000000000000000
>> c00800000fbc8688 c000200b8606fcc0
>> [ 215.424619][T18297] GPR28: 0000000000000000 000000007fffffff
>> c00800000fbc0400 c00020068b8c0e70
>> [ 215.424914][T18297] NIP [0000000000000000] 0x0
>> [ 215.424953][T18297] LR [c00800000fbc0408] find_free_cb+0x8/0x30
>> [loop]
>> find_free_cb at drivers/block/loop.c:2129
>> [ 215.424997][T18297] Call Trace:
>> [ 215.425036][T18297] [c000200b8606fc20] [c0000000006c2290]
>> idr_for_each+0xf0/0x170 (unreliable)
>> [ 215.425073][T18297] [c000200b8606fca0] [c00800000fbc2744]
>> loop_lookup.part.2+0x4c/0xb0 [loop]
>> loop_lookup at drivers/block/loop.c:2144
>> [ 215.425105][T18297] [c000200b8606fce0] [c00800000fbc3558]
>> loop_control_ioctl+0x120/0x1d0 [loop]
>> [ 215.425149][T18297] [c000200b8606fd40] [c0000000004eb688]
>> ksys_ioctl+0xd8/0x130
>> [ 215.425190][T18297] [c000200b8606fd90] [c0000000004eb708]
>> sys_ioctl+0x28/0x40
>> [ 215.425233][T18297] [c000200b8606fdb0] [c00000000003cc30]
>> system_call_exception+0x110/0x1e0
>> [ 215.425274][T18297] [c000200b8606fe20] [c00000000000c9f0]
>> system_call_common+0xf0/0x278
>> [ 215.425314][T18297] Instruction dump:
>> [ 215.425338][T18297] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
>> XXXXXXXX XXXXXXXX XXXXXXXX
>> [ 215.425374][T18297] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
>> XXXXXXXX XXXXXXXX XXXXXXXX
>> [ 215.425422][T18297] ---[ end trace ebed248fad431966 ]---
>> [ 215.642114][T18297]
>> [ 216.642220][T18297] Kernel panic - not syncing: Fatal exception