Re: "kernel ade access" oops on LoongArch

From: Huacai Chen
Date: Tue Feb 14 2023 - 09:52:08 EST


Hi, Ruoyao,

It seems to be related to Youling's relative exception patchset.

Huacai

On Tue, Feb 14, 2023 at 4:46 PM Xi Ruoyao <xry111@xxxxxxxxxxx> wrote:
>
> This is a "help wanted" message :(.
>
> I've recently run into a strange kernel oops while testing Glibc for LoongArch. A log looks like:
>
> [11569.195043] Kernel ade access[#1]:
> [11569.198441] CPU: 1 PID: 1132296 Comm: ld-linux-loonga Not tainted 6.2.0-rc8+ #61
> [11569.205792] Hardware name: Loongson Loongson-3A5000-HV-7A2000-1w-V0.1-EVB/Loongson-LS3A5000-7A2000-1w-EVB-V1.21, BIOS Loongson-UDK2018-V4.0.05383-beta10 1
> [11569.219536] $ 0 : 0000000000000000 90000000005e3448 90000001113a0000 90000001113a3ab0
> [11569.227505] $ 4 : 90000001113a3af8 1000000000cf16d0 5555555555555850 000000000000000c
> [11569.235475] $ 8 : 90000000009caa10 0000000000000000 00000000000002ca 000000000000008b
> [11569.243438] $12 : 0000000000000001 9000000000cf1258 ffffffffffffffff 00007ffffb93c000
> [11569.251402] $16 : 0000000000000000 0000000000000140 0000000000000000 0000000000000020
> [11569.259366] $20 : 90000001113a3ec8 9000000000a97ee0 00007ffffb93bfa0 1555555555555613
> [11569.267334] $24 : 1000000000cf16d0 000000000000000c 9000000000cf1258 90000000009caa10
> [11569.275303] $28 : 90000001113a3af8 0aaaaaaaaaaaab0a 00007ffffb93bde0 90000001113a3ec0
> [11569.283268] era : 90000000009caa10 cmp_ex_search+0x0/0x28
> [11569.288814] ra : 90000000005e3448 bsearch+0x58/0xa8
> [11569.293921] CSR crmd: 000000b0
> [11569.293923] CSR prmd: 00000004
> [11569.297037] CSR euen: 00000000
> [11569.300152] CSR ecfg: 00071c1c
> [11569.303266] CSR estat: 00480000
> [11569.309587] ExcCode : 8 (SubCode 1)
> [11569.313049] BadVA : 1000000000cf16d0
> [11569.316596] PrId : 0014c011 (Loongson-64bit)
> [11569.320923] Modules linked in: amdgpu nls_cp936 vfat fat input_leds drm_ttm_helper ttm video gpu_sched drm_buddy snd_hda_codec_generic drm_display_helper ledtrig_audio drm_kms_helper led_class snd_hda_intel sha256_generic snd_intel_dspcfg cfbfillrect libsha256 snd_hda_codec syscopyarea snd_hda_core hid_generic cfbimgblt cfg80211 snd_pcm sysfillrect usbhid sysimgblt snd_timer cfbcopyarea hid snd igb soundcore efivarfs
> [11569.357709] Process ld-linux-loonga (pid: 1132296, threadinfo=000000003cbd0caa, task=000000005bcd27a6)
> [11569.366977] Stack : 00007ffffb93bd60 0000000000000000 9000000180a36a40 0000000000000001
> [11569.374940] 90000001113a3bb0 00007ffffb93c000 9000000000224c94 90000000009cab2c
> [11569.382899] 0000000000000001 9000000000224c94 00007ffff3258000 900000000025a1b4
> [11569.390866] 90000001113a3bb0 900000000022f4cc 00007ffffb93c000 900000000022f74c
> [11569.398834] 9000000180a36a40 0000000000000001 0000000000000000 00007ffffb93c000
> [11569.406800] 90000001113a3bb0 900000000022f8f8 90000001113a3ec0 00007ffffb93bde0
> [11569.414768] 00007ffffb93bd60 0000000000000000 0000000000000000 00007fffff7c4600
> [11569.422734] 9000000182ebab70 9000000000d08000 0000000046505501 900000000022ee6c
> [11569.430698] 0000000000000000 9000000000224b84 90000001113a0000 90000001113a3cf0
> [11569.438661] 0000000000000000 00007ffffb93c0d0 0000000000000000 0000000000000040
> [11569.446627] ...
> [11569.449058] Call Trace:
> [11569.449062] [<90000000009caa10>] cmp_ex_search+0x0/0x28
> [11569.456681] [<90000000005e3448>] bsearch+0x58/0xa8
> [11569.461443] [<90000000009cab2c>] search_extable+0x28/0x34
> [11569.466807] [<900000000025a1b4>] search_exception_tables+0x48/0x7c
> [11569.472953] [<900000000022f4cc>] fixup_exception+0x18/0xcc
> [11569.478410] [<900000000022f74c>] do_sigsegv+0x174/0x1b0
> [11569.483605] [<900000000022f8f8>] do_page_fault+0x170/0x344
> [11569.489058] [<900000000022ee6c>] tlb_do_page_fault_1+0x128/0x1c4
> [11569.495029] [<9000000000224b84>] handle_signal+0x634/0x884
> [11569.500487] [<9000000000225704>] arch_do_signal_or_restart+0xb4/0xe0
> [11569.506808] [<90000000002b5b30>] exit_to_user_mode_prepare+0xbc/0x100
> [11569.513214] [<9000000000a02628>] syscall_exit_to_user_mode+0x30/0x4c
> [11569.519533] [<90000000002214a4>] handle_syscall+0xc4/0x160
>
> [11569.526472] Code: 4c000020 02800404 4c000020 <240000ac> 26000084 0010b0a5 680014a4 00129484 00111004
>
> [11569.537704] ---[ end trace 0000000000000000 ]---
>
> "BadVA : 1000000000cf16d0" may suggest the highest bit of an address is
> somehow cleared.
>
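
The "highest bit cleared" reading can be checked with a little arithmetic.
The snippet below is only a sketch of that check: it sets bit 63 of the
reported BadVA and compares the result with the value of $13 from the register
dump above. Picking $13 for the comparison is an assumption made purely for
illustration; nothing in the log says which register holds the table being
searched.

```python
# Values copied from the oops log above.
BAD_VA = 0x1000000000CF16D0   # "BadVA" line
REG_13 = 0x9000000000CF1258   # "$13" in the register dump (chosen for illustration)

TOP_BIT = 1 << 63

# Hypothesis from the report: the address lost its highest bit.
restored = BAD_VA | TOP_BIT

print(hex(restored))            # 0x9000000000cf16d0
print(hex(restored ^ BAD_VA))   # 0x8000000000000000, i.e. only bit 63 differs
print(hex(restored - REG_13))   # 0x478, distance from the value in $13
```

With bit 63 set, the address has the same 0x9000... prefix as the other kernel
addresses in the dump, which is consistent with the hypothesis but does not
prove it.
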
> The issue is not deterministic, but it seems easily reproduced by:
>
> 1. Compile Glibc:
>
> ../glibc/configure --prefix=/usr \
>     --disable-werror \
>     --enable-kernel=5.19 \
>     --enable-stack-protector=strong \
>     --with-headers=/usr/include \
>     libc_cv_slibdir=/usr/lib
> make -j4
>
> 2. Check Glibc:
>
> make check -j4
>
> 3. If the oops did not happen during the last step, run a specific test
> in an infinite loop:
>
> while true; do make test t=malloc/tst-mallocfork3-malloc-check; done
>
> An oops will then likely show up within several minutes.
>
> Though the oops is nondeterministic, I'm almost sure it's not a hardware
> stability issue because I get exactly the same stack trace in each
> oops message. I cannot easily rule out the possibility that the
> compiler miscompiles kernel code, though.
>
> I'm running 6.2-rc8 with the following patches from loongarch-next:
>
> ACPI: Define ACPI_MACHINE_WIDTH to 64 for LoongArch
> PCI: loongson: Improve the MRRS quirk for LS7A
> PCI: Add quirk for LS7A to avoid reboot failure
> irqchip/loongson-liointc: Save/restore int_edge/int_pol registers during S3/S4
> LoongArch: Add vector extensions support
> tools: Add LoongArch build infrastructure
> libbpf: Add LoongArch support to bpf_tracing.h
> selftests/seccomp: Add LoongArch selftesting support
> SH: cpuinfo: Fix a warning for CONFIG_CPUMASK_OFFSTACK
> LoongArch: Add CPU HWMon platform driver
>
> Any ideas on how to fix the issue, or suggestions for debugging it further?
>
> --
> Xi Ruoyao <xry111@xxxxxxxxxxx>
> School of Aerospace Science and Technology, Xidian University