Re: [syzbot] riscv/fixes boot error: can't ssh into the instance

From: Dmitry Vyukov
Date: Fri May 27 2022 - 03:17:24 EST


,On Thu, 26 May 2022 at 11:02, Atish Patra <atishp@xxxxxxxxxxxxxx> wrote:
> > On Mon, May 16, 2022 at 5:06 AM Alexandre Ghiti <alex@xxxxxxxx> wrote:
> > >
> > >
> > > On 5/12/22 13:48, Dmitry Vyukov wrote:
> > > > On Fri, 18 Feb 2022 at 14:45, Alexandre Ghiti
> > > > <alexandre.ghiti@xxxxxxxxxxxxx> wrote:
> > > >> Hi Aleksandr,
> > > >>
> > > >> On Thu, Feb 17, 2022 at 6:08 PM Aleksandr Nogikh <nogikh@xxxxxxxxxx> wrote:
> > > >>> Hi Alex,
> > > >>>
> > > >>> On Thu, Feb 17, 2022 at 5:53 PM Alexandre Ghiti
> > > >>> <alexandre.ghiti@xxxxxxxxxxxxx> wrote:
> > > >>>> Aleksandr,
> > > >>>>
> > > >>>> On Wed, Feb 16, 2022 at 5:58 PM Alexandre Ghiti
> > > >>>> <alexandre.ghiti@xxxxxxxxxxxxx> wrote:
> > > >>>>> First, thank you for working on this.
> > > >>>>>
> > > >>>>> On Wed, Feb 16, 2022 at 5:17 PM Aleksandr Nogikh <nogikh@xxxxxxxxxx> wrote:
> > > >>>>>> If I use just defconfig + DEBUG_VIRTUAL, without any KASAN, it begins
> > > >>>>>> to boot, but overwhelms me with tons of `virt_to_phys used for
> > > >>>>>> non-linear address:` errors.
> > > >>>>>>
> > > >>>>>> Like that
> > > >>>>>>
> > > >>>>>> [ 2.701271] virt_to_phys used for non-linear address:
> > > >>>>>> 00000000b59e31b6 (0xffffffff806c2000)
> > > >>>>>> [ 2.701727] WARNING: CPU: 0 PID: 1 at arch/riscv/mm/physaddr.c:16
> > > >>>>>> __virt_to_phys+0x7e/0x86
> > > >>>>>> [ 2.702207] Modules linked in:
> > > >>>>>> [ 2.702393] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W
> > > >>>>>> 5.17.0-rc1 #1
> > > >>>>>> [ 2.702806] Hardware name: riscv-virtio,qemu (DT)
> > > >>>>>> [ 2.703051] epc : __virt_to_phys+0x7e/0x86
> > > >>>>>> [ 2.703298] ra : __virt_to_phys+0x7e/0x86
> > > >>>>>> [ 2.703547] epc : ffffffff80008448 ra : ffffffff80008448 sp :
> > > >>>>>> ffff8f800021bde0
> > > >>>>>> [ 2.703977] gp : ffffffff80ed9b30 tp : ffffaf8001230000 t0 :
> > > >>>>>> ffffffff80eea56f
> > > >>>>>> [ 2.704704] t1 : ffffffff80eea560 t2 : 0000000000000000 s0 :
> > > >>>>>> ffff8f800021be00
> > > >>>>>> [ 2.705153] s1 : ffffffff806c2000 a0 : 000000000000004f a1 :
> > > >>>>>> ffffffff80e723d8
> > > >>>>>> [ 2.705555] a2 : 0000000000000010 a3 : fffffffffffffffe a4 :
> > > >>>>>> 0000000000000000
> > > >>>>>> [ 2.706027] a5 : 0000000000000000 a6 : 0000000000000005 a7 :
> > > >>>>>> ffffffffffffffff
> > > >>>>>> [ 2.706474] s2 : ffffffff80b80b08 s3 : 00000000000000c2 s4 :
> > > >>>>>> ffffffff806c2000
> > > >>>>>> [ 2.706891] s5 : ffffffff80edba10 s6 : ffffffff80edb960 s7 :
> > > >>>>>> 0000000000000001
> > > >>>>>> [ 2.707290] s8 : 00000000000000ff s9 : ffffffff80b80b40 s10:
> > > >>>>>> 00000000000000cc
> > > >>>>>> [ 2.707689] s11: ffffaf807e1fcf00 t3 : 0000000000000076 t4 :
> > > >>>>>> ffffffffffffffff
> > > >>>>>> [ 2.708092] t5 : 00000000000001f2 t6 : ffff8f800021bb48
> > > >>>>>> [ 2.708433] status: 0000000000000120 badaddr: 0000000000000000
> > > >>>>>> cause: 0000000000000003
> > > >>>>>> [ 2.708919] [<ffffffff8011416a>] free_reserved_area+0x72/0x19a
> > > >>>>>> [ 2.709296] [<ffffffff80003a5a>] free_initmem+0x6c/0x7c
> > > >>>>>> [ 2.709648] [<ffffffff805f60c8>] kernel_init+0x3a/0x10a
> > > >>>>>> [ 2.709993] [<ffffffff80002fda>] ret_from_exception+0x0/0xc
> > > >>>>>> [ 2.710310] ---[ end trace 0000000000000000 ]---
> > > >>>>>>
> > > >>>>> I was able to reproduce this: the first one regarding init_zero_pfn is
> > > >>>>> legit but not wrong, I have to check when it was introduced and how to
> > > >>>>> fix this.
> > > >>>>> Regarding the huge batch that follows, at first sight, I would say
> > > >>>>> this is linked to my sv48 patchset but that does not seem important as
> > > >>>>> the address is a kernel mapping address so the use of virt_to_phys is
> > > >>>>> right.
> > > >>>>>
> > > >>>>>> On Wed, Feb 16, 2022 at 5:09 PM Aleksandr Nogikh <nogikh@xxxxxxxxxx> wrote:
> > > >>>>>>> On Wed, Feb 16, 2022 at 12:56 PM Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
> > > >>>>>>>> On Wed, 16 Feb 2022 at 12:47, Aleksandr Nogikh <nogikh@xxxxxxxxxx> wrote:
> > > >>>>>>>>> On Wed, Feb 16, 2022 at 11:37 AM Aleksandr Nogikh <nogikh@xxxxxxxxxx> wrote:
> > > >>>>>>>>>> Hi Alex,
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Wed, Feb 16, 2022 at 5:14 AM Alexandre Ghiti <alex@xxxxxxxx> wrote:
> > > >>>>>>>>>>> Hi Dmitry,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On 2/15/22 18:12, Dmitry Vyukov wrote:
> > > >>>>>>>>>>>> On Wed, 2 Feb 2022 at 14:18, Alexandre Ghiti
> > > >>>>>>>>>>>> <alexandre.ghiti@xxxxxxxxxxxxx> wrote:
> > > >>>>>>>>>>>>> Hi Aleksandr,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On Wed, Feb 2, 2022 at 12:08 PM Aleksandr Nogikh <nogikh@xxxxxxxxxx> wrote:
> > > >>>>>>>>>>>>>> Hello,
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> syzbot has already not been able to fuzz its RISC-V instance for 97
> > > >>>>>>>>>>>>> That's a longtime, I'll take a look more regularly.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> days now because the compiled kernel cannot boot. I bisected the issue
> > > >>>>>>>>>>>>>> to the following commit:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> commit 54c5639d8f507ebefa814f574cb6f763033a72a5
> > > >>>>>>>>>>>>>> Author: Alexandre Ghiti <alexandre.ghiti@xxxxxxxxxxxxx>
> > > >>>>>>>>>>>>>> Date: Fri Oct 29 06:59:27 2021 +0200
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> riscv: Fix asan-stack clang build
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Apparently, the problem appears on GCC-built RISC-V kernels with KASAN
> > > >>>>>>>>>>>>>> enabled. In the previous message syzbot mentions
> > > >>>>>>>>>>>>>> "riscv64-linux-gnu-gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU
> > > >>>>>>>>>>>>>> Binutils for Debian) 2.35.2", but the issue also reproduces finely on
> > > >>>>>>>>>>>>>> a newer GCC compiler: "riscv64-linux-gnu-gcc (Debian 11.2.0-10)
> > > >>>>>>>>>>>>>> 11.2.0, GNU ld (GNU Binutils for Debian) 2.37".
> > > >>>>>>>>>>>>>> For convenience, I also duplicate the .config file from the bot's
> > > >>>>>>>>>>>>>> message: https://syzkaller.appspot.com/x/.config?x=522544a2e0ef2a7d
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Can someone with KASAN and RISC-V expertise please take a look?
> > > >>>>>>>>>>>>> I'll take a look at that today.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Thanks for reporting the issue,
> > > >>>>>>>>>>> I took a quick look, not enough to fix it but I know the issue comes
> > > >>>>>>>>>>> from the inline instrumentation, I have no problem with the outline
> > > >>>>>>>>>>> instrumentation. I need to find some cycles to work on this, my goal is
> > > >>>>>>>>>>> to fix this for 5.17.
> > > >>>>>>>>>> Thanks for the update!
> > > >>>>>>>>>>
> > > >>>>>>>>>> Can you please share the .config with which you tested the outline
> > > >>>>>>>>>> instrumentation?
> > > >>>>>>>>>> I updated the syzbot config to use KASAN_OUTLINE instead of KASAN_INLINE,
> > > >>>>>>>>>> but it still does not boot :(
> > > >>>>>>>>>>
> > > >>>>>>>>>> Here's what I used:
> > > >>>>>>>>>> https://gist.github.com/a-nogikh/279c85c2d24f47efcc3e865c08844138
> > > >>>>>>>>> Update: it doesn't boot with that big config, but boots if I generate
> > > >>>>>>>>> a simple one with KASAN_OUTLINE:
> > > >>>>>>>>>
> > > >>>>>>>>> make defconfig ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu-
> > > >>>>>>>>> ./scripts/config -e KASAN -e KASAN_OUTLINE
> > > >>>>>>>>> make olddefconfig ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu-
> > > >>>>>>>>>
> > > >>>>>>>>> And it indeed doesn't work if I use KASAN_INLINE.
> > > >>>>>>>> It may be an issue with code size. Full syzbot config + KASAN + KCOV
> > > >>>>>>>> produce hugely massive .text. It may be hitting some limitation in the
> > > >>>>>>>> bootloader/kernel bootstrap code.
> > > >>>>> I took a quick glance and it traps on a KASAN address that is not
> > > >>>>> mapped, either because it is too soon or because the mapping failed
> > > >>>>> somehow.
> > > >>>>>
> > > >>>>> I'll definitely dive into that tomorrow, sorry for being slow here and
> > > >>>>> thanks again for all your work, that helps a lot.
> > > >>>>>
> > > >>>>> Thanks,
> > > >>>>>
> > > >>>>> Alex
> > > >>>>>
> > > >>>>>>> I bisected the difference between the config we use on syzbot and the
> > > >>>>>>> simple one that was generated like I described above.
> > > >>>>>>> Turns out that it's the DEBUG_VIRTUAL config that makes the difference.
> > > >>>>>>>
> > > >>>>>>> make defconfig ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu-
> > > >>>>>>> ./scripts/config -e KASAN -e KASAN_OUTLINE -e DEBUG_VIRTUAL
> > > >>>>>>> make olddefconfig ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu-
> > > >>>>>>>
> > > >>>>>>> And the resulting kernel does not boot.
> > > >>>>>>> My env: the `riscv/fixes` branch, commit
> > > >>>>>>> 6df2a016c0c8a3d0933ef33dd192ea6606b115e3, qemu 6.2.0.
> > > >>>> I fixed a few things today: KASAN + SPARSE_VMEMMAP, DEBUG_VIRTUAL and
> > > >>>> maybe KASAN + KCOV.
> > > >>>>
> > > >>>> With those small fixes, I was able to boot your large dotconfig with
> > > >>>> KASAN_OUTLINE, the inline version still fails, this is my next target
> > > >>>> :)
> > > >>>> I'll push that tomorrow!
> > > >>> Awesome, thank you very much!
> > > >>> Looking forward to finally seeing the instance run :)
> > > >> I sent a patchset which should fix your config with *outline* instrumentation.
> > > > Was this fix merged? The riscv instance still does not boot:
> > > > https://syzkaller.appspot.com/bug?id=5f2ff52ad42cba9f222202219baebd4e63e35127
> > >
> > >
> > > Yes it has been in Linus tree since 5.18-rc1. I'll take a look at that
> > > this week.
> > >
> >
> > Are you seeing this error or a different one ? I used the
> > syzkaller_defconfig from the patch below on v5.18.
> >
> > https://lore.kernel.org/all/20220419174952.699-1-palmer@xxxxxxxxxxxx/
> >
> > [ 15.076116][ T1] Mandatory Access Control activated.
> > [ 15.158241][ T1] AppArmor: AppArmor Filesystem Enabled
> > [ 16.150870][ T1] NET: Registered PF_INET protocol family
> > [ 16.166167][ T1] IP idents hash table entries: 32768 (order: 6,
> > 262144 bytes, linear)
> > [ 16.188727][ T1] Unable to handle kernel paging request at
> > virtual address ffebfffeffff2000
> > [ 16.192727][ T1] Oops [#1]
> > [ 16.193479][ T1] Modules linked in:
> > [ 16.194687][ T1] CPU: 3 PID: 1 Comm: swapper/0 Not tainted
> > 5.18.0-00001-g37ac279268bf-dirty #9
> > [ 16.196486][ T1] Hardware name: riscv-virtio,qemu (DT)
> > [ 16.197836][ T1] epc : kasan_check_range+0x9e/0x14e
> > [ 16.199104][ T1] ra : memset+0x1e/0x4c
> > [ 16.200091][ T1] epc : ffffffff804787e0 ra : ffffffff80478f30 sp
> > : ff600000073ffb70
> > [ 16.201420][ T1] gp : ffffffff85879e80 tp : ff600000073f0000 t0
> > : 7300000000000000
> > [ 16.202762][ T1] t1 : ffebfffeffff21ff t2 : 73746e6564692050 s0
> > : ff600000073ffba0
> > [ 16.204047][ T1] s1 : 0000000000001000 a0 : ffebfffeffff2200 a1
> > : 0000000000001000
> > [ 16.205312][ T1] a2 : 0000000000000001 a3 : ffffffff803a4f32 a4
> > : ff5ffffffff90000
> > [ 16.206592][ T1] a5 : ffebfffeffff2000 a6 : 0000004000000000 a7
> > : ff5ffffffff90fff
> > [ 16.207865][ T1] s2 : ff5ffffffff90000 s3 : 0000000000000000 s4
> > : ffffffff8467ea90
> > [ 16.209134][ T1] s5 : 0000000000000000 s6 : ff5ffffffff90000 s7
> > : 0000000000000000
> > [ 16.210394][ T1] s8 : 0000000000001000 s9 : ffffffff8587ca40
> > s10: 0000000000000004
> > [ 16.211952][ T1] s11: ffffffff858a03a0 t3 : 0000000000000000 t4
> > : 0000000000000040
> > [ 16.213469][ T1] t5 : ffebfffeffff2200 t6 : ff600000073ff738
> > [ 16.214853][ T1] status: 0000000200000120 badaddr:
> > ffebfffeffff2000 cause: 000000000000000d
> > [ 16.216910][ T1] Call Trace:
> > [ 16.217816][ T1] [<ffffffff803a4f32>] pcpu_alloc+0x844/0x1254
> > [ 16.219110][ T1] [<ffffffff803a59a0>] __alloc_percpu+0x28/0x34
> > [ 16.220244][ T1] [<ffffffff8328824a>] ip_rt_init+0x17e/0x382
> > [ 16.221606][ T1] [<ffffffff8328861c>] ip_init+0x18/0x30
> > [ 16.222719][ T1] [<ffffffff8328a0ee>] inet_init+0x2a6/0x550
> > [ 16.223863][ T1] [<ffffffff80003204>] do_one_initcall+0x130/0x7dc
> > [ 16.225002][ T1] [<ffffffff83201fbc>] kernel_init_freeable+0x510/0x5b4
> > [ 16.226273][ T1] [<ffffffff8319842a>] kernel_init+0x28/0x21c
> > [ 16.227337][ T1] [<ffffffff80005818>] ret_from_exception+0x0/0x10
> > [ 16.229910][ T1] ---[ end trace 0000000000000000 ]---
> > [ 16.231880][ T1] Kernel panic - not syncing: Fatal exception
> >
> >
>
> Enabling CONFIG_KASAN_VMALLOC=y solves the issue and I am able to boot
> to the userspace.
> I have tried enabling/disabling CONFIG_VMAP_STACK as well. Both works fine.
>
> Looking at the ARM64 Kconfig, KASAN_VMALLOC is enabled if KASAN is enabled.
> This diff seems to work for me.
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index 00fd9c548f26..cbf0fe227c77 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -122,6 +122,7 @@ config RISCV
> select TRACE_IRQFLAGS_SUPPORT
> select UACCESS_MEMCPY if !MMU
> select ZONE_DMA32 if 64BIT
> + select KASAN_VMALLOC if KASAN
>
> I am not a kasan expert so I am not sure if this is the correct fix or
> just hides the real issue. pcpu_alloc seems to use vmalloc though.

Hi Atish,

Enabling KASAN_VMALLOC in syzbot config actually makes it boot for me!
It's probably good to enforce it and/or give better diagnostics and/or
fix KASAN w/o KASAN_VMALLOC, but I've sent a fix up for syzbot config
for now:
https://github.com/google/syzkaller/pull/3167
Thanks for looking into it.