Re: d_lookup: Unable to handle kernel paging request

From: Will Deacon
Date: Tue Jun 25 2019 - 05:46:08 EST


[+Marc]

Hi again, Vicente,

On Mon, Jun 24, 2019 at 12:47:41PM +0100, Will Deacon wrote:
> On Sat, Jun 22, 2019 at 08:02:19PM +0200, Vicente Bergas wrote:
> > Hi Al,
> > I think I have a hint of what is going on.
> > With the last kernel built with your sentinels at hlist_bl_*lock
> > it is very easy to reproduce the issue.
> > In fact it is so unstable that I had to connect a serial port
> > in order to save the kernel trace.
> > Unfortunately all the traces are at different addresses and
> > your sentinel did not trigger.
> >
> > Now i am writing this email from that same buggy kernel, which is
> > v5.2-rc5-224-gbed3c0d84e7e.
> >
> > The difference is that I changed the bootloader.
> > Before, I was booting 5.1.12 and then kexec'ing into this one.
> > Now I am booting from u-boot into this one.
> > I will continue booting with u-boot for some time to be sure it is
> > stable and confirm this is the cause.
> >
> > If it is, which is the more probable offender:
> > the kernel before kexec or the kernel after?
>
> Has kexec ever worked reliably on this board? If you used to kexec
> successfully, then we can try to hunt down the regression using memtest.
> If you kexec into a problematic kernel with CONFIG_MEMTEST=y and pass
> "memtest=17" on the command-line, it will hopefully reveal any active
> memory corruption.
>
> My first thought is that there is ongoing DMA which corrupts the dentry
> hash. The rk3399 SoC also has an IOMMU, which could contribute to the fun
> if it's not shut down correctly (i.e. if it enters bypass mode).
>
> > The original report was sent to you because you appeared as the maintainer
> > of fs/dcache.c, which appeared on the trace. Should this be redirected
> > somewhere else now?
>
> linux-arm-kernel@xxxxxxxxxxxxxxxxxxx
>
> Probably worth adding Heiko Stuebner <heiko@xxxxxxxxx> to cc.

Before you rush over to LAKML, please could you provide your full dmesg
output from the kernel that was crashing (i.e. the dmesg you see in the
kexec'd kernel)? We've got a theory that the issue may be related to the
interrupt controller, and the dmesg output should help to establish whether
that is plausible or not.

Thanks,

Will
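
For readers unfamiliar with the memtest suggestion quoted above, here is a
minimal userspace sketch in C of the write-then-verify idea behind a pattern
memory test. It is not the kernel's memtest code; the function names
pattern_fill() and pattern_verify() and the test pattern are assumptions made
up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write the same 64-bit pattern to every word. */
static void pattern_fill(uint64_t *buf, size_t nwords, uint64_t pattern)
{
    for (size_t i = 0; i < nwords; i++)
        buf[i] = pattern;
}

/*
 * Hypothetical helper: read every word back and count how many no
 * longer hold the pattern. A nonzero result means the buffer was
 * modified after pattern_fill() ran.
 */
static size_t pattern_verify(const uint64_t *buf, size_t nwords,
                             uint64_t pattern)
{
    size_t bad = 0;

    for (size_t i = 0; i < nwords; i++)
        if (buf[i] != pattern)
            bad++;
    return bad;
}
```

Calling pattern_fill() and then pattern_verify() on an untouched buffer
returns 0. A nonzero count means something wrote to the buffer in between,
which is the kind of symptom that ongoing DMA into memory would produce.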