Re: Regression: memory corruption on Atmel SAMA5D31

From: Saravana Kannan
Date: Fri Mar 04 2022 - 15:13:22 EST


On Fri, Mar 4, 2022 at 2:57 AM Peter Rosin <peda@xxxxxxxxxx> wrote:
>
> On 2022-03-04 07:57, Peter Rosin wrote:
> > On 2022-03-04 04:55, Saravana Kannan wrote:
> >> On Thu, Mar 3, 2022 at 1:17 AM Peter Rosin <peda@xxxxxxxxxx> wrote:
> >>>
> >>> On 2022-03-03 04:02, Saravana Kannan wrote:
> >>>> On Wed, Mar 2, 2022 at 4:29 PM Peter Rosin <peda@xxxxxxxxxx> wrote:
> >>>>>
> >>>>> Hi!
> >>>>>
> >>>>> I'm seeing a weird problem, and I'd like some help with further
> >>>>> things to try in order to track down what's going on. I have
> >>>>> bisected the issue to
> >>>>>
> >>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
> >>>>
> >>>> I skimmed through your email and I'll read it more closely tomorrow,
> >>>> but it wasn't clear if you see this on Linus's tip of the tree too.
> >>>> Asking because of:
> >>>> https://lore.kernel.org/lkml/20210930085714.2057460-1-yangyingliang@xxxxxxxxxx/
> >>>>
> >>>> Also, a couple of other data points that _might_ help. Try kernel
> >>>> command line option fw_devlink=permissive vs fw_devlink=on (I forget
> >>>> if this was the default by 5.10) vs fw_devlink=off.
> >>>>
> >>>> I'm expecting "off" to fix the issue for you. But if permissive vs on
> >>>> shows a difference driver issues would start becoming a real
> >>>> possibility.
> >>>>
> >>>> -Saravana
> >>>
> >>> Thanks for the quick reply! I don't think I tested the very tip of
> >>> Linus tree before, only latest rc or something like that, but now I
> >>> have. I.e.
> >>>
> >>> 5859a2b19911 ("Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
> >>>
> >>> It would have been typical if an issue that existed for a couple of
> >>> years had been fixed the last few weeks, but alas, no.
> >>>
> >>> On that kernel, and with whatever the default fw_devlink value is, the
> >>
> >> It's fw_devlink=on by default from at least 5.12-rc4 or so.
> >>
> >>> issue is there. It's a bit hard to tell if the incident probability
> >>> is the same when trying fw_devlink arguments, but roughly so, and I
> >>> do not have to wait for long to get a bad hash with the first
> >>> reproducer
> >>>
> >>> while :; do cat testfile | sha256sum; done
> >>>
> >>> The output is typical:
> >>> 78464c59faa203413aceb5f75de85bbf4cde64f21b2d0449a2d72cd2aadac2a3 -
> >>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> >>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> >>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> >>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> >>> e03c5524ac6d16622b6c43f917aae730bc0793643f461253c4646b860c1a7215 -
> >>> 1b8db6218f481cb8e4316c26118918359e764cc2c29393fd9ef4f2730274bb00 -
> >>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> >>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> >>> 7d60bf848911d3b919d26941be33c928c666e9e5666f392d905af2d62d400570 -
> >>> 212e1fe02c24134857ffb098f1834a2d87c655e0e5b9e08d4929f49a070be97c -
> >>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> >>> 7e33e751eb99a0f63b4f7d64b0a24f3306ffaf7c4bc4b27b82e5886c8ea31bc3 -
> >>> d7a1f08aa9d0374d46d828fc3582f5927e076ff229b38c28089007cd0599c645 -
> >>> 4fc963b7c7b14df9d669500f7c062bf378ff2751f705bb91eecd20d2f896f6fe -
> >>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> >>> 9360d886046c12d983b8bc73dd22302c57b0aafe58215700604fa977b4715fbe -
> >>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
> >>>
> >>> Setting fw_devlink=off makes no difference, AFAICT.
> >>
> >> By this, I'm assuming you set fw_devlink=off in the kernel command
> >> line and you still saw the corruption.
> >
> > Yes. On a bad kernel it's the same with all of the following kernel
> > command lines.
> >
> > console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=on ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
> >
> > console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=off ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
> >
> > console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=permissive ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
> >
> >> If that's the case, I can't see how this could possibly have anything
> >> to do with:
> >> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
> >>
> >> If you look at fw_devlink_link_device(), you'll see that the function
> >> is NOP if fw_devlink=off (the !fw_devlink_flags check). And from
> >> there, the rest of the code in the series doesn't run because more
> >> fields wouldn't get set, etc. That pretty much disables ALL the code
> >> in the entire series. The only remaining diff would be header file
> >> changes where I add/remove fields. But that's unlikely to cause any
> >> issues here because I'm either deleting fields that aren't used or
> >> adding fields that won't be used (with fw_devlink=off). I think the
> >> patch was just causing enough timing changes that it's masking the
> >> real issue.
> >
> > When I compare fw_devlink_link_device() from before and after
> > f9aa460672c9 ("driver core: Refactor fw_devlink feature")
> > I notice that you also removed an unconditional call to
> > device_link_add_missing_supplier_links() that was live before,
> > regardless of any fw_devlink parameter.
> >
> > I don't know if that's relevant. Is it?
> >
> > Not knowing this code at all, and without any serious attempt
> > at reading it, from here the comment of that removed function
> > sure looks like it might cause a different ordering before and
> > after the patch that is not restored with any fw_devlink
> > argument.
>
> It appears that the device_link_add_missing_supplier_links() difference
> is not relevant after all. What actually happened in the header file in
> the "bad" commit was that two fields were removed (none added). Like so:
>
> struct dev_links_info {
> struct list_head suppliers;
> struct list_head consumers;
> - struct list_head needs_suppliers;
> struct list_head defer_sync;
> - bool need_for_probe;
> enum dl_dev_state status;
> };
>
> If I restore those fields on a bad kernel, the issue is no longer
> visible. That is true for the first bad kernel, i.e.

Ha... I thought this might be a possibility but I wasn't sure. Which
is why I kinda left it at:
"The only remaining diff would be header file
changes where I add/remove fields. But that's unlikely to cause any
issues here because I'm either deleting fields that aren't used or
adding fields that won't be used (with fw_devlink=off)."

Ok, at this point I'm going to ignore this thread. Call me out
explicitly if you want me to pay attention :)

-Saravana