Re: Regression: memory corruption on Atmel SAMA5D31

From: Peter Rosin
Date: Fri Mar 04 2022 - 01:57:47 EST


On 2022-03-04 04:55, Saravana Kannan wrote:
> On Thu, Mar 3, 2022 at 1:17 AM Peter Rosin <peda@xxxxxxxxxx> wrote:
>>
>> On 2022-03-03 04:02, Saravana Kannan wrote:
>>> On Wed, Mar 2, 2022 at 4:29 PM Peter Rosin <peda@xxxxxxxxxx> wrote:
>>>>
>>>> Hi!
>>>>
>>>> I'm seeing a weird problem, and I'd like some help with further
>>>> things to try in order to track down what's going on. I have
>>>> bisected the issue to
>>>>
>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>
>>> I skimmed through your email and I'll read it more closely tomorrow,
>>> but it wasn't clear if you see this on Linus's tip of the tree too.
>>> Asking because of:
>>> https://lore.kernel.org/lkml/20210930085714.2057460-1-yangyingliang@xxxxxxxxxx/
>>>
>>> Also, a couple of other data points that _might_ help. Try kernel
>>> command line option fw_devlink=permissive vs fw_devlink=on (I forget
>>> if this was the default by 5.10) vs fw_devlink=off.
>>>
>>> I'm expecting "off" to fix the issue for you. But if permissive vs on
>>> shows a difference driver issues would start becoming a real
>>> possibility.
>>>
>>> -Saravana
>>
>> Thanks for the quick reply! I don't think I tested the very tip of
>> Linus tree before, only latest rc or something like that, but now I
>> have. I.e.
>>
>> 5859a2b19911 ("Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
>>
>> It would have been typical if an issue that existed for a couple of
>> years had been fixed the last few weeks, but alas, no.
>>
>> On that kernel, and with whatever the default fw_devlink value is, the
>
> It's fw_devlink=on by default from at least 5.12-rc4 or so.
>
>> issue is there. It's a bit hard to tell if the incident probability
>> is the same when trying fw_devlink arguments, but roughly so, and I
>> do not have to wait for long to get a bad hash with the first
>> reproducer
>>
>> while :; do cat testfile | sha256sum; done
>>
>> The output is typical:
>> 78464c59faa203413aceb5f75de85bbf4cde64f21b2d0449a2d72cd2aadac2a3 -
>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>> e03c5524ac6d16622b6c43f917aae730bc0793643f461253c4646b860c1a7215 -
>> 1b8db6218f481cb8e4316c26118918359e764cc2c29393fd9ef4f2730274bb00 -
>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>> 7d60bf848911d3b919d26941be33c928c666e9e5666f392d905af2d62d400570 -
>> 212e1fe02c24134857ffb098f1834a2d87c655e0e5b9e08d4929f49a070be97c -
>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>> 7e33e751eb99a0f63b4f7d64b0a24f3306ffaf7c4bc4b27b82e5886c8ea31bc3 -
>> d7a1f08aa9d0374d46d828fc3582f5927e076ff229b38c28089007cd0599c645 -
>> 4fc963b7c7b14df9d669500f7c062bf378ff2751f705bb91eecd20d2f896f6fe -
>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>> 9360d886046c12d983b8bc73dd22302c57b0aafe58215700604fa977b4715fbe -
>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>
>> Setting fw_devlink=off makes no difference, AFAICT.
>
> By this, I'm assuming you set fw_devlink=off in the kernel command
> line and you still saw the corruption.

Yes. On a bad kernel it's the same with all of the following kernel
command lines.

console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=on ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)

console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=off ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)

console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=permissive ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)

> If that's the case, I can't see how this could possibly have anything
> to do with:
> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>
> If you look at fw_devlink_link_device(), you'll see that the function
> is NOP if fw_devlink=off (the !fw_devlink_flags check). And from
> there, the rest of the code in the series doesn't run because more
> fields wouldn't get set, etc. That pretty much disables ALL the code
> in the entire series. The only remaining diff would be header file
> changes where I add/remove fields. But that's unlikely to cause any
> issues here because I'm either deleting fields that aren't used or
> adding fields that won't be used (with fw_devlink=off). I think the
> patch was just causing enough timing changes that it's masking the
> real issue.

When I compare fw_devlink_link_device() from before and after
f9aa460672c9 ("driver core: Refactor fw_devlink feature")
I notice that you also removed an unconditional call to
device_link_add_missing_supplier_links() that was live before,
regardless of any fw_devlink parameter.

I don't know if that's relevant. Is it?

Not knowing this code at all, and without any serious attempt
at reading it, from here the comment of that removed function
sure looks like it might cause a different ordering before and
after the patch that is not restored with any fw_devlink
argument.

> IIRC (it's been more than a year), the series [1] that brings in this
> patch has a few reverts. Those reverts undo subtle device probe
> ordering changes brought in by a bunch of earlier patches. You could
> go back to before those patches were added and see if you still see
> this corruption and then start bisecting from there. Basically try
> going to a point before:
> 42926ac3cd50 ("driver core: Move code to the right part of the file")

That patch was added after 5.7-rc5, so just to make sure, I have now
also tested 5.6. As expected, it looks like a good kernel from here.
It's been running while I have written this mail and has consistently
produced good hashes.

I arrived at the bad patch by first noticing that 5.15.6 was bad and
that 4.14 was good. I then did a manual preliminary bisect-like
thing and concluded that 5.1 was good, 5.8 was good, 5.11 was bad,
and that 5.10 was good (I think that was the order anyway, not that
it matters all that much). I then did a "proper" bisect between 5.10
and 5.11.

$ git bisect log
git bisect start
# good: [2c85ebc57b3e1817b6ce1a6b703928e113a90442] Linux 5.10
git bisect good 2c85ebc57b3e1817b6ce1a6b703928e113a90442
# bad: [f40ddce88593482919761f74910f42f4b84c004b] Linux 5.11
git bisect bad f40ddce88593482919761f74910f42f4b84c004b
# bad: [538fcf57aaee6ad78a05f52b69a99baa22b33418] Merge branches 'acpi-scan', 'acpi-pnp' and 'acpi-sleep'
git bisect bad 538fcf57aaee6ad78a05f52b69a99baa22b33418
# good: [15b447361794271f4d03c04d82276a841fe06328] mm/lru: revise the comments of lru_lock
git bisect good 15b447361794271f4d03c04d82276a841fe06328
# good: [d635a69dd4981cc51f90293f5f64268620ed1565] Merge tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
git bisect good d635a69dd4981cc51f90293f5f64268620ed1565
# bad: [2911ed9f47b47cb5ab87d03314b3b9fe008e607f] Merge tag 'char-misc-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
git bisect bad 2911ed9f47b47cb5ab87d03314b3b9fe008e607f
# good: [c367caf1a38b6f0a1aababafd88b00fefa625f9e] Merge tag 'sound-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect good c367caf1a38b6f0a1aababafd88b00fefa625f9e
# good: [93f998879cd95b3e4f2836e7b17d6d5ae035cf90] Merge tag 'extcon-next-for-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/extcon into char-misc-next
git bisect good 93f998879cd95b3e4f2836e7b17d6d5ae035cf90
# good: [b5206275b46c30a8236feb34a1dc247fa3683d83] usb: typec: tcpm: convert comma to semicolon
git bisect good b5206275b46c30a8236feb34a1dc247fa3683d83
# good: [9e1792727ead477f49958578d0dbd466a7deea48] tty: use const parameters in port-flag accessors
git bisect good 9e1792727ead477f49958578d0dbd466a7deea48
# good: [157f809894f3cf8e62b4011915a00398603215c9] Merge tag 'tty-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
git bisect good 157f809894f3cf8e62b4011915a00398603215c9
# good: [25ac86c6dbe62fba9b97e997fa648cdbe2d40173] driver core: Use device's fwnode to check if it is waiting for suppliers
git bisect good 25ac86c6dbe62fba9b97e997fa648cdbe2d40173
# bad: [9c30921fe7994907e0b3e0637b2c8c0fc4b5171f] driver core: platform: use bus_type functions
git bisect bad 9c30921fe7994907e0b3e0637b2c8c0fc4b5171f
# bad: [5b6164d3465fcc13b5679c860c452963443172a7] driver core: Reorder devices on successful probe
git bisect bad 5b6164d3465fcc13b5679c860c452963443172a7
# good: [e82a840cb1c1c83d01a9b81bb63b6cf1c09239d7] efi: Update implementation of add_links() to create fwnode links
git bisect good e82a840cb1c1c83d01a9b81bb63b6cf1c09239d7
# bad: [2d09e6eb4a6f20273959f4905ccf009da8c64c7a] driver core: Delete pointless parameter in fwnode_operations.add_links
git bisect bad 2d09e6eb4a6f20273959f4905ccf009da8c64c7a
# bad: [f9aa460672c9c56896cdc12a521159e3e67000ba] driver core: Refactor fw_devlink feature
git bisect bad f9aa460672c9c56896cdc12a521159e3e67000ba
# first bad commit: [f9aa460672c9c56896cdc12a521159e3e67000ba] driver core: Refactor fw_devlink feature

Since I need drivers that was added for 5.11, and it was easy
to revert there, I landed at 5.11.22. And while that seems
workable at the moment, it's of course not at all where I want
to be.

Since then, I have tried a fair few kernels after 5.11, and
they have all been bad. I'm sad so say that I have not kept a
log of exactly which ones though.

> TL;DR: is that since you are reproducing this with fw_devlink=off, I'm
> pretty sure the problem is not actually because of my changes or any
> changes related to fw_devlink.

I too don't get it, but it's a little bit too consistent with
everything pointing at this one patch across so many changes.
Nothing is good after this patch, and it all behaves a little
bit to similar across the bad kernels for it to be some subtle
timing issue. Methinks. But maybe I just need to stumble on
to some later good kernel. Not holding my breath though...

But it does seem related to interrupts, as I mentioned in the
original mail, I can take a bad kernel and reduce the interrupt
pressure from USB from slightly more than 1kHz down to a
trickle and things behave much better when it comes to sha256sum.
Copying with scp might cause network interrupts, so the two
reproducers I have are perhaps quite similar? If that's the
case, then trigger would be page cache churn, interrupts and a
fair bit of CPU usage (calculating hashes or encrypting).

Cheers,
Peter