Re: PROBLEM: sparc64 random crashes starting w/ Linux 6.1 (regression)

From: Linux kernel regression tracking (#adding)
Date: Mon Jan 30 2023 - 04:50:16 EST


[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 29.01.23 03:17, Nick Bowler wrote:
>
> Starting with Linux 6.1.y, my sparc64 (Sun Ultra 60) system is very
> unstable, with userspace processes randomly crashing with all kinds of
> different weird errors. The same problem occurs on 6.2-rc5. Linux
> 6.0.y is OK.
>
> Usually, it manifests with ssh connections just suddenly dropping out
> like this:
>
> malloc(): unaligned tcache chunk detected
> Connection to alectrona closed.
>
> but other kinds of failures (random segfaults, bus errors, etc.) are
> seen too.
>
> I have never seen the kernel itself oops or anything like that; there
> are no abnormal kernel log messages of any kind, except for the normal
> ones that get printed when processes segfault, like this one:
>
> [ 563.085851] zsh[2073]: segfault at 10 ip 00000000f7a7c09c (rpc 00000000f7a7c0a0) sp 00000000ff8f5e08 error 1 in libc.so.6[f7960000+1b2000]
>
> I was able to reproduce this fairly reliably by using GNU ddrescue to
> dump a disk from the dvd drive -- things usually go awry after a minute
> or two. So I was able to bisect to this commit:
>
> 2e3468778dbe3ec389a10c21a703bb8e5be5cfbc is the first bad commit
> commit 2e3468778dbe3ec389a10c21a703bb8e5be5cfbc
> Author: Peter Xu <peterx@xxxxxxxxxx>
> Date: Thu Aug 11 12:13:29 2022 -0400
>
> mm: remember young/dirty bit for page migrations
>
> This does not revert cleanly on master, but I ran my test on the
> immediately preceding commit (0ccf7f168e17: "mm/thp: carry over dirty
> bit when thp splits on pmd") extra times and I am unable to get this
> one to crash, so I am reasonably confident in this bisection result.
>
> Let me know if you need any more info!

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced 2e3468778dbe3ec3
#regzbot title sparc64: random crashes
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it is already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.