PROBLEM: sparc64 random crashes starting w/ Linux 6.1 (regression)

From: Nick Bowler
Date: Sat Jan 28 2023 - 21:17:43 EST


Hi,

Starting with Linux 6.1.y, my sparc64 (Sun Ultra 60) system is very
unstable, with userspace processes randomly crashing with all kinds of
different weird errors. The same problem occurs on 6.2-rc5. Linux
6.0.y is OK.

Usually, it manifests with ssh connections just suddenly dropping out
like this:

malloc(): unaligned tcache chunk detected
Connection to alectrona closed.

but other kinds of failures (random segfaults, bus errors, etc.) are
seen too.

I have never seen the kernel itself oops or anything like that, and
there are no abnormal kernel log messages of any kind, except for the
normal ones that get printed when processes segfault, like this one:

[ 563.085851] zsh[2073]: segfault at 10 ip 00000000f7a7c09c (rpc 00000000f7a7c0a0) sp 00000000ff8f5e08 error 1 in libc.so.6[f7960000+1b2000]
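
For anyone who wants to look up the faulting instruction, the address in
a message like this can be turned into an offset within the libc
mapping. The following is only a minimal sketch in Python: it assumes
the bracketed pair is the start address and size of the mapping that
contains the instruction pointer, and fault_offset is a hypothetical
helper name, not part of any existing tool. The result is relative to
the start of that mapping, which is not necessarily the ELF load base.

```python
import re

# The log message from above, joined onto one line, timestamp dropped.
LINE = (
    "zsh[2073]: segfault at 10 ip 00000000f7a7c09c "
    "(rpc 00000000f7a7c0a0) sp 00000000ff8f5e08 error 1 in "
    "libc.so.6[f7960000+1b2000]"
)

# Assumed layout: "ip <hex> ... in <object>[<start>+<size>]".
PATTERN = re.compile(
    r"ip (?P<ip>[0-9a-f]+) .* in "
    r"(?P<obj>\S+)\[(?P<start>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offset(line):
    """Return (object name, offset of ip from the start of its mapping)."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a recognised segfault line")
    ip = int(m.group("ip"), 16)
    start = int(m.group("start"), 16)
    size = int(m.group("size"), 16)
    offset = ip - start
    # Sanity check: the instruction pointer should lie inside the mapping.
    if not 0 <= offset < size:
        raise ValueError("ip is outside the reported mapping")
    return m.group("obj"), offset


obj, off = fault_offset(LINE)
print(f"{obj}+{off:#x}")  # prints "libc.so.6+0x11c09c"
```

Since the failures look random, this offset is unlikely to point at a
real libc bug; it is mainly useful for comparing crashes against each
other.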

I was able to reproduce this fairly reliably by using GNU ddrescue to
dump a disc from the DVD drive -- things usually go awry after a minute
or two. So I was able to bisect to this commit:

2e3468778dbe3ec389a10c21a703bb8e5be5cfbc is the first bad commit
commit 2e3468778dbe3ec389a10c21a703bb8e5be5cfbc
Author: Peter Xu <peterx@xxxxxxxxxx>
Date: Thu Aug 11 12:13:29 2022 -0400

mm: remember young/dirty bit for page migrations

This does not revert cleanly on master, but I ran my test extra times
on the immediately preceding commit (0ccf7f168e17: "mm/thp: carry over
dirty bit when thp splits on pmd") and was unable to get it to crash,
so I am reasonably confident in this bisection result.

Let me know if you need any more info!

Thanks,
Nick