[FYI] GCC segfaults under heavy multithreaded compilation with AMD Ryzen

From: Satoru Takeuchi
Date: Tue Jul 25 2017 - 17:54:08 EST


# I'm an LKML subscriber, but not an x86 list subscriber.

I found the following new Linux kernel bugzilla entry about a Ryzen-related
problem. Since many developers don't check this bugzilla and I've also
encountered this problem myself, I decided to bring it up here.

https://bugzilla.kernel.org/show_bug.cgi?id=196481:
> I am running Ubuntu and installed the mainline kernel from the mainline PPA.
> It seems like the Ryzen processor has some bug that leads to gcc crashing
> when compiling a very large program under heavy load. This is easily reproduced
> in my system using the script from
>
> https://github.com/suaefar/ryzen-test
>
> (It assumes that you are running Ubuntu, maybe Debian also works. Just clone it and run the
> script kill_ryzen.sh. It downloads the gcc 7.1 code and start multiple compilations of it. If any
> compilations fails its warns the user giving the time to detect failure).
>
> There is already a bug report about this in the FreeBSD bugzilla
> (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219399#c89).
> There is also a thread on the subject in AMD community forum
> (https://community.amd.com/thread/215773?start=300&tstart=0)
> and Phoronix (https://www.phoronix.com/forums/forum/hardware/processors-memory/955368-some-ryzen-linux-users-are-facing-issues-with-heavy-compilation-loads).
>
> This is probably a processor bug. But I thought that I should try to call the attention of
> the kernel developers to this issue as it may be possible to workaround it in the kernel.
>
> Obs: If I disable SMT in BIOS the problem gets much better moving from failures
> after a couple of minute to one failure in 3 to 4 hours)
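
For readers who want to see the shape of such a test without reading the
script, here is a minimal sketch of the idea described in the report: run the
same build command in several parallel workers and report how long it took
until the first one failed. This is not kill_ryzen.sh itself. The function
name stress() and its parameters are made up for illustration, and treating
any nonzero exit status as a failure is an assumption of this sketch.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def stress(cmd, jobs=4, rounds=1):
    """Run `cmd` in `jobs` parallel workers, up to `rounds` times each.

    Returns (seconds_elapsed, returncode) for the first failing run that
    is observed, or None if every run exited with status 0.
    """
    start = time.monotonic()

    def worker():
        # Each worker repeats the command and stops at its first failure.
        for _ in range(rounds):
            rc = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            ).returncode
            if rc != 0:
                return rc
        return 0

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = [pool.submit(worker) for _ in range(jobs)]
        for fut in as_completed(futures):
            rc = fut.result()
            if rc != 0:
                # The elapsed time is taken here; leaving the "with" block
                # still waits for the remaining workers to finish.
                return time.monotonic() - start, rc
    return None
```

A call would look like stress(["make", "-C", "some-build-dir"], jobs=16,
rounds=100), where the build directory is a placeholder. A None result only
means that no failure was seen within the given number of rounds.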

What I want here is to make this problem known to many people,
especially x86 experts, to ask for hints toward finding the root cause,
and to get a reliable workaround patch made.

Summary of this problem from my point of view:
- gcc sometimes fails with SEGV at random
- at least part of this problem is caused by instructions being executed at
"RIP - 0x40"
- tens of people have encountered this problem
- it is probably a hardware problem: users of other OSes (WSL, NetBSD, and
FreeBSD) have encountered a very similar problem. In addition, this
problem happens with ECC memory and with memory that passes memtest86
- the root cause has not been found yet. AMD seems to have been looking
for it for several months, but there has been no update from AMD yet
- there is a workaround patch in FreeBSD, but it is not clear that it is
reliable, since the root cause is unknown

For more details, please refer to the links in the bugzilla entry mentioned above.

Regards,
Satoru