Problems with Zen under Xen and recent Linux kernel improvements

From: Adam Novak
Date: Mon Jul 30 2018 - 21:14:36 EST


Hello,

I was advised to take this here, and to Boris Ostrovsky and Juergen
Gross, by Thomas Gleixner.

I am having some trouble with the new speculation control code that
has been added to the Linux kernel, for AMD Zen CPUs. I am running an
AMD Ryzen 7 1700, and I am running Linux as a Xen dom0 (which is part
of the problem; the code seems to work fine running outside of Xen).

I started having trouble on Ubuntu's commit
3f6a3b035f91a22c0d3bd27630bf61eac9c8cf6c, "x86/speculation: Handle HT
correctly on AMD", which appears to be cherry-picked from
1f50ddb4f4189243c05926b842dc1a0332195f31. Since that commit, my system
hangs during boot: it begins starting services and mounting
filesystems, printing "[OK]" messages, but fairly early on the kernel
complains that it is "unable to handle kernel NULL pointer dereference
at 000...0008".

On my Ubuntu bug:

https://bugs.launchpad.net/bugs/1777338

I have attached a "Screenshot of the null pointer dereference
message". The failure appears to happen while taking a spin lock in
the new speculative_store_bypass_update().

Has anyone else seen this behavior on these CPUs under Xen (I am using 4.9)?

Since the commit that started the problem has to do with sibling CPU
cores, I suspect that the problem may have something to do with how
Xen handles hyperthreading. Namely, Xen seems to hide hyperthreading
from the VMs running under it (including from dom0). Instead of seeing
8 cores with 2 threads each, my Linux running under Xen on my 8-core
Ryzen chip sees 16 virtual CPU cores, all of which still report
themselves as being the Ryzen 7 1700 processor.

For reference, my /proc/cpuinfo looks like this at the tail end:

processor : 14
vendor_id : AuthenticAMD
cpu family : 23
model : 1
model name : AMD Ryzen 7 1700 Eight-Core Processor
stepping : 1
microcode : 0x8001137
cpu MHz : 2994.027
cache size : 512 KB
physical id : 0
siblings : 16
core id : 0
cpu cores : 16
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu de tsc msr pae mce cx8 apic mca cmov pat clflush
mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm constant_tsc
rep_good nopl nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor
lahf_lm cmp_legacy abm sse4a misalignsse 3dnowprefetch bpext cpb
vmmcall fsgsbase bmi1 avx2 bmi2 rdseed adx clflushopt sha_ni xsaveopt
xsavec xgetbv1 clzero ibpb arat ssbd
bugs : fxsave_leak null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips : 5989.03
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management:

processor : 15
vendor_id : AuthenticAMD
cpu family : 23
model : 1
model name : AMD Ryzen 7 1700 Eight-Core Processor
stepping : 1
microcode : 0x8001137
cpu MHz : 2994.027
cache size : 512 KB
physical id : 0
siblings : 16
core id : 0
cpu cores : 16
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu de tsc msr pae mce cx8 apic mca cmov pat clflush
mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm constant_tsc
rep_good nopl nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor
lahf_lm cmp_legacy abm sse4a misalignsse 3dnowprefetch bpext cpb
vmmcall fsgsbase bmi1 avx2 bmi2 rdseed adx clflushopt sha_ni xsaveopt
xsavec xgetbv1 clzero ibpb arat ssbd
bugs : fxsave_leak null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips : 5989.03
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management:

All the processors report core id 0, and "cpu cores" is 16. When
booted outside of Xen, I still have processors 0-15 in /proc/cpuinfo,
but they come in pairs with core ids 0-7, and "cpu cores" is 8.
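
In case it helps anyone compare their own system, here is a minimal
Python sketch that groups the logical processors in /proc/cpuinfo by
their "physical id" and "core id" fields. The function names
(parse_cpuinfo, group_by_core, report) are made up for illustration,
and it only reads the fields shown above; it does not say anything
about what the kernel's sibling-tracking code sees internally.

```python
from collections import defaultdict


def parse_cpuinfo(text):
    """Split /proc/cpuinfo text into one dict of fields per logical processor."""
    cpus = []
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            # Lines without a colon (e.g. mail-wrapped flag lists) are skipped.
            if ":" in line:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
        if "processor" in fields:
            cpus.append(fields)
    return cpus


def group_by_core(cpus):
    """Map (physical id, core id) to the list of logical processor numbers."""
    cores = defaultdict(list)
    for cpu in cpus:
        # "?" marks a field that this kernel/hypervisor did not report.
        key = (cpu.get("physical id", "?"), cpu.get("core id", "?"))
        cores[key].append(int(cpu["processor"]))
    return dict(cores)


def report(path="/proc/cpuinfo"):
    """Print one line per (package, core) pair with its logical processors."""
    with open(path) as f:
        cores = group_by_core(parse_cpuinfo(f.read()))
    for (package, core), threads in sorted(cores.items()):
        print(f"package {package} core {core}: processors {threads}")
```

Calling report() should print a single "core 0" line listing all 16
processors if the output matches what I see under Xen, and eight lines
with two processors each if it matches what I see outside of Xen.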

If it looks like this during the boot process, and the new
sibling-thread-aware code is looking for hyperthreading that Xen
doesn't expose, maybe that is causing the problem?

Thanks,
-Adam