Re: AMD EPYC Topology problems

From: Andrew Cooper
Date: Mon Dec 03 2018 - 06:23:57 EST


On 03/12/2018 11:13, Borislav Petkov wrote:
> On Sun, Dec 02, 2018 at 08:23:05PM +0000, Andrew Cooper wrote:
>> Hello,
>>
>> I have dual socket server with the following processor:
>>
>> [root@xrtmia-09-01 ~]# head /proc/cpuinfo
>> processor : 0
>> vendor_id : AuthenticAMD
>> cpu family : 23
>> model : 1
>> model name : AMD EPYC 7281 16-Core Processor
>> stepping : 2
>>
>> Which has highlighted a issue in the topology derivation logic.Â
>> (Actually, it was discovered with Xen, but we share the same topology
>> infrastructure and the issue is also present with Linux).
>>
>> There are a total of 64 threads in the system, made of two 32-thread
>> sockets. The APIC IDs for this system are sparse - they are 0x0-0x3,
>> 0x8-0xb, 0x10-0x13 etc, all the way up to 0x7b.
>>
>> This is because the socket is made of 4 nodes with 4 cores each, but
>> space has been left in the layout for the maximum possible number of
>> APIC IDs.
>>
>> In particular, CPUID 0x80000008:ecx reports 0x0000601f. That is, an
>> APIC ID shift of 6 (reporting a maximum of 64 threads per socket), and
>> NC as 31 (reporting 32 threads per socket in the current configuration).
>>
>> c->x86_max_cores is derived from NC and shifted once to exclude threads,
>> giving it a final value of 16 cores per socket.
> So far so good.
>
>> Given the sparseness of the APIC IDs, it is unsafe to allocate an array
> Do we do this somewhere or is this a hypothetical thing?

Its the root of a memory corruption issue I've been chasing in Xen. It
might be just hypothetical in Linux.

>> of c->x86_max_cores entries, then index it with c->cpu_core_id, as half
>> the cores in the system have a cpu_core_id greater than x86_max_cores.Â
> You lost me here. ->cpu_core_id comes from CPUID_Fn8000001E_EBX[7:0].
> Are you saying, those core IDs on your box are sparse like the APIC IDs
> you mention above?

Correct. They are sparse, like the APIC IDs. (Sorry - I should have
made this clearer to begin with).

>> There is no logical core ID derived during boot which might be a safe to
>> use as an index.
>>
>> Furthermore, the documentation indicates that these values are expected
>> to be per-package, while they are all actually per-socket (with up to 4
>> nodes per socket) in the EPYC case.
> From Documentation/x86/topology.txt:
> "
> - cpuinfo_x86.x86_max_cores:
>
> The number of cores in a package. This information is retrieved via CPUID."
>

Right, but the documentation also states that where it says package, it
means "Node" in AMD's terminology, and the information in CPUID is per
socket, not per node.

My point is that the numbers ending up in cpuinfo_x86 don't match the
semantics described by the documentation.

~Andrew