Re: [PATCH] Prevent AMD MCE oops on multi-server system

From: Borislav Petkov
Date: Mon Oct 01 2012 - 14:01:47 EST


On Tue, Oct 02, 2012 at 12:12:31AM +0800, Daniel J Blueman wrote:
> On 01/10/2012 18:06, Borislav Petkov wrote:
> >On Mon, Oct 01, 2012 at 02:42:05PM +0800, Daniel J Blueman wrote:
> >>When booting on a federated multi-server system, the processor Northbridge
> >>lookup returns NULL; add guards to prevent this causing an oops.
> >Interesting.
> >
> >What does lspci say on those systems?
> >
> >Thanks.
> As NumaConnect remote-server I/O is in a pre-release stage, we only
> expose I/O on the first (root) server, so the lspci on eg my three
> server, single-socket C32 development system is uninteresting [1].

Yeah, I was looking for the NB devices:

> 00:18.0 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor HyperTransport Configuration
> 00:18.1 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor Address Map
> 00:18.2 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor DRAM Controller
> 00:18.3 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor Miscellaneous Control
> 00:18.4 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor Link Control

[ … ]

> We map MMCONFIG addresses in the global address map to the
> respective server, which is how we access the processor Northbridges
> in the bootloader before Linux loads, so they are accessible and get
> enumerated when we enable remote I/O with the ACPI SSDT we generate,
> however since the AMD APIC IDs (hence NB IDs) are only 8-bit, the
> present amd_get_nb_id will produce duplicate NB IDs at best (but in
> this case, as we disable I/O routing, there is no structure); later,
> we may propose using eg bits 23:8 for the server ID. That's
> another discussion though.

Ah yes, I remember now. We had this discussion already, AFAIR. So if you
say you disable I/O routing, what actually doesn't work out as expected
is the NB enumeration in amd_nb.c where pci_get_device simply fails?

Because if you had duplicate APIC IDs, you'd at least get some NB
descriptor, even if not the correct one?
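
To make the two cases concrete, here is a rough userspace sketch, not
the amd_nb.c code. All names in it (nb_desc, nb_table, nb_id_8bit,
nb_lookup, nb_server_of) are made up for illustration, and it assumes
the lookup is a plain table indexed by an ID truncated to 8 bits. With
a populated table, an ID from another server aliases onto an existing
descriptor; with a table that enumeration never filled, the lookup
returns NULL and the caller has to check for that:

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_NB 256	/* size of an 8-bit ID space */

/* Hypothetical stand-in for a northbridge descriptor. */
struct nb_desc {
	int server;	/* server the NB was enumerated on */
};

/* Hypothetical lookup table; a slot is NULL if nothing was enumerated. */
struct nb_table {
	struct nb_desc *nb[MAX_NB];
};

/*
 * Assumption: the wider (server, local) ID is simply truncated to
 * 8 bits, so the server part is lost and IDs from different servers
 * collide.
 */
static uint8_t nb_id_8bit(unsigned int server, unsigned int local_id)
{
	return (uint8_t)((server << 8) | local_id);
}

/* Raw lookup: may return NULL when the slot was never populated. */
static struct nb_desc *nb_lookup(const struct nb_table *t,
				 unsigned int server, unsigned int local_id)
{
	return t->nb[nb_id_8bit(server, local_id)];
}

/* Guarded consumer: report -1 instead of dereferencing a NULL descriptor. */
static int nb_server_of(const struct nb_table *t,
			unsigned int server, unsigned int local_id)
{
	struct nb_desc *nb = nb_lookup(t, server, local_id);

	if (!nb)
		return -1;

	return nb->server;
}
```

In this sketch, a lookup for server 2 against a table that only holds
the root server's descriptor hands back the root descriptor (wrong, but
non-NULL), while any lookup against an empty table returns NULL, which
is the case the guard covers.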

> The minimal patch at least corrects the oops regression which didn't
> happen in earlier kernels.

Right, I beefed it up a bit and added a stable tag, pls take a look and
let me know if it is ok. I'll run it on a couple of machines but I don't
expect any issues so I'll send it upstream soon.

Thanks.

---