Re: [PATCH v2] x86/PCI: Disable exclusion of E820 reserved addressed in some cases

From: Bjorn Helgaas
Date: Thu Mar 03 2022 - 22:21:27 EST


On Thu, Mar 03, 2022 at 02:29:30PM +0100, Hans de Goede wrote:
> On 3/3/22 01:40, Bjorn Helgaas wrote:
> > On Mon, Feb 28, 2022 at 11:52:59AM +0100, Hans de Goede wrote:
> >
> > I know Rafael has already applied this, but I'm still trying to
> > understand this because it looks like a very complicated maintenance
> > problem.
> >
> >> Some fw has a bug where the PCI bridge window returned by the ACPI
> >> resources partly overlaps with some other address range, causing issues.
> >> To workaround this Linux excludes E820 reserved addresses when allocating
> >> addresses from the PCI bridge window. 2 known examples of such fw bugs are:
> >>
> >> 1. The returned window contains addresses which map to system RAM,
> >> see commit 4dc2287c1805 ("x86: avoid E820 regions when allocating
> >> address space").
> >
> > Bug report is https://bugzilla.kernel.org/show_bug.cgi?id=16228
> > First dmesg log https://bugzilla.kernel.org/attachment.cgi?id=26811
> > shows:
> >
> > BIOS-e820: 00000000bfe4dc00 - 00000000c0000000 (reserved)
> > pci_root PNP0A03:00: host bridge window [mem 0xbff00000-0xdfffffff]
> > pci 0000:00:1f.2: no compatible bridge window for [mem 0xff970000-0xff9707ff]
> > pci 0000:00:1f.2: BAR 5: assigned [mem 0xbff00000-0xbff007ff]
> > ahci 0000:00:1f.2: controller reset failed (0xffffffff)
> > ahci 0000:00:1f.2: failed to stop engine (-5)
> >
> > The problem is that _CRS advertises [mem 0xbff00000-0xdfffffff], and
> > we assigned [mem 0xbff00000-0xbff007ff] to 00:1f.2, but
> > 0xbff00000-0xbfffffff is not usable for PCI devices. My guess is that
> > it contains host bridge registers, but all we really know is that it
> > doesn't work.
> >
> > I think the _CRS that includes non-usable space is clearly a BIOS
> > defect.
> >
> > The fix from 4dc2287c1805 was to avoid that region based on the
> > 0xbfe4dc00-0xc0000000 E820 entry, and the result is in
> > https://bugzilla.kernel.org/attachment.cgi?id=30662:
> >
> > BIOS-e820: 00000000bfe4dc00 - 00000000c0000000 (reserved)
> > pci_root PNP0A03:00: host bridge window [mem 0xbff00000-0xf7ffffff]
> > pci_root PNP0A03:00: host bridge window [mem 0xff980000-0xff980fff]
> > pci 0000:00:1f.2: reg 24: [mem 0xff970000-0xff9707ff] # BAR 5
> > pci 0000:00:1f.2: no compatible bridge window for [mem 0xff970000-0xff9707ff]
> > pci 0000:00:1f.2: BAR 5: assigned [mem 0xff980800-0xff980fff]
> >
> > The patch below doesn't affect this workaround.
> >
> >> 2. The Lenovo X1 carbon gen 2 BIOS has an overlap between an EFI/E820
> >> reserved range and the ACPI provided PCI bridge window:
> >> efi: mem46: [MMIO] range= [0x00000000dfa00000-0x00000000dfa0ffff] (0MB)
> >> BIOS-e820: [mem 0x00000000dceff000-0x00000000dfa0ffff] reserved
> >> pci_bus 0000:00: root bus resource [mem 0xdfa00000- 0xfebfffff window]
> >> If Linux assigns the overlapping 0xdfa00000-0xdfa0ffff range to a PCI BAR
> >> then the system fails to resume after a suspend.
> >
> > I think this is from https://bugzilla.redhat.com/show_bug.cgi?id=2029207
>
> Correct.
>
> > If I understand correctly, the log in comment 23 from Ivan
> > (https://bugzilla.redhat.com/attachment.cgi?id=1859801) is a
> > case where resume doesn't work:
> >
> > BIOS-e820: [mem 0x00000000dceff000-0x00000000dfa0ffff] reserved
> > pci_bus 0000:00: root bus resource [mem 0xdfa00000-0xfebfffff window]
> > pci 0000:00:1c.0: BAR 14: assigned [mem 0xdfa00000-0xdfbfffff]
> >
> > And the log in comment 38, also from Ivan,
> > (https://bugzilla.redhat.com/attachment.cgi?id=1861539) is a case
> > where resume *does* work:
> >
> > BIOS-e820: [mem 0x00000000dcf00000-0x00000000dfa0ffff] reserved
> > efi: mem46: [MMIO |RUN| | | | | | | | | | | | | ] range=[0x00000000dfa00000-0x00000000dfa0ffff] (0MB)
> > pci_bus 0000:00: root bus resource [mem 0xdfa00000-0xfebfffff window]
> > pci 0000:00:1c.0: BAR 14: assigned [mem 0xdfb00000-0xdfcfffff]
> >
> > _CRS advertises [mem 0xdfa00000-0xfebfffff], but when we assign the
> > 0xdfa00000-0xdfafffff region to the 00:1c.0 MMIO window, resume fails.
>
> Correct this "when we assign the 0xdfa00000-0xdfafffff region to
> the 00:1c.0 MMIO window, resume fails" is my conclusion too.
>
> > I don't see a theory about what the root cause is.
>
> Correct, because I have no theory, I merely observed that:
>
> 1. 0xdfa00000-0xdfa0ffff (note the 0 between the a and ffff)
> is reserved in the E820 ranges (this was first seen with
> classic BIOS boot, but also reproduces with EFI bootiung)
>
> 2. If the kernel assigns PCI resources to it despite it being
> reserved the reporter reports resume being broken.
>
> I indeed do not know the exact cause of this.
>
> > It's possible this
> > is similar to case 1 above, where _CRS is defective.
>
> I definitely consider this another case of _CRS being defective
> and it is similar to 1 in that regards, yes.
>
> > But here we're
> > only assigning the 00:1c.0 MMIO window, and 00:1c.0 is a bridge to
> > [bus 02], and there are no devices on bus 02. There should be no
> > transactions that use that MMIO window, so it's not clear why this
> > should matter.
> >
> > If this is a _CRS defect similar to case 1, it's possible there are
> > host bridge registers in 0xdfa00000-0xdfafffff, and BIOS might use
> > those during resume, and assigning that area to the 00:1c.0 MMIO
> > window might interfere with that. But that's a lot of speculation.
> >
> > If you have a good Lenovo contact, they might be able to confirm or
> > deny this.
> >
> > I'm missing some things here that should be obvious; can you help me
> > out?
> >
> > - Why did the 4dc2287c1805 workaround not apply here? The E820
> > region overlaps the _CRS window just like in case 1, so why did we
> > assign 0xdfa00000?
>
> The 4dc2287c1805 workaround does apply here. The problem is that
> my earlier patch which Rafael merged to not honor e820 reservations
> on newer BIOS-es broke suspend/resume since that patch disables
> the 4dc2287c1805 workaround on this machine.

Ah! That makes sense now, thank you! So this is merely an example of
a newer machine for which 4dc2287c1805 is important. It's not a case
where *this* patch is relevant.

> The comment tries to mention this issue as a second case of
> why the 4dc2287c1805 workaround is necessary.
>
> The goal of this part of the comment is to explain why we
> cannot just disable the 4dc2287c1805 workaround.
>
> > - How does this patch work around the problem? This patch checks if
> > a host bridge window is completely contained in an EFI MMIO
> > region, but I don't see such an EFI region here.
>
> It does not help with case 2 from the comment, since that
> case is already effectively fixed by the 4dc2287c1805 workaround.
>
> This patch is a replacement for my earlier fix to disable
> the 4dc2287c1805 workaround on newer systems. Since that patch
> was causing the regression reported in the Red Hat 2029207
> the patch to disable the 4dc2287c1805 workaround on newer systems
> has been reverted.
>
> So this is yet another attempt to fix the case where
> the PCI subsystem cannot assign resources to some PCI
> devices because the entire root bridge window is covered
> by an EFI memtable MMIO entry, which gets translated into
> an e820_table reserved entry.
>
> What this patch does is identify systems which have the
> problem of the entire root bridge window being covered
> by an EFI memtable MMIO entry and only disable the
> 4dc2287c1805 workaround there.

I think having the entire root bridge window covered by an EFI MMIO
entry is not actually a problem per spec. My previous response to the
log message applies:

> > This message suggests that marking the host bridge window as MMIO in
> > EFI is a defect, or at least something unusual and worthy of being
> > flagged. But I think it's perfectly legal.
> >
> > UEFI v2.8, sec 7.2, says EfiMemoryMappedIO means:
> >
> > Used by system firmware to request that a memory-mapped IO region be
> > mapped by the OS to a virtual address so it can be accessed by EFI
> > runtime services.

The problem is only that Linux doesn't handle this correctly because
of 4dc2287c1805, which makes it think the entire window is
unavailable.

> The idea here is to only disable arch_remove_reservations()
> taking e820 reservations into account on as narrow a set of
> systems as possible.
>
> Keeping it enabled on all other systems, including recent
> systems. Since recent systems also benefit from the
> 4dc2287c1805 workaround as the regression from my patch
> to disable it on recent systems has shown us.
>
> I hope this helps understand things better.

It does, it was very helpful, thanks!

I have an alternate proposal which does basically the same thing
(skipping the 4dc2287c1805 workaround when an E820 entry covers the
entire window), but I think it's slightly simpler.

And more importantly, it adds a little bit of logging and updates the
window itself instead of clipping on every alloc, which I should have
done in 4dc2287c1805 from the beginning, and should make issues easier
to debug.

I'll post it shortly.

Bjorn