Re: [RFC v2 PATCH 1/1] PCI: override BIOS/firmware resourceallocation

From: Ram Pai
Date: Tue Oct 19 2010 - 13:18:07 EST


On Mon, Oct 18, 2010 at 01:10:10PM -0700, Jesse Barnes wrote:
> On Tue, 12 Oct 2010 13:01:54 -0600
> Bjorn Helgaas <bjorn.helgaas@xxxxxx> wrote:
>
> > On Tuesday, October 12, 2010 01:05:15 am Ram Pai wrote:
> > > On Fri, Oct 08, 2010 at 02:16:57PM -0600, Bjorn Helgaas wrote:
> > > > On Friday, October 08, 2010 11:32:24 am Ram Pai wrote:
> > > > > On Thu, Oct 07, 2010 at 03:41:04PM -0600, Bjorn Helgaas wrote:
> > > > > > On Thursday, October 07, 2010 02:42:13 pm Ram Pai wrote:
> > > > > > > On Wed, Oct 06, 2010 at 10:13:02PM -0600, Bjorn Helgaas wrote:
> > > > > > > > On Wed, Oct 06, 2010 at 05:30:41PM -0700, Ram Pai wrote:
> > > > > > > > > On Wed, Oct 06, 2010 at 05:39:53PM -0600, Bjorn Helgaas wrote:
> > > > > > > > > > On Wed, Oct 06, 2010 at 03:58:34PM -0700, Ram Pai wrote:
> > > > > > > > > > > PCI: override BIOS/firmware memory resource allocation
> > > > > > > > > > > through command line parameters
> > > > > > > > > > >
> > > > > > > > > > > Platforms that are unaware of SRIOV BARs fail to allocate MMIO
> > > > > > > > > > > resources to SRIOV PCIe devices. Hence on such platforms the
> > > > > > > > > > > OS fails to enable SRIOV.
> > > > > > > > > > > Some platforms where BIOS/uEFI resource allocations conflict
> > > > > > > > > > > the conflicting devices are disabled.
> > > > > > > > > > >
> > > > > > > > > > > Ideally we would want the OS to detect and fix automatically
> > > > > > > > > > > such problems and conflicts. However previous attempts to do so
> > > > > > > > > > > have led to regression on legacy platforms.
> > > > > > > > > >
> > > > > > > > > > I'm sorry to be a nay-sayer, but I think we just haven't tried hard
> > > > > > > > > > enough. Our ACPI/PCI/e820 resource management is not well integrated,
> > > > > > > > > > and I suspect if we straightened that out, we could avoid some of the
> > > > > > > > > > regressions we saw with previous attempts.
> > > > > > > > >
> > > > > > > > > Can you be more specific as to what can be done to fix it automatically?
> > > > > > > > >
> > > > > > > > > Neither accepting this approach nor telling what needs to be straightened out
> > > > > > > > > to automatically fix all the systems out there, is just a deadend.
> > > > > > > >
> > > > > > > > Yeah, I guess that wasn't really fair, sorry. And keep in mind that I'm
> > > > > > > > not the PCI maintainer, so these are just my opinions, nothing like an
> > > > > > > > official "nack."
> > > > > > > >
> > > > > > > > I did look at this dmesg log from the thread you referenced:
> > > > > > > > http://marc.info/?l=linux-kernel&m=127178918128740&w=2
> > > > > > > > but it looks to me like we just completely botched it. I don't see an
> > > > > > > > SRIOV device or anything else that didn't have resources, so as far as I
> > > > > > > > can tell, we started with working resource assignments from the BIOS,
> > > > > > > > threw them away, and started over from scratch. We failed because we
> > > > > > > > tried to assign I/O port space to bridges with nothing behind them, and
> > > > > > > > there was nothing left by the time we got to the 0000:09:04.0 device
> > > > > > > > that actually *did* need the space.
> > > > > > >
> > > > > > > hmm.. is that possible? Yinghai's patch sized the resource requirement of each
> > > > > > > of the bridges, before actually allocating them. Which means a bridge with
> > > > > > > no device behind it would not get any i/o space.
> > > > > >
> > > > > > Here's what I see in the dmesg log referenced above:
> > > > > >
> > > > > > ACPI: PCI Root Bridge [PCI0] (0000:00)
> > > > > > pci_root PNP0A08:00: host bridge window [io 0x0000-0x0cf7]
> > > > > > pci_root PNP0A08:00: host bridge window [io 0x0d00-0xffff]
> > > > > > pci 0000:00:1c.0: PCI bridge to [bus 04-09]
> > > > > > pci 0000:00:1c.0: bridge window [io 0xd000-0xdfff]
> > > > > > pci 0000:04:00.0: PCI bridge to [bus 05-09]
> > > > > > pci 0000:04:00.0: bridge window [io 0xd000-0xdfff]
> > > > > > pci 0000:05:01.0: PCI bridge to [bus 08-09]
> > > > > > pci 0000:05:01.0: bridge window [io 0xd000-0xdfff]
> > > > > > pci 0000:08:00.0: PCI bridge to [bus 09-09]
> > > > > > pci 0000:08:00.0: bridge window [io 0xd000-0xdfff]
> > > > > > pci 0000:09:04.0: found [13f6:8788] class 000401 header type 00
> > > > > > pci 0000:09:04.0: reg 10: [io 0xd800-0xd8ff]
> > > > > > pci 0000:05:02.0: PCI bridge to [bus 07-07]
> > > > > > pci 0000:05:02.0: bridge window [io 0xf000-0x0000] (disabled)
> > > > > > pci 0000:05:03.0: PCI bridge to [bus 06-06]
> > > > > > pci 0000:05:03.0: bridge window [io 0xf000-0x0000] (disabled)
> > > > > >
> > > > > > The above is the state as we got it from BIOS. Despite all the bridges,
> > > > > > 09:04.0 is the only device below the 00:1c.0 bridge, and it requires only
> > > > > > 0x100 I/O ports.
> > > > > >
> > > > > > There are no devices on buses 06 (below 05:03.0) or 07 (below 05:02.0).
> > > > > >
> > > > > > I didn't look at Yinghai's patch to figure out *why*, but it sure looks like
> > > > > > we released the 09:04.0 space, then tried to assign 0x2000 ports to 05:01.0
> > > > > > (which needs 0x100 and had 0x1000 originally), 0x1000 to 05:02.0 (which needs
> > > > > > none), and 0x1000 to 05:03.0 (which also needs none):
> > > > > >
> > > > > > PCI: No. 3 try to assign unassigned res
> > > > > > release child resource [io 0xd800-0xd8ff]
> > > > > > pci 0000:08:00.0: resource 7 [io 0xd000-0xdfff] released
> > > > > > pci 0000:04:00.0: BAR 7: can't assign io (size 0x4000)
> > > > > > pci 0000:05:01.0: BAR 7: can't assign io (size 0x2000)
> > > > > > pci 0000:05:02.0: BAR 7: can't assign io (size 0x1000)
> > > > > > pci 0000:05:03.0: BAR 7: can't assign io (size 0x1000)
> > > > > > pci 0000:08:00.0: BAR 7: can't assign io (size 0x1000)
> > > > >
> > > > > Actually the message preceeding to them are even more surprising:
> > > > >
> > > > > Apr 20 20:31:42 [kernel] pci 0000:04:00.0: BAR 8: can't assign mem (size
> > > > > 0xc00000)
> > > > > Apr 20 20:31:42 [kernel] pci 0000:05:01.0: BAR 8: can't assign mem (size
> > > > > 0x200000)
> > > > > Apr 20 20:31:42 [kernel] pci 0000:05:01.0: BAR 9: can't assign mem pref
> > > > > (size 0x200000)
> > > > > Apr 20 20:31:42 [kernel] pci 0000:05:02.0: BAR 8: can't assign mem (size
> > > > > 0x400000)
> > > > > Apr 20 20:31:42 [kernel] pci 0000:05:03.0: BAR 7: can't assign io (size
> > > > > 0x1000)
> > > > >
> > > > > Do these bridges have IOV BARs and those BARs are demanding i/o resources?
> > > > > Something is really funny with this machine. Or I am reading this wrong?
> > > >
> > > > I don't see anything strange about the machine, but I don't understand
> > > > what Yinghai's patch was doing. The lspci output is here:
> > > > http://marc.info/?l=linux-pci&m=127184080929189&w=2
> > > >
> > > > I don't see any SRIOV devices, and this kernel wasn't built with IOV
> > > > support anyway. The messages above are attempts to assign resources
> > > > for bridge windows.
> > > >
> > > > But the only real device in the whole hierarchy behind 00:1c.0
> > > > is 09:04.0, which doesn't require memory space, so I have no idea
> > > > why we're trying to open memory windows.
> > >
> > > I dont understand the code much. But there is some alignment
> > > constraints on the memory and io space dictated by the PCI code in
> > > pbus_size_io() and pbus_size_mem(), a constraint that ends up
> > > associating a minimum of 4096 bytes io window and
> > > nearly about 2MB mem window to a bridge irrespective of the
> > > requirements of the devices behind the bridge. That sounds like a
> > > bug to me. But I don't know if those are valid constraints either.
> > > Any idea?
> >
> > I think memory windows comes in 1MB chunks and I/O in 4096 port chunks.
> > There are some bridges that support smaller I/O windows (search for
> > "en1k" in drivers/pci/quirks.c), but it looks like we don't take that
> > into account when allocating space.
> >
> > And of course, there's no requirement to enable the window at all if
> > we don't have any downstream devices.
> >
> > > > This is what I meant about "we haven't tried hard enough yet." I'd
> > > > rather spend time fixing the problems like this, than just putting the
> > > > buggy code in and adding a kernel parameter to enable it. That
> > > > guarantees that we'll never get enough testing to really shake it
> > > > out.
> > >
> > > Ok. I am no fan of kernel parameters either. However I am afraid
> > > we will attract furious looks as and when someone regresses.
> >
> > Yep, I'm used to that :-)
>
> So where do we stand with this machine's problem?

I think, this machine with the latest mainline kernel, will see memory resource
allocation failure messages. Since the latest kernel does not release and retry
to allocate resources, the io resources allocated by the BIOS continue to stay
put and hence the problem is masked.

However, any attempt to release and reallocate the resource on that machine
will fail; because as pointed out by Bjorn, there is some weird allocation
behavior in the current code. Unfortunately I cannot trigger that behavior on
any of my machines.

I have requested data from Peter, who originally reported the problem.
Hope he still has his setup with the Xonar card available for debugging.

Anyway, I do see a smoking gun in pbus_size_io() and pbus_size_mem(). They
call resource_size() to find the size requirement of resources of all devices
behind the bridge. However for resources whose start and size are set to zero,
resource_size() returns one. Later ALIGN() rounds it up to the next higher
alignment boundary.

>
> Ram, do you have other machines that require your override patch?

Yes, I have a couple of machines whose BIOS is unaware of SRIOV resources,
These machines need the override patch. :(


>
> Until we understand what's failing and why, I'm hesitant to apply a
> patch that will work around the problem but require an extra kernel
> parameter.

We have already come a full circle here. The original approach was
reverted because it regressed a platform. Now we are rejecting this
approach because we want the original approach.

>
> Our general approach is to avoid overwriting the BIOS provided
> configuration, and when we do, to use available e820 and PnP data to
> figure out where to put things, in addition to allocating space in a
> similar way to Windows to avoid hitting undiscovered platform bugs
> (i.e. Bjorn's recent patchset).
>
> So while I'm not totally opposed to this patch, I'd feel much better
> about it if we really understood how the failing machines were failing
> before we try to workaround potential BIOS limitations.

Ok.
RP
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/