Re: [PATCH 0/2] PCI: PCI-E MPS bug fixes

From: Jon Mason
Date: Tue Sep 13 2011 - 16:55:56 EST


On Tue, Sep 13, 2011 at 5:50 AM, Benjamin Herrenschmidt
<benh@xxxxxxxxxxxxxxxxxxx> wrote:
>
>>
>> You should apply them. I would queue them, and but I don't have
>> another PCI tree anywhere for you to pull them.
>>
>> In fact, I'll be out for the next few weeks; I'll ask Bjorn to cover
>> for me.
>
> Note that I might be missing some patches, but what's currently in Linus
> tree is rather wrong.

It is worth noting that with my last patch, this is not the default
behavior. Whether it is faulty is open for debate.

>
> The comment is right, but the implementation is bogus:
>
> static void pcie_write_mps(struct pci_dev *dev, int mps)
> {
>        int rc, dev_mpss;
>
>        dev_mpss = 128 << dev->pcie_mpss;
>
>        if (pcie_bus_config == PCIE_BUS_PERFORMANCE) {
>                if (dev->bus->self) {
>                        dev_dbg(&dev->bus->dev, "Bus MPSS %d\n",
>                                128 << dev->bus->self->pcie_mpss);
>
>                        /* For "MPS Force Max", the assumption is made that
>                         * downstream communication will never be larger than
>                         * the MRRS.  So, the MPS only needs to be configured
>                         * for the upstream communication.  This being the case,
>                         * walk from the top down and set the MPS of the child
>                         * to that of the parent bus.
>                         */
>                        mps = 128 << dev->bus->self->pcie_mpss;
>
> So here, we decide to use as an mps, the mpss of the bridge. That's already not
> quite right. The bridge might itself be clamped down (because it's parent has
> a smaller MPSS, well, your other bugs below will prevent that but that's how
> it should be)

This is started from the root complex. So, it should be able to
handle the maximum setting of the bridge chip.

>
> So we should use the MPS of the bridge not the MPSS (ie the configured MPS of
> the bridge, not the max MPS of the bridge).
>
> That does mean that we do need to configure the bridge first before we configure
> the bridge children, I don't remember if that's what happens with the walk function
> so we need to dbl check (I have to run, no time to check right now)

Yes, that is the behavior.

>
>                        if (mps > dev_mpss)
>                                dev_warn(&dev->dev, "MPS configured higher than"
>                                         " maximum supported by the device.  If"
>                                         " a bus issue occurs, try running with"
>                                         " pci=pcie_bus_safe.\n");
>
> Yeah right. Instead of warning, we should actually -clamp- it. IE If the
> bridge can do 4096 bytes and the device can only do 128 bytes, it's crazy
> to try to set the device to 4096 bytes.

You are correct. I was looking at this code when the bugs started
coming in and realized this was erroneous. A simple mps=min(mps,
dev_mpss) should fix it.

> The idea is that we set the device to the higher it can up to the size
> supported by the brigde, not the exact size of the brigde regardless of
> whether that's bigger than what the device can handle.
>
> It -never- makes any sense to configure a device to an MPS larger than
> it's own maxium.
>
>                }
>
>                dev->pcie_mpss = ffs(mps) - 8;
>
> And here I'm baffled. Why on earth would we -ever- want to change dev->pcie_mpss ?

I was being overly clever here and using it as a way to actual MPS of
the bridges. A comment here would be helpful, but in hindsight I
think it is a bad idea to do it this way.

>
> This is the the HARD maximum supported by the device, as read from the capability
> register. Changing our idea of what the device supports as a maximum isn't going
> to help us... all it will do is allow the subsequent:
>
>        }
>
>        rc = pcie_set_mps(dev, mps);
>        if (rc)
>                dev_err(&dev->dev, "Failed attempting to set the MPS\n");
>
> To actually try to program the bogus mps value we just calculated (ie the device
> larger than what it supports).
>
> }
>
> The right thing to do here for the "performance" case is to use the min() of
> the bus mps and the dev_mpss basically.
>
> Ie. Going top down:
>
>  - If !parent, mps = mpss else mps = min(mpss, parent->mps)
>  - Iterate children
>
> That's it.
>
> So of course, the current implementation -will- break things. It might even explain
> some of the radeon problems you've been observing, ie, might have nothing to do with
> MRRS...

My patch before that removed MRRS setting from the "performance"
option solved a majority of the problems seen, but as you said in
response to that patch, it was buggy.

>
> As to whether we should default to "safe" or "performance", well, it really depends
> on the platform.
>
> I know for a fact on power that my host bridges are never going to generate a downstream
> request with a TLP larger than 128 bytes, and we never do PCIe  <-> PCIe transfers,
> so "performance" is good for me.
>
> x86 with funky DMA engines in the CPU etc... might be capable of generating larger
> TLPs, in which case it must switch to "safe".

Only is it is not observing the MRRS value.

>
> Note that this whole business is really a mis-design of PCIe. The 'performance' mode is
> a way to try to work around it and not end up with everything clamped down to the minimum
> MPS (128 bytes) all the time, but it's a bit risky. Ideally the spec should have separated
> at least the MPS of packets generated vs. the MPS of packets received at the root
> complex level.

And a fix for poorly written BIOS (the original cause of this patch).

>
> Now, I am giving some courses/training today and may not have time to propose a patch,
> but if you don't get to it by tomorrow I'll see what I can do.

I'll make a pass at this shortly and send it to linux-pci. Thanks for
going over the code.

>
> Cheers,
> Ben.
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/