Re: [PATCH v6 1/2] PCI: iproc: Retry request when CRS returned from EP

From: Oza Oza
Date: Sat Aug 19 2017 - 15:32:14 EST


On Sat, Aug 19, 2017 at 11:56 PM, Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> On Fri, Aug 04, 2017 at 09:18:15PM +0530, Oza Pawandeep wrote:
>> PCIe spec r3.1, sec 2.3.2
>> If CRS software visibility is not enabled, the RC must reissue the
>> config request as a new request.
>>
>> - If CRS software visibility is enabled,
>> - for a config read of Vendor ID, the RC must return 0x0001 data
>> - for all other config reads/writes, the RC must reissue the
>> request
>>
>> iproc PCIe Controller spec:
>> 4.7.3.3. Retry Status On Configuration Cycle
>> Endpoints are allowed to generate retry status on configuration
>> cycles. In this case, the RC needs to re-issue the request. The IP
>> does not handle this because the number of configuration cycles needed
>> will probably be less than the total number of non-posted operations
>> needed.
>>
>> When a retry status is received on the User RX interface for a
>> configuration request that was sent on the User TX interface,
>> it will be indicated with a completion with the CMPL_STATUS field set
>> to 2=CRS, and the user will have to find the address and data values
>> and send a new transaction on the User TX interface.
>> When the internal configuration space returns a retry status during a
>> configuration cycle (user_cscfg = 1) on the Command/Status interface,
>> the pcie_cscrs will assert with the pcie_csack signal to indicate the
>> CRS status.
>> When the CRS Software Visibility Enable register in the Root Control
>> register is enabled, the IP will return the data value to 0x0001 for
>> the Vendor ID value and 0xffff (all 1âs) for the rest of the data in
>> the request for reads of offset 0 that return with CRS status. This
>> is true for both the User RX Interface and for the Command/Status
>> interface. When CRS Software Visibility is enabled, the CMPL_STATUS
>> field of the completion on the User RX Interface will not be 2=CRS and
>> the pcie_cscrs signal will not assert on the Command/Status interface.
>>
>> Note that, neither PCIe host bridge nor PCIe core re-issues the
>> request for any configuration offset.
>>
>> As a result of the fact, PCIe RC driver (sw) should take care of
>> retrying.
>> This patch fixes the problem, and attempts to read config space again
>> in case of PCIe code forwarding the CRS back to CPU.
>
> I think "fixes the problem" is a little bit of an overstatement. I
> think it covers up some cases of the problem, but if I understand
> correctly, we're still susceptible to data corruptions on both read
> and write:
>
> Per PCIe r3.1, sec 2.3.2, config requests that receive completions
> with Configuration Request Retry Status (CRS) should be reissued by
> the hardware *except* reads of the Vendor ID when CRS Software
> Visibility is enabled.
>
> This hardware never reissues configuration requests when it receives
> CRS completions.
>
> For config writes, there is no way for this driver to learn about
> CRS completions. A write that receives a CRS completion will not be
> retried and the data will be discarded. This is a potential data
> corruption.
>
> For config reads, this hardware returns CFG_RETRY_STATUS data when
> it receives a CRS completion for a config read, regardless of the
> address of the read or the CRS Software Visibility Enable bit. As a
> partial workaround for this, we retry in software any read that
> returns CFG_RETRY_STATUS.
>
> CFG_RETRY_STATUS could be valid data for other config registers. It
> is impossible to read such registers correctly because we can't tell
> whether the hardware (1) received a CRS completion and synthesized
> data of CFG_RETRY_STATUS or (2) received a successful completion
> with data of CFG_RETRY_STATUS.
>
> The only thing we can really do is return 0xffffffff data, which
> indicates a config read failure in the first case but is a data
> corruption in the second case.
>
> The write case is probably not that important because we have a
> similar situation on other systems. On other systems, the root
> complex automatically reissues config writes with CRS completions, but
> it may limit the number of retries. In that case, the failed write is
> probably logged as a Target Abort error, but nobody pays attention to
> that, so this isn't really much worse.
>
>> It implements iproc_pcie_config_read which gets called for Stingray,
>> Otherwise it falls back to PCI generic APIs.
>>
>> Signed-off-by: Oza Pawandeep <oza.oza@xxxxxxxxxxxx>
>> Reviewed-by: Ray Jui <ray.jui@xxxxxxxxxxxx>
>> Reviewed-by: Scott Branden <scott.branden@xxxxxxxxxxxx>
>>
>> diff --git a/drivers/pci/host/pcie-iproc.c b/drivers/pci/host/pcie-iproc.c
>> index 0f39bd2..583cee0 100644
>> --- a/drivers/pci/host/pcie-iproc.c
>> +++ b/drivers/pci/host/pcie-iproc.c
>> @@ -68,6 +68,9 @@
>> #define APB_ERR_EN_SHIFT 0
>> #define APB_ERR_EN BIT(APB_ERR_EN_SHIFT)
>>
>> +#define CFG_RETRY_STATUS 0xffff0001
>> +#define CFG_RETRY_STATUS_TIMEOUT_US 500000 /* 500 milli-seconds. */
>> +
>> /* derive the enum index of the outbound/inbound mapping registers */
>> #define MAP_REG(base_reg, index) ((base_reg) + (index) * 2)
>>
>> @@ -448,6 +451,66 @@ static inline void iproc_pcie_apb_err_disable(struct pci_bus *bus,
>> }
>> }
>>
>> +static int iproc_pcie_cfg_retry(void __iomem *cfg_data_p)
>> +{
>> + int timeout = CFG_RETRY_STATUS_TIMEOUT_US;
>> + unsigned int ret;
>> +
>> + /*
>> + * As per PCIe spec r3.1, sec 2.3.2, CRS Software
>> + * Visibility only affects config read of the Vendor ID.
>> + * For config write or any other config read the Root must
>> + * automatically re-issue configuration request again as a
>> + * new request.
>> + *
>> + * Iproc based PCIe RC (hw) does not retry
>> + * request on its own, so handle it here.
>> + * Note that, AXI data 0xffff0001 returned by PAXB could be
>> + * valid data in configuration space, for e.g. BAR requesting
>> + * 64bit IO memory, so we retry till timeout.
>> + * And leave to be handled by upper layers.
>> + *
>> + * Also, iproc PCIe RC does not have any effect on
>> + * CRS Software visibility bit, irrespective if it
>> + * set or unset, the RC will never re-issue any configuration
>> + * access request.
>> + */
>> + do {
>> + ret = readl(cfg_data_p);
>> + if (ret == CFG_RETRY_STATUS)
>> + udelay(1);
>> + else
>> + return PCIBIOS_SUCCESSFUL;
>> + } while (timeout--);
>> +
>> + return PCIBIOS_DEVICE_NOT_FOUND;
>> +}
>> +
>> +static void __iomem *iproc_pcie_map_ep_cfg_reg(struct iproc_pcie *pcie,
>> + unsigned int busno,
>> + unsigned int slot,
>> + unsigned int fn,
>> + int where)
>> +{
>> + u16 offset;
>> + u32 val;
>> +
>> + /* EP device access */
>> + val = (busno << CFG_ADDR_BUS_NUM_SHIFT) |
>> + (slot << CFG_ADDR_DEV_NUM_SHIFT) |
>> + (fn << CFG_ADDR_FUNC_NUM_SHIFT) |
>> + (where & CFG_ADDR_REG_NUM_MASK) |
>> + (1 & CFG_ADDR_CFG_TYPE_MASK);
>> +
>> + iproc_pcie_write_reg(pcie, IPROC_PCIE_CFG_ADDR, val);
>> + offset = iproc_pcie_reg_offset(pcie, IPROC_PCIE_CFG_DATA);
>> +
>> + if (iproc_pcie_reg_is_invalid(offset))
>> + return NULL;
>> +
>> + return (pcie->base + offset);
>> +}
>
> This is a duplicate of most of the code in iproc_pcie_map_cfg_bus().
> Can you use that directly, or factor this out somehow so it's not
> repeated?
>
>> +
>> /**
>> * Note access to the configuration registers are protected at the higher layer
>> * by 'pci_lock' in drivers/pci/access.c
>> @@ -499,13 +562,48 @@ static void __iomem *iproc_pcie_map_cfg_bus(struct pci_bus *bus,
>> return (pcie->base + offset);
>> }
>>
>> +static int iproc_pcie_config_read(struct pci_bus *bus, unsigned int devfn,
>> + int where, int size, u32 *val)
>> +{
>> + struct iproc_pcie *pcie = iproc_data(bus);
>> + unsigned int slot = PCI_SLOT(devfn);
>> + unsigned int fn = PCI_FUNC(devfn);
>> + unsigned int busno = bus->number;
>> + void __iomem *cfg_data_p;
>> + int ret;
>> +
>> + /* root complex access. */
>> + if (busno == 0)
>> + return pci_generic_config_read32(bus, devfn, where, size, val);
>> +
>> + cfg_data_p = iproc_pcie_map_ep_cfg_reg(pcie, busno, slot, fn, where);
>> +
>> + if (!cfg_data_p)
>> + return PCIBIOS_DEVICE_NOT_FOUND;
>> +
>> + ret = iproc_pcie_cfg_retry(cfg_data_p);
>> + if (ret)
>
> This relies on the fact that PCIBIOS_SUCCESSFUL == 0, which defeats
> the whole point of having a #define. You could test for
> PCIBIOS_SUCCESSFUL (but I think the restructuring below would be even
> better).
>
>> + return ret;
>> +
>> + *val = readl(cfg_data_p);
>
> This is an extra read. iproc_pcie_cfg_retry() read it once and got
> valid data (something other than CFG_RETRY_STATUS), returned
> PCIBIOS_SUCCESSFUL, and now you're reading it again.
>
> I think you should do something like this instead so you don't do the
> MMIO read any more times than necessary:
>
> static u32 iproc_pcie_cfg_retry(void __iomem *cfg_data_p)
> {
> u32 data;
> int timeout = CFG_RETRY_STATUS_TIMEOUT_US;
>
> data = readl(cfg_data_p);
> while (data == CFG_RETRY_STATUS && timeout--) {
> udelay(1);
> data = readl(cfg_data_p);
> }
>
> if (data == CFG_RETRY_STATUS)
> data = 0xffffffff;
> return data;
> }
>
> static int iproc_pcie_config_read(...)
> {
> u32 data;
>
> ...
> data = iproc_pcie_cfg_retry(cfg_data_p);
> if (size <= 2)
> *val = (data >> (8 * (where & 3))) & ((1 << (size * 8)) - 1);
>
> In the event of a timeout, this returns PCIBIOS_SUCCESSFUL and
> 0xffffffff data. That's what most other platforms do, and most
> callers of the PCI config accessors check for that data instead of
> checking the return code to see whether the access was successful.
>
I see one problem with this.
we have Samsung NVMe which exposes 64-bit IO BAR.
0xffff0001.

and if you see __pci_read_base which does following in sequence.

pci_read_config_dword(dev, pos, &l);
pci_write_config_dword(dev, pos, l | mask);
pci_read_config_dword(dev, pos, &sz);
pci_write_config_dword(dev, pos, l);

returning 0xffffffff would not be correct in that case.
even if callers retry, they will end up in a loop if caller retries.

hence I was returning 0xffff0001, it is upto the upper layer to treat
it as data or not.
let me know if this sounds like a problem to you as well.

so in my opinion returning 0xffffffff is not an option.

> For example, pci_flr_wait() assumes that if a read of PCI_COMMAND
> returns ~0, it's because the device isn't ready yet, and we should
> wait and retry.
>
>> +
>> + if (size <= 2)
>> + *val = (*val >> (8 * (where & 3))) & ((1 << (size * 8)) - 1);
>> +
>> + return PCIBIOS_SUCCESSFUL;
>> +}
>> +
>> static int iproc_pcie_config_read32(struct pci_bus *bus, unsigned int devfn,
>> int where, int size, u32 *val)
>> {
>> int ret;
>> + struct iproc_pcie *pcie = iproc_data(bus);
>>
>> iproc_pcie_apb_err_disable(bus, true);
>> - ret = pci_generic_config_read32(bus, devfn, where, size, val);
>> + if (pcie->type == IPROC_PCIE_PAXB_V2)
>> + ret = iproc_pcie_config_read(bus, devfn, where, size, val);
>> + else
>> + ret = pci_generic_config_read32(bus, devfn, where, size, val);
>> iproc_pcie_apb_err_disable(bus, false);
>>
>> return ret;
>> --
>> 1.9.1
>>