Re: [PATCH 3/3] tile pci: enable IOMMU to support DMA for legacy devices

From: Chris Metcalf
Date: Wed Jul 18 2012 - 12:15:46 EST


On 7/13/2012 1:25 PM, Bjorn Helgaas wrote:
> On Fri, Jul 13, 2012 at 11:52:11AM -0400, Chris Metcalf wrote:
>> On 6/22/2012 7:24 AM, Bjorn Helgaas wrote:
>>> This says that your entire physical address space (currently
>>> 0x0-0xffffffff_ffffffff) is routed to the PCI bus, which is not true. I
>>> think what you want here is pci_iomem_resource, but I'm not sure that's
>>> set up correctly. It should contain the CPU physical address that are
>>> routed to the PCI bus. Since you mention an offset, the PCI bus
>>> addresses will "CPU physical address - offset".
>> Yes, we've changed it to use pci_iomem_resource. On TILE-Gx, there are two
>> types of CPU physical addresses: physical RAM addresses and MMIO addresses.
>> The MMIO address has the MMIO attribute in the page table. So, the physical
>> address spaces for the RAM and the PCI are completely separate. Instead, we
>> have the following relationship: PCI bus address = PCI resource address -
>> offset, where the PCI resource addresses are defined by pci_iomem_resource
>> and they are never generated by the CPU.
> Does that mean the MMIO addresses are not accessible when the CPU
> is in physical mode, and you can only reach them via a virtual address
> mapped with the MMIO attribute? If so, then I guess you're basically
> combining RAM addresses and MMIO addresses into iomem_resource by
> using high "address bits" to represent the MMIO attribute?

Yes.

>> The TILE-Gx chip's CHIP_PA_WIDTH is 40-bit. In the following example, the
>> system has 32GB RAM installed, with 16GB in each of the 2 memory
>> controllers. For the first mvsas device, its PCI memory resource is
>> [0x100c0000000, 0x100c003ffff], the corresponding PCI bus address range is
>> [0xc0000000, 0xc003ffff] after subtracting the offset of (1ul << 40). The
>> aforementioned PCI MMIO address's low 32-bits contains the PCI bus address.
>>
>> # cat /proc/iomem
>> 00000000-3fbffffff : System RAM
>> 00000000-007eeb1f : Kernel code
>> 00860000-00af6e4b : Kernel data
>> 4000000000-43ffffffff : System RAM
>> 100c0000000-100c003ffff : mvsas
>> 100c0040000-100c005ffff : mvsas
>> 100c0200000-100c0203fff : sky2
>> 100c0300000-100c0303fff : sata_sil24
>> 100c0304000-100c030407f : sata_sil24
>> 100c0400000-100c0403fff : sky2
>>
>> Note that in above example, the 2 mvsas devices are in a separate PCI
>> domain than the other 4 devices.
> It sounds like you're describing something like this:
>
> host bridge 0
> resource [mem 0x100_c0000000-0x100_c00fffff] (offset 0x100_00000000)
> bus addr [mem 0xc0000000-0xc00fffff]
> host bridge 2
> resource [mem 0x100_c0200000-0x100_c02fffff] (offset 0x100_00000000)
> bus addr [mem 0xc0200000-0xc02fffff]
> host bridge 3
> resource [mem 0x100_c0300000-0x100_c03fffff] (offset 0x100_00000000)
> bus addr [mem 0xc0300000-0xc03fffff]
>
> If PCI bus addresses are simply the low 32 bits of the MMIO address,
> there's nothing in the PCI core that should prevent you from giving a
> full 4GB of bus address space to each bridge, e.g.:
>
> host bridge 0
> resource [mem 0x100_00000000-0x100_ffffffff] (offset 0x100_00000000)
> bus addr [mem 0x00000000-0xffffffff]
> host bridge 2
> resource [mem 0x102_00000000-0x102_ffffffff] (offset 0x102_00000000)
> bus addr [mem 0x00000000-0xffffffff]
> host bridge 3
> resource [mem 0x103_00000000-0x103_ffffffff] (offset 0x103_00000000)
> bus addr [mem 0x00000000-0xffffffff]

Good idea! But we can't use all the low addresses, i.e. a 4GB BAR window
won't work because we must leave some space, i.e. the low 3GB in our case,
to allow the 32-bit devices access to the RAM. If the low 32-bit space is
all used for BAR, the host bridge won't pass any DMA traffic to and from
the low 4GB RAM. We are going to use a separate MMIO range in [3GB, 4GB -
1] for each host bridge, with offset 0x10N_00000000 (see appended revised
patch).
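
To make the layout concrete, here is a minimal standalone sketch of the
per-bridge arithmetic, not the kernel code itself. The constant values are
assumptions inferred from this thread (40-bit PA width, 4GB of bus address
space per bridge, top 1GB used as the BAR window), and the names
bridge_window_for() and resource_to_bus() are hypothetical helpers for
illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, inferred from the discussion; not copied from the tree. */
#define PA_WIDTH        40                  /* CHIP_PA_WIDTH on TILE-Gx */
#define MEM_START       (1ULL << PA_WIDTH)  /* first address above the PA space */
#define BAR_WINDOW_TOP  (1ULL << 32)        /* 4GB of bus addresses per bridge */
#define BAR_WINDOW_SIZE (1ULL << 30)        /* top 1GB is the MMIO aperture */

/* Hypothetical per-bridge description, for illustration only. */
struct bridge_window {
	uint64_t mem_offset;  /* resource address minus PCI bus address */
	uint64_t start;       /* first resource address of the aperture */
	uint64_t end;         /* last resource address of the aperture */
};

/* Compute the aperture for host bridge i: [3GB, 4GB - 1] above its offset. */
static struct bridge_window bridge_window_for(unsigned int i)
{
	struct bridge_window w;

	w.mem_offset = MEM_START + (uint64_t)i * BAR_WINDOW_TOP;
	w.start = w.mem_offset + BAR_WINDOW_TOP - BAR_WINDOW_SIZE;
	w.end = w.mem_offset + BAR_WINDOW_TOP - 1;
	return w;
}

/* PCI bus address = PCI resource address - offset. */
static uint32_t resource_to_bus(const struct bridge_window *w, uint64_t res)
{
	assert(res >= w->start && res <= w->end);
	return (uint32_t)(res - w->mem_offset);
}
```

Under these assumptions, bridge 0 gets resource range
[0x100_c0000000, 0x100_ffffffff] and bridge 2 gets
[0x102_c0000000, 0x102_ffffffff], so the apertures never overlap, while
the bus addresses under every bridge stay in [0xc0000000, 0xffffffff] and
the low 3GB of bus address space remains free for DMA to RAM.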

>> We use the same pci_iomem_resource for different domains or host
>> bridges, but the MMIO apertures for each bridge do not overlap because
>> non-overlapping resource ranges are allocated for each domains.
> You should not use the same pci_iomem_resource for different host bridges
> because that tells the PCI core that everything in pci_iomem_resource is
> available for devices under every host bridge, which I doubt is the case.
>
> The fact that your firmware assigns non-overlapping resources is good and
> works now, but if the kernel ever needs to allocate resources itself,

Actually, we were not using any firmware. It was the kernel itself that
allocated resources from the shared pci_iomem_resource.

> the only way to do it correctly is to know what the actual apertures are
> for each host bridge. Eventually, I think the host bridges will also
> show up in /proc/iomem, which won't work if their apertures overlap.

Fixed. Thanks!

diff --git a/arch/tile/include/asm/pci.h b/arch/tile/include/asm/pci.h
index 553b7ff..1ab2a58 100644
--- a/arch/tile/include/asm/pci.h
+++ b/arch/tile/include/asm/pci.h
@@ -128,15 +128,10 @@ static inline void pci_iounmap(struct pci_dev *dev, void __iomem *addr) {}
#define TILE_PCI_MEM_MAP_BASE_OFFSET (1ULL << CHIP_PA_WIDTH())

/*
- * End of the PCI memory resource.
+ * Start of the PCI memory resource, which starts at the end of the
+ * maximum system physical RAM address.
*/
-#define TILE_PCI_MEM_END \
- ((1ULL << CHIP_PA_WIDTH()) + TILE_PCI_BAR_WINDOW_TOP)
-
-/*
- * Start of the PCI memory resource.
- */
-#define TILE_PCI_MEM_START (TILE_PCI_MEM_END - TILE_PCI_BAR_WINDOW_SIZE)
+#define TILE_PCI_MEM_START (1ULL << CHIP_PA_WIDTH())

/*
* Structure of a PCI controller (host bridge) on Gx.
@@ -161,6 +156,7 @@ struct pci_controller {

uint64_t mem_offset; /* cpu->bus memory mapping offset. */

+ int first_busno;
int last_busno;

struct pci_ops *ops;
@@ -169,6 +165,7 @@ struct pci_controller {
int irq_intx_table[4];

struct resource mem_space;
+ char mem_name[32];

/* Address ranges that are routed to this controller/bridge. */
struct resource mem_resources[3];
@@ -179,14 +176,6 @@ extern gxio_trio_context_t trio_contexts[TILEGX_NUM_TRIO];

extern void pci_iounmap(struct pci_dev *dev, void __iomem *);

-extern void
-pcibios_resource_to_bus(struct pci_dev *dev, struct pci_bus_region *region,
- struct resource *res);
-
-extern void
-pcibios_bus_to_resource(struct pci_dev *dev, struct resource *res,
- struct pci_bus_region *region);
-
/*
* The PCI address space does not equal the physical memory address
* space (we have an IOMMU). The IDE and SCSI device layers use this
diff --git a/arch/tile/kernel/pci_gx.c b/arch/tile/kernel/pci_gx.c
index 27f7ab0..7d854b5 100644
--- a/arch/tile/kernel/pci_gx.c
+++ b/arch/tile/kernel/pci_gx.c
@@ -96,21 +96,6 @@ static struct pci_ops tile_cfg_ops;
/* Mask of CPUs that should receive PCIe interrupts. */
static struct cpumask intr_cpus_map;

-/* PCI I/O space support is not implemented. */
-static struct resource pci_ioport_resource = {
- .name = "PCI IO",
- .start = 0,
- .end = 0,
- .flags = IORESOURCE_IO,
-};
-
-static struct resource pci_iomem_resource = {
- .name = "PCI mem",
- .start = TILE_PCI_MEM_START,
- .end = TILE_PCI_MEM_END,
- .flags = IORESOURCE_MEM,
-};
-
/*
* We don't need to worry about the alignment of resources.
*/
@@ -440,6 +425,23 @@ out:
controller->last_busno = 0xff;
controller->ops = &tile_cfg_ops;

+ /*
+ * The PCI memory resource is located above the PA space.
+ * For every host bridge, the BAR window or the MMIO aperture
+ * is in range [3GB, 4GB - 1] of a 4GB space beyond the
+ * PA space.
+ */
+
+ controller->mem_offset = TILE_PCI_MEM_START +
+ (i * TILE_PCI_BAR_WINDOW_TOP);
+ controller->mem_space.start = controller->mem_offset +
+ TILE_PCI_BAR_WINDOW_TOP - TILE_PCI_BAR_WINDOW_SIZE;
+ controller->mem_space.end = controller->mem_offset +
+ TILE_PCI_BAR_WINDOW_TOP - 1;
+ controller->mem_space.flags = IORESOURCE_MEM;
+ snprintf(controller->mem_name, sizeof(controller->mem_name),
+ "PCI mem domain %d", i);
+ controller->mem_space.name = controller->mem_name;
}

return num_rc_controllers;
@@ -588,6 +590,7 @@ int __init pcibios_init(void)
{
resource_size_t offset;
LIST_HEAD(resources);
+ int next_busno;
int i;

tile_pci_init();
@@ -628,7 +631,7 @@ int __init pcibios_init(void)
msleep(250);

/* Scan all of the recorded PCI controllers. */
- for (i = 0; i < num_rc_controllers; i++) {
+ for (next_busno = 0, i = 0; i < num_rc_controllers; i++) {
struct pci_controller *controller = &pci_controllers[i];
gxio_trio_context_t *trio_context = controller->trio;
TRIO_PCIE_INTFC_PORT_CONFIG_t port_config;
@@ -843,13 +846,14 @@ int __init pcibios_init(void)
* The memory range for the PCI root bus should not overlap
* with the physical RAM
*/
- pci_add_resource_offset(&resources, &iomem_resource,
- 1ULL << CHIP_PA_WIDTH());
+ pci_add_resource_offset(&resources, &controller->mem_space,
+ controller->mem_offset);

- bus = pci_scan_root_bus(NULL, 0, controller->ops,
+ controller->first_busno = next_busno;
+ bus = pci_scan_root_bus(NULL, next_busno, controller->ops,
controller, &resources);
controller->root_bus = bus;
- controller->last_busno = bus->subordinate;
+ next_busno = bus->subordinate + 1;

}

@@ -1011,20 +1015,9 @@ alloc_mem_map_failed:
}
subsys_initcall(pcibios_init);

-/*
- * PCI scan code calls the arch specific pcibios_fixup_bus() each time it scans
- * a new bridge. Called after each bus is probed, but before its children are
- * examined.
- */
+/* Note: to be deleted after Linux 3.6 merge. */
void __devinit pcibios_fixup_bus(struct pci_bus *bus)
{
- struct pci_dev *dev = bus->self;
-
- if (!dev) {
- /* This is the root bus. */
- bus->resource[0] = &pci_ioport_resource;
- bus->resource[1] = &pci_iomem_resource;
- }
}

/*
@@ -1124,7 +1117,10 @@ void __iomem *ioremap(resource_size_t phys_addr, unsigned long size)
got_it:
trio_fd = controller->trio->fd;

- offset = HV_TRIO_PIO_OFFSET(controller->pio_mem_index) + phys_addr;
+ /* Convert the resource start to the bus address offset. */
+ start = phys_addr - controller->mem_offset;
+
+ offset = HV_TRIO_PIO_OFFSET(controller->pio_mem_index) + start;

/*
* We need to keep the PCI bus address's in-page offset in the VA.
@@ -1172,11 +1168,11 @@ static int __devinit tile_cfg_read(struct pci_bus *bus,
void *mmio_addr;

/*
- * Map all accesses to the local device (bus == 0) into the
+ * Map all accesses to the local device on root bus into the
* MMIO space of the MAC. Accesses to the downstream devices
* go to the PIO space.
*/
- if (busnum == 0) {
+ if (pci_is_root_bus(bus)) {
if (device == 0) {
/*
* This is the internal downstream P2P bridge,
@@ -1205,11 +1201,11 @@ static int __devinit tile_cfg_read(struct pci_bus *bus,
}

/*
- * Accesses to the directly attached device (bus == 1) have to be
+ * Accesses to the directly attached device have to be
* sent as type-0 configs.
*/

- if (busnum == 1) {
+ if (busnum == (controller->first_busno + 1)) {
/*
* There is only one device off of our built-in P2P bridge.
*/
@@ -1303,11 +1299,11 @@ static int __devinit tile_cfg_write(struct pci_bus *bus,
u8 val_8 = (u8)val;

/*
- * Map all accesses to the local device (bus == 0) into the
+ * Map all accesses to the local device on root bus into the
* MMIO space of the MAC. Accesses to the downstream devices
* go to the PIO space.
*/
- if (busnum == 0) {
+ if (pci_is_root_bus(bus)) {
if (device == 0) {
/*
* This is the internal downstream P2P bridge,
@@ -1336,11 +1332,11 @@ static int __devinit tile_cfg_write(struct pci_bus *bus,
}

/*
- * Accesses to the directly attached device (bus == 1) have to be
+ * Accesses to the directly attached device have to be
* sent as type-0 configs.
*/

- if (busnum == 1) {
+ if (busnum == (controller->first_busno + 1)) {
/*
* There is only one device off of our built-in P2P bridge.
*/
diff --git a/arch/tile/kernel/setup.c b/arch/tile/kernel/setup.c
index 2b8b689..6a649a4 100644
--- a/arch/tile/kernel/setup.c
+++ b/arch/tile/kernel/setup.c
@@ -1536,8 +1536,7 @@ static struct resource code_resource = {

/*
* On Pro, we reserve all resources above 4GB so that PCI won't try to put
- * mappings above 4GB; the standard allows that for some devices but
- * the probing code trunates values to 32 bits.
+ * mappings above 4GB.
*/
#if defined(CONFIG_PCI) && !defined(__tilegx__)
static struct resource* __init
@@ -1584,7 +1583,6 @@ static int __init request_standard_resources(void)
int i;
enum { CODE_DELTA = MEM_SV_INTRPT - PAGE_OFFSET };

- iomem_resource.end = -1LL;
#if defined(CONFIG_PCI) && !defined(__tilegx__)
insert_non_bus_resource();
#endif

--
Chris Metcalf, Tilera Corp.
http://www.tilera.com
