Re: [PATCH 3/3] tile pci: enable IOMMU to support DMA for legacy devices

From: Bjorn Helgaas
Date: Wed Jul 18 2012 - 12:50:59 EST


On Wed, Jul 18, 2012 at 10:15 AM, Chris Metcalf <cmetcalf@xxxxxxxxxx> wrote:
> On 7/13/2012 1:25 PM, Bjorn Helgaas wrote:
>> On Fri, Jul 13, 2012 at 11:52:11AM -0400, Chris Metcalf wrote:
>>> On 6/22/2012 7:24 AM, Bjorn Helgaas wrote:
>>>> This says that your entire physical address space (currently
>>>> 0x0-0xffffffff_ffffffff) is routed to the PCI bus, which is not true. I
>>>> think what you want here is pci_iomem_resource, but I'm not sure that's
>>>> set up correctly. It should contain the CPU physical addresses that are
>>>> routed to the PCI bus. Since you mention an offset, the PCI bus
>>>> addresses will be "CPU physical address - offset".
>>> Yes, we've changed it to use pci_iomem_resource. On TILE-Gx, there are two
>>> types of CPU physical addresses: physical RAM addresses and MMIO addresses.
>>> MMIO addresses carry the MMIO attribute in the page table, so the physical
>>> address spaces for RAM and PCI are completely separate. Instead, we have the
>>> following relationship: PCI bus address = PCI resource address - offset,
>>> where the PCI resource addresses are defined by pci_iomem_resource and are
>>> never generated by the CPU.
>> Does that mean the MMIO addresses are not accessible when the CPU
>> is in physical mode, and you can only reach them via a virtual address
>> mapped with the MMIO attribute? If so, then I guess you're basically
>> combining RAM addresses and MMIO addresses into iomem_resource by
>> using high "address bits" to represent the MMIO attribute?
>
> Yes.
>
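
[Editor's note: a minimal sketch, not the tile code itself; the resource
values and function names below are illustrative. It shows how such an
offset can be handed to the PCI core with pci_add_resource_offset(), so
that it computes "bus address = resource address - offset" when assigning
BARs.]

#include <linux/ioport.h>
#include <linux/pci.h>

/*
 * Illustrative only: a CPU-side MMIO aperture starting at 1 << 40.
 * These addresses are never generated directly by the CPU; they are
 * only reachable through a page-table mapping that carries the MMIO
 * attribute (e.g. one set up by ioremap()).
 */
static struct resource example_pci_iomem_resource = {
        .name   = "PCI mem",
        .start  = 1ULL << 40,
        .end    = (1ULL << 40) + 0xffffffffULL,
        .flags  = IORESOURCE_MEM,
};

static void example_add_bridge_resources(struct list_head *resources)
{
        /* PCI bus address = resource address - (1ULL << 40) */
        pci_add_resource_offset(resources, &example_pci_iomem_resource,
                                1ULL << 40);
}
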
>>> The TILE-Gx chip's CHIP_PA_WIDTH is 40 bits. In the following example, the
>>> system has 32GB of RAM installed, with 16GB on each of the two memory
>>> controllers. The first mvsas device's PCI memory resource is
>>> [0x100c0000000, 0x100c003ffff]; the corresponding PCI bus address range is
>>> [0xc0000000, 0xc003ffff] after subtracting the offset of (1ul << 40). In
>>> other words, the low 32 bits of the PCI MMIO address contain the PCI bus
>>> address.
>>>
>>> # cat /proc/iomem
>>> 00000000-3fbffffff : System RAM
>>> 00000000-007eeb1f : Kernel code
>>> 00860000-00af6e4b : Kernel data
>>> 4000000000-43ffffffff : System RAM
>>> 100c0000000-100c003ffff : mvsas
>>> 100c0040000-100c005ffff : mvsas
>>> 100c0200000-100c0203fff : sky2
>>> 100c0300000-100c0303fff : sata_sil24
>>> 100c0304000-100c030407f : sata_sil24
>>> 100c0400000-100c0403fff : sky2
>>>
>>> Note that in the above example, the two mvsas devices are in a separate PCI
>>> domain from the other four devices.
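
[Editor's note: a quick, purely illustrative sanity check of the offset
arithmetic in that example (user-space C, not kernel code).]

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t offset   = 1ULL << 40;         /* 0x10000000000 */
        uint64_t res_addr = 0x100c0000000ULL;   /* mvsas resource start */
        uint64_t bus_addr = res_addr - offset;  /* 0xc0000000 */

        printf("bus address = %#llx\n", (unsigned long long)bus_addr);
        return 0;
}
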
>> It sounds like you're describing something like this:
>>
>> host bridge 0
>> resource [mem 0x100_c0000000-0x100_c00fffff] (offset 0x100_00000000)
>> bus addr [mem 0xc0000000-0xc00fffff]
>> host bridge 2
>> resource [mem 0x100_c0200000-0x100_c02fffff] (offset 0x100_00000000)
>> bus addr [mem 0xc0200000-0xc02fffff]
>> host bridge 3
>> resource [mem 0x100_c0300000-0x100_c03fffff] (offset 0x100_00000000)
>> bus addr [mem 0xc0300000-0xc03fffff]
>>
>> If PCI bus addresses are simply the low 32 bits of the MMIO address,
>> there's nothing in the PCI core that should prevent you from giving a
>> full 4GB of bus address space to each bridge, e.g.:
>>
>> host bridge 0
>> resource [mem 0x100_00000000-0x100_ffffffff] (offset 0x100_00000000)
>> bus addr [mem 0x00000000-0xffffffff]
>> host bridge 2
>> resource [mem 0x102_00000000-0x102_ffffffff] (offset 0x102_00000000)
>> bus addr [mem 0x00000000-0xffffffff]
>> host bridge 3
>> resource [mem 0x103_00000000-0x103_ffffffff] (offset 0x103_00000000)
>> bus addr [mem 0x00000000-0xffffffff]
>
> Good idea! But we can't use all of the low addresses: a full 4GB BAR window
> won't work because we must leave some space, the low 3GB in our case, to
> allow 32-bit devices to DMA to RAM. If the low 32-bit space were all used
> for BARs, the host bridge wouldn't pass any DMA traffic to or from the low
> 4GB of RAM. We are going to use a separate MMIO range in [3GB, 4GB - 1] for
> each host bridge, with offset 0x10N_00000000 (see appended revised
> patch).
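
[Editor's note: a rough sketch of that scheme; the macro and function names
below are made up for illustration and the appended patch is the
authoritative version. Each domain N gets a 1GB MMIO aperture at bus
addresses [3GB, 4GB - 1], registered with offset 0x10N_00000000 so the
CPU-side resources never overlap and bus addresses below 3GB stay free for
32-bit DMA.]

#include <linux/ioport.h>
#include <linux/pci.h>
#include <linux/types.h>

#define EXAMPLE_MMIO_OFFSET(domain)     ((1ULL << 40) + ((u64)(domain) << 32))
#define EXAMPLE_BUS_MMIO_START          0xc0000000ULL   /* 3GB */
#define EXAMPLE_BUS_MMIO_END            0xffffffffULL   /* 4GB - 1 */

static void example_setup_domain_aperture(int domain, struct resource *res,
                                          struct list_head *resources)
{
        resource_size_t offset = EXAMPLE_MMIO_OFFSET(domain);

        res->name  = "PCI mem";
        res->flags = IORESOURCE_MEM;
        res->start = EXAMPLE_BUS_MMIO_START + offset;
        res->end   = EXAMPLE_BUS_MMIO_END + offset;

        /* Tell the PCI core: bus address = resource address - offset. */
        pci_add_resource_offset(resources, res, offset);
}

[For domain 0 this reproduces the mvsas example above: resource start
0x100c0000000, bus address 0xc0000000.]
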

OK. Interesting that the PIO (coming from CPU) and DMA (coming from
device) address spaces interact in this way.

>>> We use the same pci_iomem_resource for different domains or host
>>> bridges, but the MMIO apertures for each bridge do not overlap because
>>> non-overlapping resource ranges are allocated for each domain.
>> You should not use the same pci_iomem_resource for different host bridges
>> because that tells the PCI core that everything in pci_iomem_resource is
>> available for devices under every host bridge, which I doubt is the case.
>>
>> The fact that your firmware assigns non-overlapping resources is good and
>> works now, but if the kernel ever needs to allocate resources itself,
>
> Actually, we were not using any firmware. It was indeed the kernel that
> allocated resources from the shared pci_iomem_resource.

Wow. I wonder how that managed to work. Is there some information
that would have helped the PCI core do the right allocations? Or
maybe the host bridges forward everything they receive to PCI,
regardless of address, and any given MMIO address is only routed to
one of the host bridges because of the routing info in the page
tables? I guess in that case, the "apertures" would basically be
defined by the page tables, not by the host bridges. But that still
doesn't explain how we would assign non-overlapping ranges to each
domain.

Oh, well, I guess I don't need to understand that. But I *am* glad
that you updated the actual apertures to be separate, because we're
changing the core allocation routines, and if the apertures are
separate, we'll be less likely to break something for you.

>> the only way to do it correctly is to know what the actual apertures are
>> for each host bridge. Eventually, I think the host bridges will also
>> show up in /proc/iomem, which won't work if their apertures overlap.
>
> Fixed. Thanks!