Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

From: Alexey Kardashevskiy
Date: Tue Sep 03 2013 - 12:01:53 EST


On 09/03/2013 08:53 PM, Gleb Natapov wrote:
> On Mon, Sep 02, 2013 at 01:14:29PM +1000, Alexey Kardashevskiy wrote:
>> On 09/01/2013 10:06 PM, Gleb Natapov wrote:
>>> On Wed, Aug 28, 2013 at 06:50:41PM +1000, Alexey Kardashevskiy wrote:
>>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>>>> and H_STUFF_TCE requests targeted an IOMMU TCE table without passing
>>>> them to user space which saves time on switching to user space and back.
>>>>
>>>> Both real and virtual modes are supported. The kernel tries to
>>>> handle a TCE request in the real mode, if fails it passes the request
>>>> to the virtual mode to complete the operation. If it a virtual mode
>>>> handler fails, the request is passed to user space.
>>>>
>>>> The first user of this is VFIO on POWER. Trampolines to the VFIO external
>>>> user API functions are required for this patch.
>>>>
>>>> This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus
>>>> number (LIOBN) with an VFIO IOMMU group fd and enable in-kernel handling
>>>> of map/unmap requests. The device supports a single attribute which is
>>>> a struct with LIOBN and IOMMU fd. When the attribute is set, the device
>>>> establishes the connection between KVM and VFIO.
>>>>
>>>> Tests show that this patch increases transmission speed from 220MB/s
>>>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>>>
>>>> Signed-off-by: Paul Mackerras <paulus@xxxxxxxxx>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx>
>>>>
>>>> ---
>>>>
>>>> Changes:
>>>> v9:
>>>> * KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE IOMMU"
>>>> KVM device
>>>> * release_spapr_tce_table() is not shared between different TCE types
>>>> * reduced the patch size by moving VFIO external API
>>>> trampolines to separate patche
>>>> * moved documentation from Documentation/virtual/kvm/api.txt to
>>>> Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>>>
>>>> v8:
>>>> * fixed warnings from check_patch.pl
>>>>
>>>> 2013/07/11:
>>>> * removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
>>>> for KVM_BOOK3S_64
>>>> * kvmppc_gpa_to_hva_and_get also returns host phys address. Not much sense
>>>> for this here but the next patch for hugepages support will use it more.
>>>>
>>>> 2013/07/06:
>>>> * added realmode arch_spin_lock to protect TCE table from races
>>>> in real and virtual modes
>>>> * POWERPC IOMMU API is changed to support real mode
>>>> * iommu_take_ownership and iommu_release_ownership are protected by
>>>> iommu_table's locks
>>>> * VFIO external user API use rewritten
>>>> * multiple small fixes
>>>>
>>>> 2013/06/27:
>>>> * tce_list page is referenced now in order to protect it from accident
>>>> invalidation during H_PUT_TCE_INDIRECT execution
>>>> * added use of the external user VFIO API
>>>>
>>>> 2013/06/05:
>>>> * changed capability number
>>>> * changed ioctl number
>>>> * update the doc article number
>>>>
>>>> 2013/05/20:
>>>> * removed get_user() from real mode handlers
>>>> * kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
>>>> translated TCEs, tries realmode_get_page() on those and if it fails, it
>>>> passes control over the virtual mode handler which tries to finish
>>>> the request handling
>>>> * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
>>>> on a page
>>>> * The only reason to pass the request to user mode now is when the user mode
>>>> did not register TCE table in the kernel, in all other cases the virtual mode
>>>> handler is expected to do the job
>>>> ---
>>>> .../virtual/kvm/devices/spapr_tce_iommu.txt | 37 +++
>>>> arch/powerpc/include/asm/kvm_host.h | 4 +
>>>> arch/powerpc/kvm/book3s_64_vio.c | 310 ++++++++++++++++++++-
>>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 122 ++++++++
>>>> arch/powerpc/kvm/powerpc.c | 1 +
>>>> include/linux/kvm_host.h | 1 +
>>>> virt/kvm/kvm_main.c | 5 +
>>>> 7 files changed, 477 insertions(+), 3 deletions(-)
>>>> create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>>>
>>>> diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>>> new file mode 100644
>>>> index 0000000..4bc8fc3
>>>> --- /dev/null
>>>> +++ b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>>> @@ -0,0 +1,37 @@
>>>> +SPAPR TCE IOMMU device
>>>> +
>>>> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
>>>> +Architectures: powerpc
>>>> +
>>>> +Device type supported: KVM_DEV_TYPE_SPAPR_TCE_IOMMU
>>>> +
>>>> +Groups:
>>>> + KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE
>>>> + Attributes: single attribute with pair { LIOBN, IOMMU fd}
>>>> +
>>>> +This is completely made up device which provides API to link
>>>> +logical bus number (LIOBN) and IOMMU group. The user space has
>>>> +to create a new SPAPR TCE IOMMU device per a logical bus.
>>>> +
>>> Why not have one device that can handle multimple links?
>>
>>
>> I can do that. If I make it so, it won't even look as a device at all, just
>> some weird interface to KVM but ok. What bothers me is it is just a
> May be I do not understand usage pattern here. Why do you feel that device
> that can handle multiple links is worse than device per link? How many logical
> buses is there usually? How often they created/destroyed? I am not insisting
> on the change, just trying to understand why you do not like it.


Is it usually one PCI host bus adapter per IOMMU group which is usually
one PCI card or 2-3 cards if it is a legacy PCI-X, and they are created
when QEMU-KVM starts. Not many. And they live till KVM ends.

My point is why would I want to put all links to one device? It all is just
a matter of taste and nothing more. Or I am missing something but I do not
see what. If it is all about making thing to be kosher/halal/orthodox, then
I have more stuff to do, like reworking the emulated TCEs. But if is it for
(I do not know, just guessing) performance or something like that - then
I'll fix it, I just need to know what I am fixing.



>> question what I will have to do next. Because I can easily predict a
>> suggestion to move kvmppc_spapr_tce_table's (a links list) from
>> kvm->arch.spapr_tce_tables to that device but I cannot do that for obvious
>> compatibility reasons caused by the fact that the list is already used for
>> emulated devices (for the starter - they need mmap()).
>>
>> Or supporting all IOMMU links (and leaving emulated stuff as is) in on
>> "device" is the last thing I have to do and then you'll ack the patch?
>>
> I am concerned more about API here. Internal implementation details I
> leave to powerpc experts :)


The Expert (Ben) wants capabilities number and API to get fixed in KVM tree :)


>
>>
>>
>>>> +LIOBN is a PCI bus identifier from PPC64-server (sPAPR) DMA hypercalls
>>>> +(H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE).
>>>> +IOMMU group is a minimal isolated device set which can be passed to
>>>> +the user space via VFIO.
>>>> +
>>>> +Right after creation the device is in uninitlized state and requires
>>>> +a KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE attribute to be set.
>>>> +The attribute contains liobn, IOMMU fd and flags:
>>>> +
>>>> +struct kvm_create_spapr_tce_iommu_linkage {
>>>> + __u64 liobn;
>>>> + __u32 fd;
>>>> + __u32 flags;
>>>> +};
>>>> +
>>>> +The user space creates the SPAPR TCE IOMMU device, obtains
>>>> +an IOMMU fd via VFIO ABI and sets the attribute to the SPAPR TCE IOMMU
>>>> +device. At the moment of setting the attribute, the SPAPR TCE IOMMU
>>>> +device links LIOBN to IOMMU group and makes necessary steps
>>>> +to make sure that VFIO group will not disappear before KVM destroys.
>>>> +
>>>> +The kernel advertises this feature via KVM_CAP_SPAPR_TCE_IOMMU capability.
>>> [skip]
>>
>> Yes, I read the other comment. So roughly speaking I'll replace the
>> KVM_CAP_SPAPR_TCE_IOMMU check with the KVM_CAP_DEVICE_CTRL capability check
>> + try to KVM_CREATE_DEVICE with the KVM_CREATE_DEVICE_TEST flag set, and we
>> are fine.
> Yes, but KVM_CREATE_DEVICE_TEST does not create device, only checks if
> device type is supported.

Sure.



--
Alexey
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/