Re: [RFC] Question about enable doorbell irq and halt_poll process

From: Tangnianyao (ICT)
Date: Sun Apr 28 2019 - 22:29:30 EST




On 2019/4/23 18:00, Marc Zyngier wrote:

Hi, Marc

> On Tue, 23 Apr 2019 08:44:17 +0100,
> "Tangnianyao (ICT)" <tangnianyao@xxxxxxxxxx> wrote:
>>
>> Hi, Marc
>
> [...]
>
>> I've learned that there's some implementation problem for the PCIe
>> controller of Hi1616, the processor of D05. The PCIe ACS was not
>> implemented properly and D05 doesn't support VM using pcie vf.
>
> My D05 completely disagrees with you:
>
> root@unassigned-hostname:~# lspci -v
> 00:00.0 Host bridge: Red Hat, Inc. QEMU PCIe Host bridge
> Subsystem: Red Hat, Inc QEMU PCIe Host bridge
> Flags: fast devsel
> lspci: Unable to load libkmod resources: error -12
>
> 00:01.0 Ethernet controller: Red Hat, Inc Virtio network device (rev 01)
> Subsystem: Red Hat, Inc Virtio network device
> Flags: bus master, fast devsel, latency 0, IRQ 40
> Memory at 10040000 (32-bit, non-prefetchable) [size=4K]
> Memory at 8000000000 (64-bit, prefetchable) [size=16K]
> Expansion ROM at 10000000 [disabled] [size=256K]
> Capabilities: [98] MSI-X: Enable+ Count=3 Masked-
> Capabilities: [84] Vendor Specific Information: VirtIO: <unknown>
> Capabilities: [70] Vendor Specific Information: VirtIO: Notify
> Capabilities: [60] Vendor Specific Information: VirtIO: DeviceCfg
> Capabilities: [50] Vendor Specific Information: VirtIO: ISR
> Capabilities: [40] Vendor Specific Information: VirtIO: CommonCfg
> Kernel driver in use: virtio-pci
>
> 00:02.0 SCSI storage controller: Red Hat, Inc Virtio block device (rev 01)
> Subsystem: Red Hat, Inc Virtio block device
> Flags: bus master, fast devsel, latency 0, IRQ 41
> Memory at 10041000 (32-bit, non-prefetchable) [size=4K]
> Memory at 8000004000 (64-bit, prefetchable) [size=16K]
> Capabilities: [98] MSI-X: Enable+ Count=2 Masked-
> Capabilities: [84] Vendor Specific Information: VirtIO: <unknown>
> Capabilities: [70] Vendor Specific Information: VirtIO: Notify
> Capabilities: [60] Vendor Specific Information: VirtIO: DeviceCfg
> Capabilities: [50] Vendor Specific Information: VirtIO: ISR
> Capabilities: [40] Vendor Specific Information: VirtIO: CommonCfg
> Kernel driver in use: virtio-pci
>
> 00:03.0 SCSI storage controller: Red Hat, Inc Virtio SCSI (rev 01)
> Subsystem: Red Hat, Inc Virtio SCSI
> Flags: bus master, fast devsel, latency 0, IRQ 42
> Memory at 10042000 (32-bit, non-prefetchable) [size=4K]
> Memory at 8000008000 (64-bit, prefetchable) [size=16K]
> Capabilities: [98] MSI-X: Enable+ Count=4 Masked-
> Capabilities: [84] Vendor Specific Information: VirtIO: <unknown>
> Capabilities: [70] Vendor Specific Information: VirtIO: Notify
> Capabilities: [60] Vendor Specific Information: VirtIO: DeviceCfg
> Capabilities: [50] Vendor Specific Information: VirtIO: ISR
> Capabilities: [40] Vendor Specific Information: VirtIO: CommonCfg
> Kernel driver in use: virtio-pci
>
> 00:04.0 Ethernet controller: Intel Corporation I350 Ethernet Controller Virtual Function (rev 01)
> Subsystem: Intel Corporation I350 Ethernet Controller Virtual Function
> Flags: bus master, fast devsel, latency 0
> Memory at 800000c000 (64-bit, prefetchable) [size=16K]
> Memory at 8000010000 (64-bit, prefetchable) [size=16K]
> Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
> Capabilities: [a0] Express Root Complex Integrated Endpoint, MSI 00
> Capabilities: [100] Advanced Error Reporting
> Capabilities: [1a0] Transaction Processing Hints
> Capabilities: [1d0] Access Control Services
> Kernel driver in use: igbvf
>
> root@unassigned-hostname:~# uptime
> 05:40:45 up 27 days, 17:16, 1 user, load average: 4.10, 4.05, 4.01
>

I have made a new quirk to fix the ACS problem on Hi1616, to enable VMs
to use PCIe VFs.
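
For reference, such a quirk follows the usual pattern in drivers/pci/quirks.c:
a callback that reports the isolation the hardware really provides, plus an
entry in the pci_dev_acs_enabled[] table. A minimal sketch only; the device
ID (0xa130) and the exact set of flags claimed below are placeholders, not
the real Hi1616 values:

/*
 * Sketch of an ACS quirk for the Hi1616 root ports. The device ID
 * and the flags claimed here are illustrative placeholders.
 */
static int pci_quirk_hisi1616_acs(struct pci_dev *dev, u16 acs_flags)
{
	/*
	 * Assumption: the root port does not implement the ACS
	 * capability, but provides the equivalent of source validation,
	 * request/completion redirect and upstream forwarding.
	 */
	acs_flags &= ~(PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);

	/* Report ACS as effectively enabled only if nothing else is requested. */
	return acs_flags ? 0 : 1;
}

/* ... plus an entry in the pci_dev_acs_enabled[] table: */
	{ PCI_VENDOR_ID_HUAWEI, 0xa130, pci_quirk_hisi1616_acs },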

> For something that isn't supposed to work, it is pretty good. So
> please test it and let me know how it fares. At this stage, not
> regressing deployed HW is more important than improving the situation
> on HW that nobody has.
>
>>
>>>
>>>>
>>>>>
>>>>>> Compared to the default halt_poll_ns of 500000ns, the cost of
>>>>>> enabling doorbells is not large at all.
>>>>>
>>>>> Sure. But I'm not sure this is a universal figure.
>>>>>
>>>>>>
>>>>>>> Frankly, you want to be careful with that. I'd rather enable them late
>>>>>>> and have a chance of not blocking because of another (virtual)
>>>>>>> interrupt, which saves us the doorbell business.
>>>>>>>
>>>>>>> I wonder if you wouldn't be in a better position by drastically
>>>>>>> reducing halt_poll_ns for vcpus that can have directly injected
>>>>>>> interrupts.
>>>>>>>
>>>>>>
>>>>>> If we set halt_poll_ns to a small value, the chance of not
>>>>>> blocking, with the vcpu being scheduled out, becomes larger, and
>>>>>> scheduling a vcpu out is quite expensive. In many cases one pcpu
>>>>>> is assigned to only one vcpu, and halt_poll_ns is set quite large
>>>>>> to increase the chance of the halt_poll process being terminated
>>>>>> early. This is the reason we want to set halt_poll_ns large, and
>>>>>> we hope a vlpi can stop the halt_poll process in time.
>>>>>
>>>>> Fair enough. It is certainly realistic to strictly partition the
>>>>> system when GICv4 is in use, so I can see some potential benefit.
>>>>>
>>>>>> Though kvm_vcpu_check_block will check again whether the vcpu is
>>>>>> runnable, so a vlpi can still prevent the vcpu from being
>>>>>> scheduled out, that happens somewhat late when halt_poll_ns is large.
>>>>>>
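
(For context: the reason a vlpi cannot terminate halt polling by itself is
that the poll loop in kvm_vcpu_block() only sees interrupts the hypervisor
knows about, while a GICv4 vlpi is injected directly to the vPE. A
simplified sketch of the polling phase, with the stats bookkeeping omitted:)

do {
	/*
	 * kvm_vcpu_check_block() ends up in kvm_arch_vcpu_runnable(),
	 * which tests hypervisor-visible pending interrupts. A directly
	 * injected vlpi bypasses the hypervisor entirely, so without the
	 * doorbell enabled it can never make this test succeed.
	 */
	if (kvm_vcpu_check_block(vcpu) < 0)
		goto out;
	cur = ktime_get();
} while (single_task_running() && ktime_before(cur, stop));
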
>>>>>>> In any case, this is something that we should measure, not guess.
>>>>>
>>>>> Please post results of realistic benchmarks that we can reproduce,
>>>>> with and without this change. I'm willing to entertain the idea, but I
>>>>> need more than just a vague number.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> M.
>>>>>
>>>>
>>>> I tested it with and without the change (patch attached at the end).
>>>> halt_poll_ns is kept at the default, 500000ns.
>>>> I have merged the patch "arm64: KVM: Always set ICH_HCR_EL2.EN if GICv4 is enabled"
>>>> into my test kernel, to make sure ICH_HCR_EL2.EN=1 in the guest.
>>>>
>>>> netperf result:
>>>> D06 as server, intel 8180 server as client
>>>> with change:
>>>> packet 512 bytes - 5400 Mbits/s
>>>> packet 64 bytes - 740 Mbits/s
>>>> without change:
>>>> packet 512 bytes - 5000 Mbits/s
>>>> packet 64 bytes - 710 Mbits/s
>>>>
>>>> I have also tested with D06 as the client and the intel machine as
>>>> the server; with the change, the result remains the same.
>>>
>>> So anywhere between 4% and 8% improvement. Please post results for D05
>>> once you've found one.
>>>
>>>>
>>>>
>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>> index 55fe8e2..0f56904 100644
>>>> --- a/virt/kvm/kvm_main.c
>>>> +++ b/virt/kvm/kvm_main.c
>>>> @@ -2256,6 +2256,16 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>>>>  	if (vcpu->halt_poll_ns) {
>>>>  		ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
>>>>
>>>> +#ifdef CONFIG_ARM64
>>>> +		/*
>>>> +		 * When using gicv4, enable the doorbell before the halt-poll
>>>> +		 * wait so that a direct vlpi can stop halt polling.
>>>> +		 */
>>>> +		if (vcpu->arch.vgic_cpu.vgic_v3.its_vpe.its_vm) {
>>>> +			kvm_vgic_v4_enable_doorbell(vcpu);
>>>> +		}
>>>> +#endif
>>>> +
>>>
>>> Irk. No. You're now leaving doorbells enabled when going back to the
>>> guest, and that's pretty bad as the whole logic relies on doorbells
>>> being disabled if the guest can run.
>>>
>>> So please try this instead on top of mainline. And it has to be on top
>>> of mainline as we've changed the way timer interrupts work in 5.1 --
>>> see accb99bcd0ca ("KVM: arm/arm64: Simplify bg_timer programming").
>>>
>
> [...]
>
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index f25aa98a94df..0ae4ad5dcb12 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -2252,6 +2252,8 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>>>  	bool waited = false;
>>>  	u64 block_ns;
>>>
>>> +	kvm_arch_vcpu_blocking(vcpu);
>>> +
>>>  	start = cur = ktime_get();
>>>  	if (vcpu->halt_poll_ns) {
>>>  		ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
>>> @@ -2272,8 +2274,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>>>  		} while (single_task_running() && ktime_before(cur, stop));
>>>  	}
>>>
>>> -	kvm_arch_vcpu_blocking(vcpu);
>>> -
>>>  	for (;;) {
>>>  		prepare_to_swait_exclusive(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
>>>
>>> @@ -2287,8 +2287,8 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>>>  	finish_swait(&vcpu->wq, &wait);
>>>  	cur = ktime_get();
>>>
>>> -	kvm_arch_vcpu_unblocking(vcpu);
>>>  out:
>>> +	kvm_arch_vcpu_unblocking(vcpu);
>>>  	block_ns = ktime_to_ns(cur) - ktime_to_ns(start);
>>>
>>>  	if (!vcpu_valid_wakeup(vcpu))
>>> Thanks,
>>>
>>> M.
>>>
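
(To spell out why moving kvm_arch_vcpu_blocking() helps: on arm/arm64 these
hooks are where the GICv4 doorbell is toggled, roughly as below in kernels
of that time, so hoisting the call above the poll loop arms the doorbell for
the whole polling window and disarms it again on every exit path:)

/* virt/kvm/arm/arm.c, roughly as of v5.1 */
void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
{
	/* Let a pending vlpi fire the doorbell and wake the vcpu. */
	kvm_vgic_v4_enable_doorbell(vcpu);
}

void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu)
{
	/* The vcpu can run again: go back to direct injection. */
	kvm_vgic_v4_disable_doorbell(vcpu);
}
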
>>
>> I've tested your patch and the results are as follows:
>>
>> netperf result:
>> D06 as server, intel 8180 server as client
>> with change:
>> packet 512 bytes - 5500 Mbits/s
>> packet 64 bytes - 760 Mbits/s
>> without change:
>> packet 512 bytes - 5000 Mbits/s
>> packet 64 bytes - 710 Mbits/s
>
> OK, that's pretty good. Let me know once you've tested it on D05.
>
> Thanks,
>
> M.
>

The average cost of kvm_vgic_v4_enable_doorbell on D05 is 0.74us,
while it is 0.35us on D06.
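
(One way to measure that per-call cost, as a sketch; timestamps taken around
the call and averaged over many samples, not necessarily the exact
instrumentation used here:)

u64 t0, t1;

t0 = ktime_get_ns();
kvm_vgic_v4_enable_doorbell(vcpu);
t1 = ktime_get_ns();
trace_printk("enable_doorbell: %llu ns\n", t1 - t0);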

netperf result:
server: D05 VM using an 82599 VF
client: intel 8180
kernel: 5.0.0-rc3, with the patch "arm64: KVM: Always set ICH_HCR_EL2.EN
if GICv4 is enabled" merged

with change:
packet 512 bytes - 7150 Mbits/s
packet 64 bytes - 1080 Mbits/s
without change:
packet 512 bytes - 7050 Mbits/s
packet 64 bytes - 1080 Mbits/s

It seems not to help on D05, as enabling the doorbell costs more there
than it does on D06.

Thanks,
Tang