I think the comparison is not entirely fair. You're using
KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that
(on Intel) to only one register read:
	nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
Whereas in a real hypercall for (say) PIO you would need the address,
size, direction and data.
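
To make the extra work concrete, here is a rough sketch of what decoding
such a hypercall could look like. The register layout (port in RBX, size
and direction in RCX, data in RDX) and the names pio_hcall and
decode_pio_hcall are made up for illustration; this is not an existing
KVM ABI.

```c
#include <stdint.h>

/* Hypothetical register indices; RAX would still carry the hypercall nr. */
enum { REG_RAX, REG_RBX, REG_RCX, REG_RDX, REG_COUNT };

/* Hypothetical decoded PIO request. */
struct pio_hcall {
	uint16_t port;	/* I/O port address */
	uint8_t size;	/* access width in bytes: 1, 2 or 4 */
	int in;		/* 1 = read from port, 0 = write to port */
	uint32_t data;	/* value to write (ignored for reads) */
};

/*
 * Decode a PIO request from three argument registers, on top of the
 * RAX read that the "null" hypercall already pays for.
 * Returns 1 if the access size is valid, 0 otherwise.
 */
static int decode_pio_hcall(const uint64_t regs[REG_COUNT],
			    struct pio_hcall *req)
{
	req->port = (uint16_t)regs[REG_RBX];
	req->size = (uint8_t)(regs[REG_RCX] & 0xff);
	req->in = (int)((regs[REG_RCX] >> 8) & 1);
	req->data = (uint32_t)regs[REG_RDX];

	return req->size == 1 || req->size == 2 || req->size == 4;
}
```

So a PIO hypercall would need several register reads plus validation,
not the single read measured for the null hypercall.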
Also for PIO/MMIO you're adding this unoptimized lookup to the measurement:
	pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
	if (pio_dev) {
		kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
		complete_pio(vcpu);
		return 1;
	}
Whereas for the hypercall measurement you don't. I believe a fair comparison
would be to have a shared guest/host memory area where you store guest/host
TSC values and then do, on the guest:
	rdtscll(shared_area->guest_tsc);
	pio/mmio/hypercall
	... back to host
	rdtscll(shared_area->host_tsc);
And then calculate the difference (minus the guest's TSC_OFFSET, of course)?
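
The calculation I have in mind is roughly the following. This is only a
sketch: the struct layout and the names tsc_shared_area and exit_cycles
are hypothetical, and it assumes the usual relation guest TSC = host TSC
+ TSC_OFFSET, with the offset held as a 64-bit value so that a negative
offset simply wraps.

```c
#include <stdint.h>

/* Hypothetical layout of the shared guest/host memory area. */
struct tsc_shared_area {
	uint64_t guest_tsc;	/* written by the guest just before the exit */
	uint64_t host_tsc;	/* written by the host when it handles the exit */
};

/*
 * Cycles elapsed between the guest-side and host-side TSC samples.
 * Assumes guest_tsc = host_tsc + tsc_offset, so subtracting the offset
 * converts the guest sample to the host time base. Unsigned 64-bit
 * arithmetic wraps modulo 2^64, which handles negative offsets.
 */
static uint64_t exit_cycles(const struct tsc_shared_area *area,
			    uint64_t tsc_offset)
{
	uint64_t guest_in_host_time = area->guest_tsc - tsc_offset;

	return area->host_tsc - guest_in_host_time;
}
```

Running the same calculation for PIO, MMIO and hypercall exits would put
all three on the same footing, since each sample covers only the
transition itself.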