Re: [PATCH 0/5] KVM: Put struct kvm_vcpu on a diet

From: Mathias Krause
Date: Tue Feb 14 2023 - 07:19:59 EST


On 13.02.23 18:05, Sean Christopherson wrote:
> On Mon, Feb 13, 2023, Mathias Krause wrote:
>> Relayout members of struct kvm_vcpu and embedded structs to reduce its
>> memory footprint. Not that it makes sense from a memory usage point of
>> view (given how few of such objects get allocated), but this series
>> achieves to make it consume two cachelines less, which should provide a
>> micro-architectural net win. However, I wasn't able to see a noticeable
>> difference running benchmarks within a guest VM -- the VMEXIT costs are
>> likely still high enough to mask any gains.
>
> ...
>
>> Below is the high level pahole(1) diff. Most significant is the overall
>> size change from 6688 to 6560 bytes, i.e. -128 bytes.
>
> While part of me wishes KVM were more careful about struct layouts, IMO fiddling
> with per vCPU or per VM structures isn't worth the ongoing maintenance cost.
>
> Unless the size of the vCPU allocation (vcpu_vmx or vcpu_svm in x86 land) crosses
> a meaningful boundary, e.g. drops the size from an order-3 to order-2 allocation,
> the memory savings are negligible in the grand scheme. Assuming the kernel is
> even capable of perfectly packing vCPU allocations, saving even a few hundred bytes
> per vCPU is uninteresting unless the vCPU count gets reaaally high, and at that
> point the host likely has hundreds of GiB of memory, i.e. saving a few KiB is again
> uninteresting.

Fully agree! That's why I said, this change makes no sense from a memory
usage point of view. The overall memory savings are not visible at all,
recognizing that the slab allocator isn't able to put more vCPU objects
in a given slab page. However, I still remain confident that this makes
sense from a uarch point of view. Touching less cache lines should be a
win -- even if I'm unable to measure it. By preserving more cachelines
during a VMEXIT, guests should be able to resume their work faster
(assuming they still need these cachelines).

> And as you observed, imperfect struct layouts are highly unlikely to have a
> measurable impact on performance. The types of operations that are involved in
> a world switch are just too costly for the layout to matter much. I do like to
> shave cycles in the VM-Enter/VM-Exit paths, but only when a change is inarguably
> more performant, doesn't require ongoing mainteance, and/or also improves the code
> quality.

Any pointers to measure the "more performant" aspect? I tried to make
use of the vmx_vmcs_shadow_test in kvm-unit-tests, as it's already
counting cycles, but the numbers are too unstable, even if I pin the
test to a given CPU, disable turbo mode, SMT, use the performance cpu
governor, etc.

> I am in favor in cleaning up kvm_mmu_memory_cache as there's no reason to carry
> a sub-optimal layouy and the change is arguably warranted even without the change
> in size. Ditto for kvm_pmu, logically I think it makes sense to have the version
> at the very top.

Yeah, was exactly thinking the same when modifying kvm_pmu.

> But I dislike using bitfields instead of bools in kvm_queued_exception, and shuffling
> fields in kvm_vcpu, kvm_vcpu_arch, vcpu_vmx, vcpu_svm, etc. unless there's a truly
> egregious field(s) just isn't worth the cost in the long term.

Heh, just found this gem in vcpu_vmx:

struct vcpu_vmx {
[...]
union vmx_exit_reason exit_reason;

/* XXX 44 bytes hole, try to pack */

/* --- cacheline 123 boundary (7872 bytes) --- */
struct pi_desc pi_desc __attribute__((__aligned__(64)));
[...]

So there are, in fact, some bigger holes left.

Would be nice if pahole had a --density flag that would output some
ASCII art, visualizing which bytes of a struct are allocated by real
members and which ones are pure padding.