Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates

From: Emanuele Giuseppe Esposito
Date: Thu Oct 13 2022 - 07:12:30 EST




Am 13/10/2022 um 10:44 schrieb David Hildenbrand:
> On 13.10.22 09:43, Emanuele Giuseppe Esposito wrote:
>>
>>
>> Am 29/09/2022 um 23:39 schrieb Sean Christopherson:
>>> If we really want to provide a better experience for userspace, why
>>> not provide
>>> more primitives to address those specific use cases?  E.g. fix KVM's
>>> historic wart
>>> of disallowing toggling of KVM_MEM_READONLY, and then add one or more
>>> ioctls to:
>>>
>>>    - Merge 2+ memory regions that are contiguous in GPA and HVA
>>>    - Split a memory region into 2+ memory regions
>>>    - Truncate a memory region
>>>    - Grow a memory region
>>
>> I looked again at the code and specifically the use case that triggers
>> the crash in bugzilla. I *think* (correct me if I am wrong), that the
>> only operation that QEMU performs right now is "grow/shrink".
>
> I remember that there were BUG reports where we'd actually split and run
> into that problem. Just don't have them at hand. I think they happened
> during early boot when the OS re-configured some PCI thingies.

If you could point me where this is happening, it would be nice. So far
I could not find or see any split/merge operation.

>
>>
>> So *if* we want to go this way, we could start with a simple grow/shrink
>> API.
>>
>> Even though we need to consider that this could bring additional
>> complexity in QEMU. Currently, DELETE+CREATE (grow/shrink) is not
>> performed one after the other (in my innocent fantasy I was expecting to
>> find 2 subsequent ioctls in the code), but in QEMU's
>> address_space_set_flatview(), which seems to first remove all regions
>> and then add them when changing flatviews.
>>
>> address_space_update_topology_pass(as, old_view2, new_view,
>> adding=false);
>> address_space_update_topology_pass(as, old_view2, new_view, adding=true);
>>
>> I don't think we can change this, as other listeners also rely on such
>> ordering, but we can still batch all callback requests like I currently
>> do and process them in kvm_commit(), figuring there which operation is
>> which.
>>
>> In other words, we would have something like:
>>
>> region_del() --> DELETE memslot X -> add it to the queue of operations
>> region_del() --> DELETE memslot Y -> add it to the queue of operations
>> region_add() --> CREATE memslot X (size doubled) -> add it to the queue
>> of operations
>> region_add() --> CREATE memslot Y (size halved) -> add it to the queue
>> of operations
>> ...
>> commit() --> scan QUEUE and figure what to do -> GROW X (+size), SHRINK
>> Y (-size) -> issue 2 ioctls, GROW and SHRINK.
>
> I communicated resizes (region_resize()) to the notifier in patch #3 of
> https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@xxxxxxxxxx/
>
> You could use that independently of the remainder. It should be less
> controversial ;)
>
> But I think it only tackles part of the more generic problem (split/merge).

Yes, very useful! Using that patch we would have a single place where to
issue grow/shrink ioctls. I don't think we need to inhibit ioctls, since
the operation will be atomic (this time in the true meaning, since we
don't need INVALIDATE.

>
>>
>>> That would probably require more KVM code overall, but each operation
>>> would be more
>>> tightly bounded and thus simpler to define.  And I think more precise
>>> APIs would
>>> provide other benefits, e.g. growing a region wouldn't require first
>>> deleting the
>>> current region, and so could avoid zapping SPTEs and destroying
>>> metadata.  Merge,
>>> split, and truncate would have similar benefits, though they might be
>>> more
>>> difficult to realize in practice.
>>
>> So essentially grow would not require INVALIDATE. Makes sense, but would
>> it work also with shrink? I guess so, as the memslot is still present
>> (but shrinked) right?
>>
>> Paolo, would you be ok with this smaller API? Probably just starting
>> with grow and shrink first.
>>
>> I am not against any of the two approaches:
>> - my approach has the disadvantage that the list could be arbitrarily
>> long, and it is difficult to rollback the intermediate changes if
>> something breaks during the request processing (but could be simplified
>> by making kvm exit or crash).
>>
>> - Sean approach could potentially provide more burden to the userspace,
>> as we need to figure which operation is which. Also from my
>> understanding split and merge won't be really straightforward to
>> implement, especially in userspace.
>>
>> David, any concern from userspace prospective on this "CISC" approach?
>
> In contrast to resizes in QEMU that only affect a single memory
> region/slot, splitting/merging is harder to factor out and communicate
> to a notifier. As an alternative, we could handle it in the commit stage
> in the notifier itself, similar to what my prototype does, and figure
> out what needs to be done in there and how to call the proper KVM
> interface (and which interface to call).
>
> With virtio-mem (in the future) we might see merges of 2 slots into a
> single one, by closing a gap in-between them. In "theory" we could
> combine some updates into a single transaction. But it won't be 100s ...
>
> I think I'd prefer an API that doesn't require excessive ad-hoc
> extensions later once something in QEMU changes.
>
>
> I think in essence, all we care about is performing atomic changes that
> *have to be atomic*, because something we add during a transaction
> overlaps with something we remove during a transaction. Not necessarily
> all updates in a transaction!
>
> My prototype essentially detects that scenario, but doesn't call new KVM
> interfaces to deal with these.

With "prototype" I assume you mean the patch linked above
(region_resize), not the userspace-only proposal you sent initially right?

>
> I assume once we take that into consideration, we can mostly assume that
> any such atomic updates (only of the regions that really have to be part
> of an atomic update) won't involve more than a handful of memory
> regions. We could add a sane KVM API limit.
>
> And if we run into that API limit in QEMU, we can print a warning and do
> it non-atomically.
>
If we implement single operations (split/merge/grow/shrink), we don't
even need that limit. Except for merge, maybe.

Ok, if it'ok for you all I can try to use David patch and implement some
simple grow/shrink. Then we need to figure where and when exactly QEMU
performs split and merge operations, and maybe implement something
similar to what you did in your proposal?

Thank you,
Emanuele