Re: [PATCH v2 00/23] KVM: arm64: Improvements to LPI injection

From: Oliver Upton
Date: Tue Feb 13 2024 - 15:12:40 EST


On Tue, Feb 13, 2024 at 09:32:37AM +0000, Oliver Upton wrote:

[...]

> Clearly the RCU synchronization is a bounding issue in this case. I
> think other scenarios where the cache is overcommitted (16 vCPUs, 16
> devices, 17 events / device) are able to hide effects somewhat, as other
> threads can make forward progress while others are stuck waiting on RCU.
>
> A few ideas on next steps:
>
> 1) Rework the lpi_list_lock as an rwlock. This would obviate the need
> for RCU protection in the LPI cache as well as memory allocations on
> the injection path. This is actually what I had in the internal
> version of the series, although it was very incomplete.
>
> I'd expect this to nullify the improvement on the
> slightly-overcommitted case and 'fix' the pathological case.
>
> 2) call_rcu() and move on. This feels somewhat abusive of the API, as
> the guest can flood the host with RCU callbacks, but I wasn't able
> to make my machine fall over in any mean configuration of the test.
>
> I haven't studied the degree to which such a malicious VM could
> adversely affect neighboring workloads.
>
> 3) Redo the whole ITS representation with xarrays and allow RCU readers
> outside of the ITS lock. I haven't fully thought this out, and if we
> pursue this option then we will need a secondary data structure to
> track where ITSes have been placed in guest memory to avoid taking
> the SRCU lock. We can then stick RCU synchronization in ITS command
> processing, which feels right to me, and dump the translation cache
> altogether.
>
> I'd expect slightly worse average case performance in favor of more
> consistent performance.

Marc and I had an off-list conversation about this and agreed on option
4!

It is somewhat similar in spirit to (3), in that KVM will maintain an
xarray translation cache per ITS, indexed by (device_id, event_id). This
will be a perfect cache that can fit the entire range addressed by the
ITS. The memory overheads of the xarray are not anticipated to be
consequential, as the ITS memory footprint already scales linearly with
the number of entries in the ITS.
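
As a rough illustration of the indexing only (not the actual patch), the
(device_id, event_id) pair could be packed into a single index of the kind
xa_load() / xa_store() take. The 16-bit widths and every sketch_* name below
are assumptions made for the example:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed ID widths, for illustration only. */
#define SKETCH_DEVICE_ID_BITS	16
#define SKETCH_EVENT_ID_BITS	16

/*
 * Pack (device_id, event_id) into one index suitable for an
 * xarray-style lookup. Returns false if either ID is out of the
 * assumed range, in which case *key is left untouched.
 */
static bool sketch_cache_key(uint32_t device_id, uint32_t event_id,
			     unsigned long *key)
{
	if (device_id >= (1U << SKETCH_DEVICE_ID_BITS) ||
	    event_id >= (1U << SKETCH_EVENT_ID_BITS))
		return false;

	*key = ((unsigned long)device_id << SKETCH_EVENT_ID_BITS) | event_id;
	return true;
}
```

With distinct in-range pairs always mapping to distinct keys, there is no
aliasing and hence no eviction, which is what would make the cache 'perfect'.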

Separately, the DB -> ITS translation will be resolved by walking the
ITSes present in the VM.
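
A minimal sketch of what such a walk might look like, assuming the doorbell
address is matched against each ITS's GITS_TRANSLATER register (offset 0x10040
from the ITS base). The struct layout and sketch_* names are hypothetical, and
locking is left out entirely:

```c
#include <stddef.h>
#include <stdint.h>

/* GITS_TRANSLATER offset from the ITS base address. */
#define SKETCH_GITS_TRANSLATER	0x10040ULL

/* Hypothetical stand-in for a per-VM list of ITSes. */
struct sketch_its {
	uint64_t base_gpa;		/* guest physical base of this ITS */
	struct sketch_its *next;
};

/* Walk the ITSes and return the one whose doorbell matches, or NULL. */
static struct sketch_its *sketch_find_its(struct sketch_its *head,
					  uint64_t db)
{
	struct sketch_its *its;

	for (its = head; its; its = its->next) {
		if (its->base_gpa + SKETCH_GITS_TRANSLATER == db)
			return its;
	}

	return NULL;
}
```

The walk is linear in the number of ITSes, which should be cheap as long as
a VM only has a handful of them.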

The existing invalidation calls will be scoped to a single ITS, except in
the case where the guest disables LPIs on a redistributor.

--
Thanks,
Oliver