Re: [PATCH 0/2] iommu/iova: Make the rcache depot properly flexible
From: John Garry
Date: Mon Aug 21 2023 - 07:35:51 EST
On 16/08/2023 16:10, Robin Murphy wrote:
On 15/08/2023 2:35 pm, John Garry wrote:
On 15/08/2023 12:11, Robin Murphy wrote:
This threshold is the number of online CPUs, right?
Yes, that's nominally half of the current fixed size (based on all
the performance figures from the original series seemingly coming
from a 16-thread machine),
If you are talking about
https://lore.kernel.org/linux-iommu/20230811130246.42719-1-zhangzekun11@xxxxxxxxxx/,
No, I mean the *original* rcache patch submission, and its associated
paper:
https://lore.kernel.org/linux-iommu/cover.1461135861.git.mad@xxxxxxxxxxxxxxxxx/
oh, that one :)
then I think it's a 256-CPU system and the DMA controller has 16 HW
queues. The 16 HW queues are relevant as each per-completion-queue
interrupt handler runs on a fixed CPU from the set of 16 CPUs in that
HW queue's interrupt affinity mask. What this means is that while any
CPU may alloc an IOVA, only the 16 CPUs handling the HW queue
interrupts will be freeing IOVAs.
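To make that alloc/free asymmetry concrete, here is a rough user-space
toy of my own (not the kernel code - the CPU counts and magazine size
are just the figures above), showing how full magazines pile up on the
freeing side and would end up in the shared depot:

#include <stdio.h>

#define NR_CPUS         256
#define NR_FREE_CPUS    16      /* one fixed IRQ-handling CPU per HW queue */
#define MAG_SIZE        128     /* IOVAs per magazine */

struct toy_cpu_rcache {
        int loaded;             /* IOVAs cached in this CPU's magazine */
};

int main(void)
{
        struct toy_cpu_rcache cpu[NR_CPUS] = {{ 0 }};
        long depot_mags = 0;
        long i;

        /* 1M alloc/free pairs: allocs spread over all 256 CPUs,
         * frees confined to the 16 IRQ-handling CPUs.
         */
        for (i = 0; i < 1000000; i++) {
                int alloc_cpu = i % NR_CPUS;
                int free_cpu = i % NR_FREE_CPUS;

                /* alloc side: reuse a cached IOVA if this CPU has one */
                if (cpu[alloc_cpu].loaded > 0)
                        cpu[alloc_cpu].loaded--;

                /* free side: cache locally; a full magazine stands in
                 * for what the real rcache code would push to the depot
                 */
                if (++cpu[free_cpu].loaded == MAG_SIZE) {
                        cpu[free_cpu].loaded = 0;
                        depot_mags++;
                }
        }

        printf("magazines pushed towards the depot: %ld\n", depot_mags);
        return 0;
}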
but seemed like a fair compromise. I am of course keen to see how
real-world testing actually pans out.
It's enough of a challenge to get my 4-core dev board with spinning
disk and gigabit ethernet to push anything into a depot at all 😄
I have to admit that I was hoping to also see a more aggressive
reclaim strategy, where we also trim the per-CPU rcaches when not in
use. Leizhen proposed something like this a long time ago.
Don't think I haven't been having various elaborate ideas for making
it cleverer with multiple thresholds and self-tuning, however I have
managed to restrain myself 😉
OK, understood. My main issue WRT scalability is that the total number
of cacheable IOVAs (CPU and depot rcaches) scales up with the number of
CPUs, while many DMA controllers have a fixed limit on max in-flight
requests.
Consider a SCSI storage controller on a 256-CPU system. The in-flight
limit for this example controller is 4096, which would typically never
even be fully used up, or may not even be usable.
For this device, we need 4096 * 6 [IOVA rcache ranges] = ~24K cached
IOVAs if we were to pre-allocate them all - obviously I am ignoring
that we have the per-CPU rcaches for speed and that it would not make
sense to share one set. However, according to the current IOVA driver,
we can in theory cache up to ((256 [CPUs] * 2 [loaded + prev]) + 32
[depot size]) * 6 [rcache ranges] * 128 [IOVAs per mag] = ~420K IOVAs.
That's ~17x what we would ever need.
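Spelling that arithmetic out, just so the figures are easy to check
(the constants below simply mirror the numbers above, nothing more is
implied about the driver):

#include <stdio.h>

#define NR_CPUS         256
#define RCACHE_RANGES   6       /* IOVA rcache ranges */
#define MAGS_PER_CPU    2       /* loaded + prev */
#define DEPOT_MAGS      32      /* current fixed depot size */
#define IOVAS_PER_MAG   128
#define QUEUE_DEPTH     4096    /* example SCSI controller in-flight limit */

int main(void)
{
        long needed = (long)QUEUE_DEPTH * RCACHE_RANGES;
        long cacheable = ((long)NR_CPUS * MAGS_PER_CPU + DEPOT_MAGS) *
                         RCACHE_RANGES * IOVAS_PER_MAG;

        printf("worst-case need  : %ld IOVAs (~24K)\n", needed);
        printf("theoretical cache: %ld IOVAs (~420K, ~%ldx the need)\n",
               cacheable, cacheable / needed);
        return 0;
}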
Something like NVMe is different, as its total requests can scale up
with the CPU count, but only to a limit. I am not sure about network
controllers.
Remember that this threshold only represents a point at which we
consider the cache to have grown "big enough" to start background
reclaim - over the short term it is neither an upper nor a lower limit
on the cache capacity itself. Indeed it will be larger than the working
set of some workloads, but then it still wants to be enough of a buffer
to be useful for others which do make big bursts of allocations only
periodically.
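Right, and just to make sure I'm reading that behaviour correctly,
here's a minimal toy model of my understanding (the threshold value is
only a stand-in for num_online_cpus(), and this is not the patch code):
the depot is free to overshoot during a burst, and only the deferred
pass brings it back down afterwards.

#include <stdio.h>

#define DEPOT_THRESHOLD 16      /* stand-in for num_online_cpus() */

static int depot_size;          /* magazines currently held in the depot */

static void depot_push(void)
{
        /* a burst of frees may push well past the threshold */
        depot_size++;
}

static void background_reclaim(void)
{
        /* the deferred pass trims the excess back to the threshold */
        while (depot_size > DEPOT_THRESHOLD)
                depot_size--;
}

int main(void)
{
        int i;

        for (i = 0; i < 100; i++)       /* burst of frees */
                depot_push();
        printf("after burst  : %d magazines\n", depot_size);

        background_reclaim();
        printf("after reclaim: %d magazines\n", depot_size);
        return 0;
}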
It would be interesting to see what zhangzekun finds for this series. He
was testing on a 5.10-based kernel - things have changed a lot since
then and I am not really sure what the problem could have been there.
Anyway, this is just something which I think should be considered -
which I guess already has been.
Indeed, I would tend to assume that machines with hundreds of CPUs are
less likely to be constrained on overall memory and/or IOVA space.
Cheers,
John