Re: [PATCH v3 1/3] arm64: tlb: Fix TLBI RANGE operand
From: Marc Zyngier
Date: Wed Apr 10 2024 - 04:46:44 EST
On Mon, 08 Apr 2024 09:29:31 +0100,
Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> On 05/04/2024 04:58, Gavin Shan wrote:
> > KVM/arm64 relies on TLBI RANGE feature to flush TLBs when the dirty
> > pages are collected by VMM and the page table entries become write
> > protected during live migration. Unfortunately, the operand passed
> > to the TLBI RANGE instruction isn't correctly sorted out due to the
> > commit 117940aa6e5f ("KVM: arm64: Define kvm_tlb_flush_vmid_range()").
> > It leads to crash on the destination VM after live migration because
> > TLBs aren't flushed completely and some of the dirty pages are missed.
> >
> > For example, I have a VM where 8GB memory is assigned, starting from
> > 0x40000000 (1GB). Note that the host has 4KB as the base page size.
> > In the middle of migration, kvm_tlb_flush_vmid_range() is executed
> > to flush TLBs. It passes MAX_TLBI_RANGE_PAGES as the argument to
> > __kvm_tlb_flush_vmid_range() and __flush_s2_tlb_range_op(). SCALE#3
> > and NUM#31, corresponding to MAX_TLBI_RANGE_PAGES, aren't supported
> > by __TLBI_RANGE_NUM(). In this specific case, -1 has been returned
> > from __TLBI_RANGE_NUM() for SCALE#3/2/1/0 and rejected by the loop
> > in the __flush_tlb_range_op() until the variable @scale underflows
> > and becomes -9, 0xffff708000040000 is set as the operand. The operand
> > is wrong since it's sorted out by __TLBI_VADDR_RANGE() according to
> > invalid @scale and @num.
> >
> > Fix it by extending __TLBI_RANGE_NUM() to support the combination of
> > SCALE#3 and NUM#31. With the changes, [-1 31] instead of [-1 30] can
> > be returned from the macro, meaning the TLBs for 0x200000 pages in the
> > above example can be flushed in one shot with SCALE#3 and NUM#31. The
> > macro TLBI_RANGE_MASK is dropped since no one uses it any more. The
> > comments are also adjusted accordingly.
>
> Perhaps I'm being overly pedantic, but I don't think the bug is
> __TLBI_RANGE_NUM() not being able to return 31; It is clearly documented that it
> can only return in the range [-1, 30] and a maximum of (MAX_TLBI_RANGE_PAGES -
> 1) pages are supported.
I guess "clearly" is pretty relative. I find it misleading that we
don't support the full range of what the architecture offers and have
these odd limitations.
> The bug is in the kvm caller, which tries to call __flush_tlb_range_op() with
> MAX_TLBI_RANGE_PAGES; clearly out-of-bounds.
Nobody disputes that point, hence the Fixes: tag pointing to the KVM
patch. But there are two ways to fix it: either reduce the amount KVM
can use for range invalidation, or fix the helper to allow the full
range offered by the architecture.
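
To make the NUM limitation concrete, here is a minimal userspace sketch
of the two computations. The function names are hypothetical, and the
formulas are assumptions modelled on the arm64 TLBI range macros rather
than a copy of the kernel code. It only covers SCALE#3 down to SCALE#0
and does not model the @scale underflow described above.

```c
/*
 * Pages covered by one range TLBI for a given NUM/SCALE pair,
 * assumed to be (NUM + 1) << (5 * SCALE + 1).
 */
static unsigned long range_pages(int num, int scale)
{
	return (unsigned long)(num + 1) << (5 * scale + 1);
}

/*
 * Model of the masked computation: a 5-bit mask is applied before
 * subtracting one, so the result is confined to [-1, 30].
 */
static int range_num_masked(unsigned long pages, int scale)
{
	return (int)((pages >> (5 * scale + 1)) & 0x1f) - 1;
}

/*
 * Model of the extended computation: clamp to the largest range the
 * given SCALE can describe instead of masking, so the result is
 * confined to [-1, 31].
 */
static int range_num_clamped(unsigned long pages, int scale)
{
	unsigned long max = range_pages(31, scale);

	if (pages > max)
		pages = max;

	return (int)(pages >> (5 * scale + 1)) - 1;
}
```

With 0x200000 pages (8GB of 4KB pages), the masked variant yields -1
for every SCALE from 3 down to 0, whereas the clamped variant yields 31
at SCALE#3, covering the whole range with a single invalidation.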
> So personally, I would prefer to fix the bug first. Then separately
> enhance the infrastructure to support NUM=31.
I don't think this buys us much, apart from making it harder for
people to know what they need/want/randomly-elect to backport.
> > Fixes: 117940aa6e5f ("KVM: arm64: Define kvm_tlb_flush_vmid_range()")
>
> I would argue that the bug was actually introduced by commit 360839027a6e
> ("arm64: tlb: Refactor the core flush algorithm of __flush_tlb_range"), which
> separated the tlbi loop from the range size validation in __flush_tlb_range().
> Before this, all calls would have to go through __flush_tlb_range() and
> therefore anything bigger than (MAX_TLBI_RANGE_PAGES - 1) pages would cause the
> whole mm to be flushed. Although I get that bisect will lead to this one, so
> that's probably the right one to highlight.
I haven't tried to bisect it, only picked this as the obvious
culprit.
To your point, using __flush_tlb_range() made little sense for KVM --
what would be the vma here? Splitting the helper out was, I think, the
correct decision. But we of course lost sight of the __TLBI_RANGE_NUM
limitation in the process.
> I get why it was split, but perhaps it should have been split at a higher level;
> If tlbi range is not supported, then KVM will flush the whole vmid. Would it be
> better for KVM to follow the same pattern as __flush_tlb_range_nosync() and
> issue per-block tlbis up to a max of MAX_DVM_OPS before falling back to the whole
> vmid? And if tlbi range is supported, KVM uses it regardless of the size of the
> range, whereas __flush_tlb_range_nosync() falls back to flush_tlb_mm() at a
> certain size. It's not clear why this divergence is useful?
That'd be a definite improvement indeed, and would bring back some
much needed consistency.
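
As a rough illustration of what that consistency could look like, the
decision might be sketched as below. This is not existing KVM code: the
enum, the function name and both threshold values are assumptions made
for the example (512 for MAX_DVM_OPS assumes 4KB pages, and the range
limit is treated as inclusive, i.e. with NUM#31 supported).

```c
#include <stdbool.h>

/* Hypothetical outcomes of the flush decision. */
enum flush_plan {
	FLUSH_PER_BLOCK,	/* one TLBI per block */
	FLUSH_BY_RANGE,		/* range TLBIs */
	FLUSH_WHOLE_VMID,	/* give up and flush the whole VMID */
};

/* Assumed limits, for illustration only. */
#define ASSUMED_MAX_DVM_OPS		512UL
#define ASSUMED_MAX_TLBI_RANGE_PAGES	0x200000UL

/*
 * Pick a flush strategy:
 *  - without range support, issue per-block TLBIs unless that would
 *    take ASSUMED_MAX_DVM_OPS operations or more;
 *  - with range support, use range TLBIs unless the request exceeds
 *    what the largest SCALE/NUM pair can describe.
 */
static enum flush_plan plan_flush(bool has_tlbi_range,
				  unsigned long blocks,
				  unsigned long pages)
{
	if (!has_tlbi_range)
		return blocks >= ASSUMED_MAX_DVM_OPS ?
		       FLUSH_WHOLE_VMID : FLUSH_PER_BLOCK;

	return pages > ASSUMED_MAX_TLBI_RANGE_PAGES ?
	       FLUSH_WHOLE_VMID : FLUSH_BY_RANGE;
}
```

Where exactly the thresholds should sit is a matter for the follow-up
work; the point is only that both paths have a bounded cost before
falling back to the whole VMID.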
> > Cc: stable@xxxxxxxxxx # v6.6+
> > Reported-by: Yihuang Yu <yihyu@xxxxxxxxxx>
> > Suggested-by: Marc Zyngier <maz@xxxxxxxxxx>
> > Signed-off-by: Gavin Shan <gshan@xxxxxxxxxx>
>
> Anyway, the implementation looks correct, so:
>
> Reviewed-by: Ryan Roberts <ryan.roberts@xxxxxxx>
Thanks for that!
M.
--
Without deviation from the norm, progress is not possible.