Re: [PATCH V3] riscv: asid: Fixup stale TLB entry cause application crash

From: Guo Ren
Date: Mon Feb 27 2023 - 22:12:59 EST


On Tue, Feb 28, 2023 at 6:40 AM Gary Guo <gary@xxxxxxxxxxx> wrote:
>
> On Sat, 25 Feb 2023 23:24:40 -0500
> Guo Ren <guoren@xxxxxxxxxx> wrote:
>
> > On Sat, Feb 25, 2023 at 2:29 PM Sergey Matyukevich <geomatsi@xxxxxxxxx> wrote:
> > >
> > > On Fri, Feb 24, 2023 at 01:57:55AM +0800, Zong Li wrote:
> > > > Lad, Prabhakar <prabhakar.csengg@xxxxxxxxx> wrote on Fri, Dec 23, 2022 at 8:54 PM:
> > > > >
> > > > > Hi Guo,
> > > > >
> > > > > Thank you for the patch.
> > > > >
> > > > > On Fri, Nov 11, 2022 at 8:00 AM <guoren@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > From: Guo Ren <guoren@xxxxxxxxxxxxxxxxx>
> > > > > >
> > > > > > > After use_asid_allocator is enabled, userspace applications can crash
> > > > > > > because of stale TLB entries: calling cpumask_clear_cpu without
> > > > > > > local_flush_tlb_all cannot guarantee that the CPU's TLB entries are
> > > > > > > fresh. set_mm_asid can then cause a userspace application to read a
> > > > > > > stale value through a stale TLB entry; set_mm_noasid is not affected.
> > > > > >
> > > > > > Here is the symptom of the bug:
> > > > > > unhandled signal 11 code 0x1 (coredump)
> > > > > > 0x0000003fd6d22524 <+4>: auipc s0,0x70
> > > > > > 0x0000003fd6d22528 <+8>: ld s0,-148(s0) # 0x3fd6d92490
> > > > > > => 0x0000003fd6d2252c <+12>: ld a5,0(s0)
> > > > > > (gdb) i r s0
> > > > > > s0 0x8082ed1cc3198b21 0x8082ed1cc3198b21
> > > > > > (gdb) x /2x 0x3fd6d92490
> > > > > > 0x3fd6d92490: 0xd80ac8a8 0x0000003f
> > > > > > > The core dump shows that register s0 is wrong, but the value in
> > > > > > > memory is correct: 'ld s0, -148(s0)' used a stale mapping entry
> > > > > > > in the TLB and loaded from an incorrect physical address.
> > > > > >
> > > > > > > When the task ran on CPU0 and loaded (or speculatively loaded) the
> > > > > > > value at address 0x3fd6d92490, the first version of the mapping entry
> > > > > > > was filled into CPU0's TLB by a page table walk.
> > > > > > > When the task switched from CPU0 to CPU1 (no local_tlb_flush_all here
> > > > > > > because of ASID), it happened to write to address 0x3fd6d92490. This
> > > > > > > caused do_page_fault -> wp_page_copy -> ptep_clear_flush ->
> > > > > > > ptep_get_and_clear & flush_tlb_page.
> > > > > > > flush_tlb_page used mm_cpumask(mm) to determine which CPUs need a TLB
> > > > > > > flush, but CPU0 had already cleared its bit in mm_cpumask during the
> > > > > > > previous switch_mm. So only CPU1's TLB was flushed before the second
> > > > > > > version of the PTE mapping was set. When the task switched from CPU1
> > > > > > > back to CPU0, CPU0 still held a stale TLB entry containing the wrong
> > > > > > > target physical address. The bug showed up when the task happened to
> > > > > > > read that value.
> > > > > >
> > > > > > >          CPU0                               CPU1
> > > > > > >  - switch 'task' in
> > > > > > >  - read addr (Fill stale mapping
> > > > > > >    entry into TLB)
> > > > > > >  - switch 'task' out (no tlb_flush)
> > > > > > >                                       - switch 'task' in (no tlb_flush)
> > > > > > >                                       - write addr cause pagefault
> > > > > > >                                         do_page_fault() (change to
> > > > > > >                                         new addr mapping)
> > > > > > >                                           wp_page_copy()
> > > > > > >                                             ptep_clear_flush()
> > > > > > >                                               ptep_get_and_clear()
> > > > > > >                                               & flush_tlb_page()
> > > > > > >                                         write new value into addr
> > > > > > >                                       - switch 'task' out (no tlb_flush)
> > > > > > >  - switch 'task' in (no tlb_flush)
> > > > > > >  - read addr again (Use stale
> > > > > > >    mapping entry in TLB)
> > > > > > >    get wrong value from old physical
> > > > > > >    addr, BUG!
> > > > > >
> > > > > > > The solution is to keep every CPU's footmark in cpumask(mm) across
> > > > > > > switch_mm, which guarantees that a TLB flush invalidates all stale
> > > > > > > TLB entries.
> > > > > >
> > > > > > Fixes: 65d4b9c53017 ("RISC-V: Implement ASID allocator")
> > > > > > Signed-off-by: Guo Ren <guoren@xxxxxxxxxxxxxxxxx>
> > > > > > Signed-off-by: Guo Ren <guoren@xxxxxxxxxx>
> > > > > > Cc: Anup Patel <apatel@xxxxxxxxxxxxxxxx>
> > > > > > Cc: Palmer Dabbelt <palmer@xxxxxxxxxxxx>
> > > > > > ---
> > > > > > Changes in v3:
> > > > > > - Move set/clear cpumask(mm) into set_mm (Make code more pretty
> > > > > > with Andrew's advice)
> > > > > > - Optimize comment description
> > > > > >
> > > > > > Changes in v2:
> > > > > > - Fixup nommu compile problem (Thx Conor, Also Reported-by: kernel
> > > > > > test robot <lkp@xxxxxxxxx>)
> > > > > > - Keep cpumask_clear_cpu for noasid
> > > > > > ---
> > > > > > arch/riscv/mm/context.c | 30 ++++++++++++++++++++----------
> > > > > > 1 file changed, 20 insertions(+), 10 deletions(-)
> > > > > >
> > > > > As reported on the patch [0] I was seeing consistent failures on the
> > > > > RZ/Five SoC while running bonnie++ utility. After applying this patch
> > > > > on top of Palmer's for-next branch (eb67d239f3aa) I am no longer
> > > > > seeing this issue.
> > > > >
> > > > > Tested-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@xxxxxxxxxxxxxx>
> > > > >
> > > > > [0] https://patchwork.kernel.org/project/linux-riscv/patch/20220829205219.283543-1-geomatsi@xxxxxxxxx/
> > > > >
> > > >
> > > > Hi all,
> > > > I hit the same problem (i.e. unhandled signal 11) on our internal
> > > > multi-core system. I tried patches [0] and [1], but they did not
> > > > help, so I guess there are still some potential problems. After applying
> > > > this patch, the problem disappeared. I took some time to look at
> > > > other arches' implementations, such as arc: they don't clear the
> > > > mm_cpumask because of a similar issue. I can't say which approach might
> > > > be better, but I'd like to point out that this patch works for me.
> > > > Thanks.
> > > >
> > > > Tested-by: Zong Li <zong.li@xxxxxxxxxx>
> > > >
> > > > [0] https://lore.kernel.org/linux-riscv/20220829205219.283543-1-geomatsi@xxxxxxxxx/
> > > > [1] https://lore.kernel.org/linux-riscv/20230129211818.686557-1-geomatsi@xxxxxxxxx/
> > >
> > > Thanks for the report! By the way, could you please share some
> > > information about the reproducing workload?
> > >
> > > The initial idea was to reduce the number of TLB flushes by deferring (and
> > > possibly avoiding) some of them. But we already have bug reports from
> > > two different vendors, so apparently something was overlooked here.
> > > Let's switch to the 'aggregating' mm_cpumask approach suggested by Guo Ren.
> > >
> > > @Guo Ren, do you mind if I re-send your v3 patch together with the
> > > remaining reverts of my changes?
> > Okay, thanks for taking care of it. Let's get it working first and then
> > improve it.
> >
> > Actually, the current riscv ASID code comes from arm64, which has hardware
> > broadcast requirements. Maybe we need to consider the x86 per-cpu ASID
> > pool approach.
>
> It should be noted that the spec expects supervisor software to
> use a consistent meaning of non-zero ASIDs across different harts.
>
> Also, a vendor could implement efficient hardware broadcasting ASID
> invalidation with a custom instruction and expose it via SBI.
I agree with you. In fact, our XuanTie cores support hardware-broadcast
TLB invalidation, and we expect to expose it in the SBI style.

The x86 style would be another option for the future, so Linux riscv
would end up combining different TLB maintenance styles in one
architecture.
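
To make the "x86 style" a bit more concrete, here is a minimal userspace
sketch of a per-CPU ASID pool. It is only loosely modeled on the x86 idea
of remembering a (context id, TLB generation) pair for each CPU-local
ASID slot. Every name in it (toy_cpu, toy_mm_ctx, toy_choose_asid,
TOY_NR_ASIDS) is hypothetical, and it is not kernel code. Gary's point
about the spec expecting a consistent meaning of non-zero ASIDs across
harts would still need to be answered for riscv.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_ASIDS 4 /* hypothetical size of the per-CPU ASID pool */

/* Per-mm state (hypothetical): identity plus a page-table change counter. */
struct toy_mm_ctx {
	uint64_t ctx_id;  /* unique per mm, assumed non-zero and never reused */
	uint64_t tlb_gen; /* bumped whenever this mm's page tables change */
};

/* What one CPU remembers about one of its local ASIDs. */
struct toy_asid_slot {
	uint64_t ctx_id;  /* mm that last used this ASID here (0 = none) */
	uint64_t tlb_gen; /* mm generation this CPU's TLB has caught up with */
};

struct toy_cpu {
	struct toy_asid_slot slots[TOY_NR_ASIDS];
	unsigned int next_victim; /* round-robin recycle pointer */
};

/*
 * Pick a CPU-local ASID for 'mm' on switch-in. *need_flush is set when
 * entries tagged with the returned ASID may be stale on this CPU.
 */
unsigned int toy_choose_asid(struct toy_cpu *cpu,
			     const struct toy_mm_ctx *mm, bool *need_flush)
{
	unsigned int asid;

	assert(mm->ctx_id != 0);

	for (asid = 0; asid < TOY_NR_ASIDS; asid++) {
		if (cpu->slots[asid].ctx_id != mm->ctx_id)
			continue;
		/* Seen before: flush only if the mm changed since then. */
		*need_flush = cpu->slots[asid].tlb_gen < mm->tlb_gen;
		cpu->slots[asid].tlb_gen = mm->tlb_gen;
		return asid;
	}

	/* Not seen recently: recycle a slot and always flush it. */
	asid = cpu->next_victim;
	cpu->next_victim = (cpu->next_victim + 1) % TOY_NR_ASIDS;
	cpu->slots[asid].ctx_id = mm->ctx_id;
	cpu->slots[asid].tlb_gen = mm->tlb_gen;
	*need_flush = true;
	return asid;
}
```

In this model a CPU that missed a remote invalidation catches up lazily
on its next switch-in by comparing generations, so correctness does not
depend on mm_cpumask being exact at switch-out time.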

----

Let me outline the current riscv state:

1. Current Linux riscv uses a unified ASID design, which comes from the
arm hw-broadcast one. Some riscv vendors (XuanTie) can also support TLB
hw broadcast.

2. The riscv spec doesn't suggest hw broadcast because it's unsuitable
for large-scale systems. So the x86 style would be another choice for
future Linux riscv.

Correct? Feedback is welcome.
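
For reference, the stale-entry sequence from the commit message quoted
above can be replayed in a small userspace model. This is a minimal
sketch, not the kernel code: it assumes one virtual page, one TLB slot
per CPU and two CPUs, and all names (toy_mm, toy_tlb, toy_replay, the
keep_footmark flag, the PFN values 100 and 200) are made up for
illustration. keep_footmark=false mimics clearing the CPU's bit in
mm_cpumask on switch-out; keep_footmark=true mimics the v3 approach of
leaving the bit set.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 2

/* One cached translation per CPU for the single modeled virtual page. */
struct toy_tlb {
	bool valid;
	int pfn;
};

struct toy_mm {
	unsigned int cpumask; /* bit n set: CPU n may hold entries for this mm */
	int pte_pfn;          /* current translation in the page table */
};

static struct toy_tlb tlb[NR_CPUS];

/* Task starts running on 'cpu'; no TLB flush because ASIDs are in use. */
static void toy_switch_in(struct toy_mm *mm, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	mm->cpumask |= 1u << cpu;
}

/* Task leaves 'cpu'; again no TLB flush. */
static void toy_switch_out(struct toy_mm *mm, int cpu, bool keep_footmark)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (!keep_footmark)
		mm->cpumask &= ~(1u << cpu); /* pre-fix behaviour */
}

/* Access the page: fill the TLB on a miss, return the PFN actually used. */
static int toy_access(const struct toy_mm *mm, int cpu)
{
	if (!tlb[cpu].valid) {
		tlb[cpu].pfn = mm->pte_pfn;
		tlb[cpu].valid = true;
	}
	return tlb[cpu].pfn;
}

/* COW-like remap: update the PTE, then flush only CPUs in the mask. */
static void toy_remap_and_flush(struct toy_mm *mm, int new_pfn)
{
	int cpu;

	mm->pte_pfn = new_pfn;
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (mm->cpumask & (1u << cpu))
			tlb[cpu].valid = false;
}

/* Replay the CPU0 -> CPU1 -> CPU0 sequence; return CPU0's final PFN. */
int toy_replay(bool keep_footmark)
{
	struct toy_mm mm = { .cpumask = 0, .pte_pfn = 100 };
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		tlb[cpu].valid = false;

	toy_switch_in(&mm, 0);
	(void)toy_access(&mm, 0);        /* CPU0 caches the old mapping */
	toy_switch_out(&mm, 0, keep_footmark);

	toy_switch_in(&mm, 1);
	toy_remap_and_flush(&mm, 200);   /* write fault handled on CPU1 */
	(void)toy_access(&mm, 1);
	toy_switch_out(&mm, 1, keep_footmark);

	toy_switch_in(&mm, 0);
	return toy_access(&mm, 0);       /* stale if CPU0 was never flushed */
}
```

In this model toy_replay(false) returns the old PFN on CPU0, which is
the stale read from the crash report, while toy_replay(true) returns the
new one because CPU0 is still in the mask when the flush is issued.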

>
> Best,
> Gary



--
Best Regards
Guo Ren