Re: [PATCH v3] riscv: patch: Fixup lockdep warning in stop_machine

From: Conor Dooley
Date: Tue Jan 31 2023 - 02:51:01 EST


On Tue, Jan 31, 2023 at 03:26:33PM +0800, Guo Ren wrote:
> On Mon, Jan 30, 2023 at 11:10 PM Conor Dooley
> <conor.dooley@xxxxxxxxxxxxx> wrote:
> >
> > Hey Changbin,
> >
> > On Tue, Jan 31, 2023 at 07:26:59AM +0800, Changbin Du wrote:
> > > From: Changbin Du <changbin.du@xxxxxxxxx>
> > >
> > > The task of ftrace_arch_code_modify(_post)_prepare() caller is
> > > stop_machine, whose caller and work thread are of different tasks. The
> > > lockdep checker needs the same task context, or it's wrong. That means
> > > it's a bug here to use lockdep_assert_held because we don't guarantee
> > > the same task context.
> > >
> > > kernel/locking/lockdep.c:
> > > int __lock_is_held(const struct lockdep_map *lock, int read)
> > > {
> > > struct task_struct *curr = current;
> > > int i;
> > >
> > > for (i = 0; i < curr->lockdep_depth; i++) {
> > > ^^^^^^^^^^^^^^^^^^^
> > > struct held_lock *hlock = curr->held_locks + i;
> > > ^^^^^^^^^^^^^^^^
> > > if (match_held_lock(hlock, lock)) {
> > > if (read == -1 || !!hlock->read == read)
> > > return LOCK_STATE_HELD;
> > >
> > > The __lock_is_held depends on current held_locks records; if
> > > stop_machine makes the checker runing on another task, that's wrong.
> > >
> > > Here is the log:
> > > [ 15.761523] ------------[ cut here ]------------
> > > [ 15.762125] WARNING: CPU: 0 PID: 15 at arch/riscv/kernel/patch.c:63 patch_insn_write+0x72/0x364
> > > [ 15.763258] Modules linked in:
> > > [ 15.764154] CPU: 0 PID: 15 Comm: migration/0 Not tainted 6.1.0-rc1-00014-g66924be85884-dirty #377
> > > [ 15.765339] Hardware name: riscv-virtio,qemu (DT)
> > > [ 15.765985] Stopper: multi_cpu_stop+0x0/0x192 <- stop_cpus.constprop.0+0x90/0xe2
> > > [ 15.766711] epc : patch_insn_write+0x72/0x364
> > > [ 15.767011] ra : patch_insn_write+0x70/0x364
> > > [ 15.767276] epc : ffffffff8000721e ra : ffffffff8000721c sp : ff2000000067bca0
> > > [ 15.767622] gp : ffffffff81603f90 tp : ff60000002432a00 t0 : 7300000000000000
> > > [ 15.767919] t1 : 0000000000000000 t2 : 73695f6b636f6c5f s0 : ff2000000067bcf0
> > > [ 15.768238] s1 : 0000000000000008 a0 : 0000000000000000 a1 : 0000000000000000
> > > [ 15.768537] a2 : 0000000000000000 a3 : 0000000000000000 a4 : 0000000000000000
> > > [ 15.768837] a5 : 0000000000000000 a6 : 0000000000000000 a7 : 0000000000000000
> > > [ 15.769139] s2 : ffffffff80009faa s3 : ff2000000067bd10 s4 : ffffffffffffffff
> > > [ 15.769447] s5 : 0000000000000001 s6 : 0000000000000001 s7 : 0000000000000003
> > > [ 15.769740] s8 : 0000000000000002 s9 : 0000000000000004 s10: 0000000000000003
> > > [ 15.770027] s11: 0000000000000002 t3 : 0000000000000000 t4 : ffffffff819af097
> > > [ 15.770323] t5 : ffffffff819af098 t6 : ff2000000067ba28
> > > [ 15.770574] status: 0000000200000100 badaddr: 0000000000000000 cause: 0000000000000003
> > > [ 15.771102] [<ffffffff80007520>] patch_text_nosync+0x10/0x3a
> > > [ 15.771421] [<ffffffff80009c66>] ftrace_update_ftrace_func+0x74/0x10a
> > > [ 15.771704] [<ffffffff800fa17e>] ftrace_modify_all_code+0xb0/0x16c
> > > [ 15.771958] [<ffffffff800fa24c>] __ftrace_modify_code+0x12/0x1c
> > > [ 15.772196] [<ffffffff800e110e>] multi_cpu_stop+0x14a/0x192
> > > [ 15.772454] [<ffffffff800e0a34>] cpu_stopper_thread+0x96/0x14c
> > > [ 15.772699] [<ffffffff8003f4ea>] smpboot_thread_fn+0xf8/0x1cc
> > > [ 15.772945] [<ffffffff8003ac9c>] kthread+0xe2/0xf8
> > > [ 15.773160] [<ffffffff80003e98>] ret_from_exception+0x0/0x14
> > > [ 15.773471] ---[ end trace 0000000000000000 ]---
> >
> > FWIW, you can always crop the [15.192321] stuff out of commit messages,
> > as it just adds noise.
> >
> > > By the way, this also fixes the same issue for patch_text().
> > >
> > > Fixes: 0ff7c3b33127 ("riscv: Use text_mutex instead of patch_lock")
> > > Co-developed-by: Guo Ren <guoren@xxxxxxxxxx>
> > > Signed-off-by: Guo Ren <guoren@xxxxxxxxxx>
> > > Cc: Zong Li <zong.li@xxxxxxxxxx>
> > > Cc: Palmer Dabbelt <palmer@xxxxxxxxxxx>
> > > Signed-off-by: Changbin Du <changbin.du@xxxxxxxxxx>
> > > ---
> > > Changes in v3:
> > > - denote this also fixes function patch_text().
> > >
> > > Changes in v2:
> > > - Rewrite commit log with lockdep explanation [Guo Ren]
> > > - Rebase on v6.1 [Guo Ren]
> > >
> > > v1:
> > > https://lore.kernel.org/linux-riscv/20210417023532.354714-1-changbin.du@xxxxxxxxx/
> > > ---
> > > arch/riscv/kernel/patch.c | 7 -------
> > > 1 file changed, 7 deletions(-)
> > >
> > > diff --git a/arch/riscv/kernel/patch.c b/arch/riscv/kernel/patch.c
> > > index 765004b60513..8619706f8dfd 100644
> > > --- a/arch/riscv/kernel/patch.c
> > > +++ b/arch/riscv/kernel/patch.c
> > > @@ -55,13 +55,6 @@ static int patch_insn_write(void *addr, const void *insn, size_t len)
> > > bool across_pages = (((uintptr_t) addr & ~PAGE_MASK) + len) > PAGE_SIZE;
> > > int ret;
> > >
> > > - /*
> > > - * Before reaching here, it was expected to lock the text_mutex
> > > - * already, so we don't need to give another lock here and could
> > > - * ensure that it was safe between each cores.
> > > - */
> > > - lockdep_assert_held(&text_mutex);
> >
> > I must admit, patches like this do concern me a little, as a someone
> > unfamiliar with the world of probing and tracing.
> > Seeing an explicit check that the lock was held, leads me to believe
> > that the original author (Zong Li I think) thought that the text_mutex
> > lock was insufficient.
> > Do you think that their fear is unfounded? Explaining why it is safe to
> > remove this assertion in the commit message would go a long way towards
> > easing my anxiety!
> >
> > Also, why delete the comment altogether? The comment provides some
> > information that doesn't appear to become invalid, even with the
> > assertion removed?
> Stop_machine separated the mutex context and made a lockdep warning.
> So text_mutex can't be used here. We need to find another check
> solution. I agree with the patch.

Whether or not you agree with the change is not the point (with your SoB
I'd hope you agree with it).
I understand that you two are trying to fix a false positive lockdep
warning, but what I am asking for an explanation as to why the original
author's fear is unfounded.
Surely, having added the assertion, they were not thinking of the same
code path that you guys are hitting the false positive on?

Perhaps Zong themselves can tell us what the original fear was?

Thanks,
Conor.

Attachment: signature.asc
Description: PGP signature