Re: [PATCH] kgdb: Resolve races during kgdb_io_register/unregister_module

From: Daniel Thompson
Date: Tue Jun 30 2020 - 11:05:34 EST


On Mon, Jun 29, 2020 at 02:03:52PM -0700, Doug Anderson wrote:
> Hi,
>
> On Mon, Jun 29, 2020 at 10:15 AM Daniel Thompson
> <daniel.thompson@xxxxxxxxxx> wrote:
> >
> > Currently kgdb_register_callbacks() and kgdb_unregister_callbacks()
> > are called outside the scope of the kgdb_registration_lock. This
> > allows them to race with each other. This could do all sorts of crazy
> > things up to and including dbg_io_ops becoming NULL partway through the
> > execution of the kgdb trap handler (which isn't allowed and would be
> > fatal).
> >
> > Fix this by bringing the trap handler setup and teardown into the scope
> > of the registration lock.
> >
> > Signed-off-by: Daniel Thompson <daniel.thompson@xxxxxxxxxx>
> > ---
> > kernel/debug/debug_core.c | 8 +++++---
> > 1 file changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
> > index 9e5934780f41..9799f2c6dc94 100644
> > --- a/kernel/debug/debug_core.c
> > +++ b/kernel/debug/debug_core.c
> > @@ -1117,9 +1117,8 @@ int kgdb_register_io_module(struct kgdb_io *new_dbg_io_ops)
> >
> > dbg_io_ops = new_dbg_io_ops;
> >
> > - spin_unlock(&kgdb_registration_lock);
> > -
> > if (old_dbg_io_ops) {
> > + spin_unlock(&kgdb_registration_lock);
> > old_dbg_io_ops->deinit();
> > return 0;
> > }
> > @@ -1129,6 +1128,8 @@ int kgdb_register_io_module(struct kgdb_io *new_dbg_io_ops)
> > /* Arm KGDB now. */
> > kgdb_register_callbacks();
> >
> > + spin_unlock(&kgdb_registration_lock);
>
> From looking at code paths, I think this is illegal, isn't it? You're
> now calling kgdb_register_callbacks() while holding a spinlock, but:
>
> kgdb_register_callbacks()
> -> register_console()
> --> console_lock()
> ---> might_sleep()
> ----> <boom!>

Thanks.

I very nearly didn't press "Send" yesterday because I was worried I was
rushing it too much (in order to avoid forgetting it ;-) ). Should have
listened to myself!


> I'm a little curious about the exact race we're trying to solve.
> Calling unregister on an IO module before register even finished seems
> like an error on the caller, so I guess it would be calling register
> from a 2nd thread for a different IO module while the first thread was
> partway through unregistering? Even that seems awfully sketchy since
> you're risking registering a 2nd IO ops while the first is still there
> and that's illegal enough that we do a pr_err() for it (though we
> don't crash), but let's say we're trying to solve that one.

I didn't follow all the possible paths. Utlimately the
(un)register_callbacks() functions use a flag variable without a lock
and that can interact in lots of different ways.

To be honest none are especially likely because the normal case is to
register once during boot and never unregister. However we can trigger
register/unregister from userspace so I think they can happen
in parallel.

Double unregister can lead to some especially nasty schedules...
although they still remain pretty unlikely since we need the double
unregister to coincide with a breakpoint:


kgdb_unregister_callbacks() kgdb_unregister_callbacks()
. .
test flag .
set flag to 0 .
. test flag
. spin_lock()
*** kgdb trap *** .
. paranoid dbg_io_ops check .
. dbg_io_ops = NULL
. stop other CPUs
. try to use NULL dbg_io_ops


I have drawn the kgdb trap in the first column because otherwise things
get too wide but the trap could trigger on any CPU in the system and
provoke the problem.


>
> Looking at it closely, I _think_ the only race in this case is if the
> one we're trying to unregister had a deinit() function and we going to
> replace it? If it didn't have a deinit function:
>
> cpu1 (unregister) cpu2 (register):
> ----------------- ----------------------
> kgdb_unregister_callbacks()
> spin_lock() <got>
> spin_lock() <blocked>
> if (old_dbg_io_ops) <true>
> if (has dinit) <false>
> print error
> spin_unlock()
> return -EBUSY
> <finish unregister>
>
> The above is fine and is the same thing that would happen if the
> whole register function ran before the unregister even started, right?
>
> Also: if the unregister won the race that should also be fine.
>
> So really the problem is this:
>
> cpu1 (unregister) cpu2 (register):
> ----------------- ----------------------
> kgdb_unregister_callbacks()
> spin_lock() <got>
> spin_lock() <blocked>
> if (old_dbg_io_ops) <true>
> if (has dinit) <true>
> print Replacing
> init new IO ops
> spin_unlock()
> if (old_dbg_io_ops) <true>
> finish deinit of old
> return true
> WARN_ON() <hits and shouts!>
> dbg_io_ops = NULL
> spin_unlock()
> if (deinit) <true>
> double-call to deinit of old
>
> So in this case we'll hit a WARN_ON(), incorrectly unregister the new
> IO ops, and call deinit twice.

To be honest I was simply working on "it is racy" and "there's not a
good reason to allow that", especially as we start to develop tools to
bring races to the surfaces someone will yell at us about it sooner or
later ;-).

Of course, implementing it correctly would have been better...


Daniel.