Re: [PATCH] panic: release stale console lock to always get the logbuf printed out

From: Jan Kara
Date: Thu Oct 08 2015 - 05:02:12 EST


On Wed 07-10-15 15:34:08, Andrew Morton wrote:
> (cc Jan)
>
> On Wed, 7 Oct 2015 19:02:22 +0200 Vitaly Kuznetsov <vkuznets@xxxxxxxxxx> wrote:
>
> > In some cases we may end up killing the CPU holding the console lock
> > while still having valuable data in logbuf. E.g. I'm observing the
> > following:
> > - A crash is happening on one CPU and console_unlock() is being called on
> > some other.
> > - console_unlock() tries to print out the buffer before releasing the lock
> > and on slow console it takes time.
> > - in the meanwhile crashing CPU does lots of printk()-s with valuable data
> > (which go to the logbuf) and sends IPIs to all other CPUs.
> > - console_unlock() finishes printing previous chunk and enables interrupts
> > before trying to print out the rest, the CPU catches the IPI and never
> > releases console lock.
>
> Why doesn't the lock-owning CPU release the console lock? Because it
> was stopped by smp_send_stop() in panic()?
>
> I don't recall why we stop CPUs in panic(), and of course we didn't
> document the reason. I guess it makes sense from the "what else can we
> do" point of view, but I wonder if we can just do it later on - that
> would fix this problem?
>
> (dumb aside: why doesn't smp_send_stop() stop the calling CPU?)
>
> > This is not the only possible case: in VT/fb subsystems we have many other
> > console_lock()/console_unlock() users. Non-masked interrupts (or receiving
> > NMI in case of extreme slowness) will have the same result. Getting the
> > whole console buffer printed out on crash should be top priority.
>
> Yes, this is a pretty big hole in the logic.
>
> > --- a/kernel/panic.c
> > +++ b/kernel/panic.c
> > @@ -23,6 +23,7 @@
> > #include <linux/sysrq.h>
> > #include <linux/init.h>
> > #include <linux/nmi.h>
> > +#include <linux/console.h>
> >
> > #define PANIC_TIMER_STEP 100
> > #define PANIC_BLINK_SPD 18
> > @@ -147,6 +148,15 @@ void panic(const char *fmt, ...)
> >
> > bust_spinlocks(0);
> >
> > + /*
> > + * We may have ended up killing the CPU holding the lock and still have
> > + * some valuable data in console buffer. Try to acquire the lock and
> > + * release it regardless of the result. The release will also print the
> > + * buffers out.
> > + */
> > + console_trylock();
> > + console_unlock();
> > +
>
> "killing the CPU" is a bit vague. How's this look?
>
> --- a/kernel/panic.c~panic-release-stale-console-lock-to-always-get-the-logbuf-printed-out-fix
> +++ a/kernel/panic.c
> @@ -149,10 +149,10 @@ void panic(const char *fmt, ...)
> bust_spinlocks(0);
>
> /*
> - * We may have ended up killing the CPU holding the lock and still have
> - * some valuable data in console buffer. Try to acquire the lock and
> - * release it regardless of the result. The release will also print the
> - * buffers out.
> + * We may have ended up stopping the CPU holding the lock (in
> + * smp_send_stop()) while still having some valuable data in the console
> + * buffer. Try to acquire the lock then release it regardless of the
> + * result. The release will also print the buffers out.
> */
> console_trylock();
> console_unlock();
> _
>
>
> Does the console_trylock() guarantee that the console lock is now held?
> If the console_lock-holding CPU is still running then there's a window
> where the above code could enter console_unlock() when nobody's holding
> console_lock. If smp_send_stop() always works (synchronously) then
> that won't happen.

We have this mechanism using zap_locks() in kernel/printk/printk.c when
crash happens on the CPU holding console_sem. Can't we use the same
mechanism for this case? Something like adding:

zap_locks();
console_lock();
console_unlock();

to panic? If we picked up patch "kernel: Avoid softlockups in
stop_machine() during heavy printing" from my series (it's completely
independent, I've attached the latest version), the result would look less
hacky to me (attached). Thoughts?

Warning, the result is untested... I can do some testing and official
posting of the two patches if people agree we want to got down this path.

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR