Re: [PATCH] panic: release stale console lock to always get the logbuf printed out

From: Vitaly Kuznetsov
Date: Thu Oct 08 2015 - 05:51:24 EST


Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> writes:

> (cc Jan)
>
> On Wed, 7 Oct 2015 19:02:22 +0200 Vitaly Kuznetsov <vkuznets@xxxxxxxxxx> wrote:
>
>> In some cases we may end up killing the CPU holding the console lock
>> while still having valuable data in logbuf. E.g. I'm observing the
>> following:
>> - A crash is happening on one CPU and console_unlock() is being called on
>> some other.
>> - console_unlock() tries to print out the buffer before releasing the lock
>> and on slow console it takes time.
>> - in the meanwhile crashing CPU does lots of printk()-s with valuable data
>> (which go to the logbuf) and sends IPIs to all other CPUs.
>> - console_unlock() finishes printing previous chunk and enables interrupts
>> before trying to print out the rest, the CPU catches the IPI and never
>> releases console lock.
>
> Why doesn't the lock-owning CPU release the console lock? Because it
> was stopped by smp_send_stop() in panic()?
>
> I don't recall why we stop CPUs in panic(), and of course we didn't
> document the reason. I guess it makes sense from the "what else can we
> do" point of view, but I wonder if we can just do it later on - that
> would fix this problem?

We don't know for how long should we wait for the other CPU to finish
the output and it can take some time. In case we're rebooting after a
short timeout we can still end up with something in the logbuf.

>
> (dumb aside: why doesn't smp_send_stop() stop the calling CPU?)
>

Because we do send_IPI_allbutself() ? :-)

>> This is not the only possible case: in VT/fb subsystems we have many other
>> console_lock()/console_unlock() users. Non-masked interrupts (or receiving
>> NMI in case of extreme slowness) will have the same result. Getting the
>> whole console buffer printed out on crash should be top priority.
>
> Yes, this is a pretty big hole in the logic.
>
>> --- a/kernel/panic.c
>> +++ b/kernel/panic.c
>> @@ -23,6 +23,7 @@
>> #include <linux/sysrq.h>
>> #include <linux/init.h>
>> #include <linux/nmi.h>
>> +#include <linux/console.h>
>>
>> #define PANIC_TIMER_STEP 100
>> #define PANIC_BLINK_SPD 18
>> @@ -147,6 +148,15 @@ void panic(const char *fmt, ...)
>>
>> bust_spinlocks(0);
>>
>> + /*
>> + * We may have ended up killing the CPU holding the lock and still have
>> + * some valuable data in console buffer. Try to acquire the lock and
>> + * release it regardless of the result. The release will also print the
>> + * buffers out.
>> + */
>> + console_trylock();
>> + console_unlock();
>> +
>
> "killing the CPU" is a bit vague. How's this look?
>
> --- a/kernel/panic.c~panic-release-stale-console-lock-to-always-get-the-logbuf-printed-out-fix
> +++ a/kernel/panic.c
> @@ -149,10 +149,10 @@ void panic(const char *fmt, ...)
> bust_spinlocks(0);
>
> /*
> - * We may have ended up killing the CPU holding the lock and still have
> - * some valuable data in console buffer. Try to acquire the lock and
> - * release it regardless of the result. The release will also print the
> - * buffers out.
> + * We may have ended up stopping the CPU holding the lock (in
> + * smp_send_stop()) while still having some valuable data in the console
> + * buffer. Try to acquire the lock then release it regardless of the
> + * result. The release will also print the buffers out.

Thanks, looks better)

> */
> console_trylock();
> console_unlock();
> _
>
> Does the console_trylock() guarantee that the console lock is now held?
> If the console_lock-holding CPU is still running
> then there's a window
> where the above code could enter console_unlock() when nobody's holding
> console_lock. If smp_send_stop() always works (synchronously) then
> that won't happen.

Well, in smp_send_stop() we send IPIs and if some CPU is still running
we send NMIs... In case someone ignored that I'm not sure what can be
done...

--
Vitaly
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/