Re: [PATCH] panic: fix incomplete panic log in panic()

From: Qing Z
Date: Wed Oct 17 2012 - 06:44:24 EST


2012/10/16 Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>:
> On Mon, 15 Oct 2012 19:38:46 +0800
> Qing Z <njumical@xxxxxxxxx> wrote:
>
>> >> atomic_notifier_call_chain(&panic_notifier_list, 0, buf);
>> >>
>> >> + /*
>> >> + * Unlock the console anyway here, in case it's occupied by another
>> >> + * one which has no chance to unlock the console thus prevents the
>> >> + * panic log prints on the console.
>> >> + */
>> >> + console_unlock();
>> >> +
>> >> bust_spinlocks(0);
>> >>
>> >> if (!panic_blink)
>> >
>> > hm. console_unlock() does a large amount of work, and it seems risky
>> > to do all of that when the system is in a bad-and-getting-worse state.
>> >
>> > Is there some more modest thing we could do here, for example,
>> >
>> > if (console_locked) {
>> > up(&console_sem);
>> > console_locked = 0;
>> > }
>> >
>> > or something along those lines?
>> >
>> > Also, perhaps this operation should be moved into bust_spinlocks().
>> > What would have happened if your code had triggered an oops, rather
>> > than called panic()?
>> >
>> >
>> Hi Andrew,
>> Thanks for your reply!
>> For your question" What would have happened if your code had
>> triggered an oops, rather than called panic()?", actually we found the
>> issue when trigger an oops. When we call FBIOPAN_DISPLAY in
>> ./drivers/video/fbmem.c, it will first lock console, if we trigger an
>> oops before unlock console, the issue happen. It also exist when call
>> panic() directly in the same case. It is a common issue for panic
>> process.
>> I have two options for solution:
>> 1. I agree with your suggestion that add some modest thing in
>> bust_spinlocks(), bust_spinlocks() is supposed to clear spinlocks
>> which prevent oops information from reaching the user. But it didn't
>> clear console_sem. We can add codes that clear console_sem.
>> 1) add up(&console_sem) in bust_spinlocks(0).
>> It will be risky in case that no printk after bust_spinlocks(0) in
>> panic(), because no console_unlock() to print log out.
>> 2) call console_unlock()in bust_spinlocks(0).
>> For bust_spinlocks(0), console_unblank() is used to flush oops to
>> mtdoops console(commit: b61312d353da1871778711040464b10f5cd904df).
>> Logically, if panic without the issue, console_unlock is called after
>> couples of console_lock and console_unlock; if panic with the issue,
>> will it be risky call console_unlock() in console_unblank() after
>> console_lock()?
>> 2. Moreover, there is another option. We can also add protect codes
>> in vprintk(), vprintk() just cover the cases that two cores' log
>> interleave when panic and printk recurse itself. We can add all cases'
>> protection here. Actually the original vprintk() don___t have the issue,
>> but after the patch(commit: fe21773d655c2c64641ec2cef499289ea175c817)
>> which fix two cores' log interleave issue , the issue is not covered.
>> I add a flag after panic_smp_self_stop() in panic(), and check the
>> flag, if flag is set, vprintk will call zap_locks(), I have tested the
>> option, the issue also disappear.
>> What do you think?
>
> The #1 priority is to get the oops message reliably delivered.
>
> That means we should avoid console_unlock() on the oops path: it's far
> too complicated and risks deadlocks, re-oopses, recursion, etc.
>
> If there was text queued in the console layer and that text fails to be
> emitted, well, that's sad, but it's more important that the oops
> message be displayed.
>
> If the oops trace is occasionally interleaved with other text then
> that's sad too, but at least the info we need is readable. Oopses
> inside console_lock() are rare.
>
>
> So I'd suggest that the code in bust_spinlocks(1) should simply do
> whatever needs to be done to make the forthcoming oops trace be
> visible, and leave it at that - don't bother trying to flush out any
> old text.
>
>
> Also, we should be careful with things like up() on a semaphore which
> hasn't been down()ed. Because under some Kconfig combinations, such an
> operation might trigger debugging traces and we could get into a big
> mess. (An up() on non-down()ed semaphore is actually an OK operation,
> so this was a bad example. But you see the problem).
>
Hi Andrew,
Thanks for your suggestions.
I am e newbie, may not state the issue clearly. Let me state it again:
In ./drivers/video/fbmem.c, codes below cause issues:

case FBIOPAN_DISPLAY:
...
console_lock();
ret = fb_pan_display(info, &var);
console_unlock();
...
break;

issue case 1:
1. core 0 call console_lock();
2. panic;
...
4. panic process done.
Result: all panic log won't be printed.

issue case 2:
1. core 0 panic;
2. core 1 call console_lock();
3. core 0 call smp_send_stop(), core1 stop;
4. core 0 panic process done.
Result: only little top part of panic log will be printed.

My soluiton according to your suggestions:

As you said, the first priority is to get oops message reliably
delivered. I think we needn't care about console_sem when panic, just
make sure we print the log imediately, so add
sema_init(&console_sem,1) in bust_spinlocks(0), just like zap_locks()
do. It is safer than console_unlock() or up().

We can't add sema_init(..) in bust_spinlocks(1) due to issue case2,
although the condition is rare. About issue case 2: should we avoid
call console_lock() when panic?

If we init console_sem in panic, old text may be flushed too, but
should be before panic oops message. Also we can fix it by updating
con_start("con_start = log_end") once panic happen, only log after
panic will be printed.


What do you think? Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/