Re: [PATCH] printk: Avoid softlockups in console_unlock()

From: Jan Kara
Date: Mon Jan 21 2013 - 15:59:58 EST


On Thu 17-01-13 15:50:29, Andrew Morton wrote:
> On Fri, 18 Jan 2013 00:46:14 +0100
> Jan Kara <jack@xxxxxxx> wrote:
>
> > On Thu 17-01-13 13:39:17, Andrew Morton wrote:
> > > On Thu, 17 Jan 2013 22:04:42 +0100
> > > Jan Kara <jack@xxxxxxx> wrote:
> > >
> > > > ...
> > > >
> > > > So I played a bit with this. To make things easier for me I added an
> > > > artificial mdelay(len*10) (effectively simulating a console able to print
> > > > 100 characters per second) just after call_console_drivers() so that I can
> > > > trigger issues even on a machine easily available to me. Booting actually
> > > > doesn't trigger any problems because there aren't enough things happening
> > > > in parallel on a common machine during boot but
> > > > echo t >/proc/sysrq-trigger &
> > > > for i in /lib/modules/3.8.0-rc3-0-default/kernel/fs/*/*.ko; do
> > > > name=`basename $i`; name=${name%.ko}; modprobe $name
> > > > done
> > > > easily triggers the problem (as modprobe uses both RCU & IPIs to signal all
> > > > CPUs).
> > > >
> > > > Adding
> > > > touch_nmi_watchdog();
> > > > touch_all_softlockup_watchdogs();
> > > > rcu_cpu_stall_reset();
> > >
> > > I'm not sure that touch_all_softlockup_watchdogs() is needed?
> > > touch_nmi_watchdog() itself calls touch_softlockup_watchdog().
> > It is. I've tried without it and the machine died a horrible death
> > because softlockup reports were added to already too heavy printk traffic.
> > The problem is that the CPU doing the printing cannot handle IPIs, so if
> > someone calls e.g. smp_call_function_many(), that function will spin waiting
> > for IPIs on all CPUs to finish. And that doesn't happen until printing is
> > done, so the CPU doing smp_call_function_many() gets locked up as well.
>
> erk. I trust we'll have a nice comment explaining this mechanism ;)
So I was testing the attached patch, which does what we discussed. The bad
news is that I was able to trigger a situation (twice) where sda suddenly
disappeared and thus all IO requests failed with EIO. There is no trace of
what happened in the kernel log. I'm guessing that disabled interrupts on
the printing CPU caused the SCSI layer to time out some request and fail the
device. So where do we go from here?

Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR