Re: [PATCH 0/4 v6] Avoid softlockups in console_unlock()

From: Jan Kara
Date: Sun Sep 22 2013 - 17:48:51 EST


On Thu 19-09-13 14:26:27, Andrew Morton wrote:
> On Thu, 5 Sep 2013 17:46:12 +0200 Jan Kara <jack@xxxxxxx> wrote:
> > On Fri 23-08-13 12:58:22, Andrew Morton wrote:
> > > On Fri, 23 Aug 2013 21:48:36 +0200 (CEST) Jiri Kosina <jkosina@xxxxxxx> wrote:
> > >
> > > > > > We have customers (quite a few of them actually) which have machines with
> > > > > > lots of SCSI disks attached (due to multipath etc.) and during boot when
> > > > > > these disks are discovered and partitions set up quite some printing
> > > > > > happens - multiplied by the number of devices (1000+) it is too much for a
> > > > > > serial console to handle quickly enough. So these machines aren't able to
> > > > > > boot with serial console enabled.
> > > > >
> > > > > It sounds like rather a corner case, not worth mucking up the critical
> > > > > core logging code.
> > > >
> > > > Andrew, I have to admit I don't understand this argument at all.
> > >
> > > Of course you do. printk should be simple, robust and have minimum
> > > dependency on other kernel parts.
> > >
> > > I suppose that if you make the proposed
> > > /proc/sys/kernel/max_printk_chars settable from the boot command line
> > > and default to zero, any risks are minimized.
> > That's easy enough to do so if it makes you happy I'll go for that.
> > During my vacation I was also thinking how I could address some of your
> > concerns. The only idea I found plausible was a scheme where CPU that
> > wants to stop printing would raise some flag but still keep printing
> > releasing and reacquiring the console_sem from time to time. In
> > console_trylock_for_printk() we would block waiting for console_sem
> > if we see the flag raised.
> >
> > This way we would be guaranteed someone has really taken over printing
> > before we leave console_unlock(). We would still need to use irq_work so
> > that we have someone to take over printing in case printk storm has filled
> > our dmesg buffer and we are now slowly getting it out to the console.
> >
> > So all in all this would be a bit more complex than my current solution
> > (additional flag and some logic around it). The advantage is that we would
> > rely on irq_work only to achieve reasonable irq latency but it won't be
> > necessary for getting printk out to console. If this addresses your
> > concerns better I could try implementing that. Thoughts?
> >
>
> What driver is in use here, anyway?
> Is it drivers/tty/serial/8250/8250_early.c?
It is 8250 serial (at least in the report I checked just now). However, I
don't think it's the early console - the lockup reports contain traces e.g.
with serial8250_console_putchar(), which belongs to the standard serial
console (but I may be wrong here).

> I'm not sure that I fully understand the problem yet. You say "These
> patches avoid softlockups when a CPU gets caught in console_unlock()
> for a long time during heavy printing from other CPU". But *why* is
> the printing CPU holding the lock for so long? A single printk won't
> take a huge amount of time, so that CPU must be spinning waiting for
> previously-printk'd characters to drain?
Yes. We want to print a message, so we enter console_unlock() on some CPU
and start sending characters via the serial console. While we are doing that,
other CPUs come and add new messages to the kernel buffer. We also have to
send those to the serial console, which further prolongs the window during
which other CPUs have time to add new messages to the buffer. So the time we
can spend in console_unlock() doing printing is currently unbounded - while
other CPUs keep coming with messages, the current holder of console_sem keeps
printing.
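
To make the unbounded case concrete, here is a rough userspace model of that
loop. It is only a sketch, not kernel code: struct console_model,
simulate_console_unlock() and the tick-based rates are names and assumptions
made up for illustration, and max_chars merely stands in for the idea behind
the proposed max_printk_chars limit (0 meaning "no limit").

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the log buffer and a slow console (not kernel code). */
struct console_model {
	size_t pending;	/* characters waiting in the buffer */
	size_t arrival;	/* characters other CPUs add per tick */
	size_t drain;	/* characters the console can send per tick */
};

/*
 * Model one CPU draining the buffer the way a console_sem holder would.
 * Returns the number of ticks spent printing. The loop ends when the
 * buffer is empty, when max_chars characters have been printed
 * (max_chars == 0 disables the limit), or when tick_cap is reached.
 * tick_cap exists only so the simulation itself terminates.
 */
size_t simulate_console_unlock(struct console_model *m, size_t max_chars,
			       size_t tick_cap)
{
	size_t ticks = 0;
	size_t printed = 0;

	assert(m->drain > 0);

	while (m->pending > 0 && ticks < tick_cap) {
		size_t n = m->pending < m->drain ? m->pending : m->drain;

		m->pending -= n;		/* send to the console */
		printed += n;
		m->pending += m->arrival;	/* other CPUs keep logging */
		ticks++;

		if (max_chars != 0 && printed >= max_chars)
			break;			/* give up printing */
	}
	return ticks;
}
```

With pending = 100, arrival = 10 and drain = 10, the unlimited call only
stops at tick_cap, because the buffer never gets shorter. With max_chars = 50
the same setup returns after 5 ticks and leaves the remaining characters for
someone else to print.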

> Or something else?
> See, if we can get to the bottom of this then perhaps we can pace the
> printing CPU so that it somehow twiddles thumbs outside console_lock().
I've actually tried to do this in some of my previous attempts to tackle
the problem. The problem I hit with that approach was that printk() itself
is often called with interrupts disabled. So to be able to reenable
interrupts within a reasonable time, we have to return from such a printk()
call first.

Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR