Re: [RFC][PATCH -next] pstore: replace spin_lock with spin_trylock_irqsave in panic path

From: Don Zickus
Date: Wed Sep 28 2011 - 10:09:55 EST


On Tue, Sep 27, 2011 at 03:46:08PM -0400, Seiji Aguchi wrote:
> Hi,
>
> >Yes we care - saving panic data is most likely the single most important
> >thing that pstore does. I just have severe doubts that it will actually
> >save anything useful if we just blindly continue if we can't get the lock.
>
> I agree with Tony. We may not get useful information if pstore just blindly continues
> while other cpus are running.
>
> >Is this patch based on a real-life case of a system deadlocking? I'd
> >like to know if we are just talking around the theoretical case that
> >the lock may be held at panic time - or something that has actually been
> >seen in real life.
>
> This patch is _not_ based on real-life case. I would like to avoid potential deadlock.
>
> If Don disagrees to my "return" code, I have another idea which moves pstore_dump() behind smp_send_stop().
> smp_send_stop() stops other cpus by sending IPI.
> So pstore can continue reliably and get useful information by just busting spinlock.

Yeah, Vivek had a similar idea: have the common panic path mimic what
they do with kdump and stop all the cpus except for the crashing one, to
serialize the crashing path. This would allow us to more easily bust
spinlocks without worrying about what the other cpus are doing.

The kdump solution involves using NMI, whereas smp_send_stop() (on x86)
avoids it because of past issues and instead uses the IRQ line. This
won't work if pstore_dump() uses spin_trylock_irqsave(), because the IRQ
line will be disabled and will never get the smp_send_stop() message
(unless I am reading the code wrong).

[reads the kernel/panic.c code] Oh, I see this already exists; you would
just move the smp_send_stop() call up a couple of lines of code.
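
To make the ordering argument concrete, here is a rough user-space model of
it. This is only a sketch, not kernel code: every model_* name is made up
for illustration, the atomic_flag merely stands in for the pstore lock, and
"stopping" a cpu is just clearing a flag. It shows that a trylock-and-return
dump loses the record when the lock is held, while stopping the other cpus
first makes it safe to bust the lock and continue.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_NCPUS 4

/* Stand-in for the pstore lock; not the real kernel spinlock. */
static atomic_flag model_lock = ATOMIC_FLAG_INIT;
static bool model_cpu_running[MODEL_NCPUS];

/* Hypothetical stand-in for smp_send_stop(): halt every cpu but the caller. */
static void model_stop_other_cpus(int self)
{
	for (int cpu = 0; cpu < MODEL_NCPUS; cpu++)
		model_cpu_running[cpu] = (cpu == self);
}

static bool model_others_stopped(int self)
{
	for (int cpu = 0; cpu < MODEL_NCPUS; cpu++)
		if (cpu != self && model_cpu_running[cpu])
			return false;
	return true;
}

/* Returns true if the panic record was written. */
static bool model_dump(int self)
{
	if (atomic_flag_test_and_set(&model_lock)) {
		/* Lock already held. If its owner may still be running,
		 * give up, as in the trylock-and-return approach. */
		if (!model_others_stopped(self))
			return false;
		/* Owner is stopped and can never unlock: bust the lock.
		 * The flag is already set, so we simply treat it as ours. */
	}
	/* ... the record would be written here ... */
	atomic_flag_clear(&model_lock);
	return true;
}

/* Run one modelled panic on cpu 'self' and report whether data was saved. */
static bool model_panic(int self, bool lock_held_elsewhere, bool stop_first)
{
	assert(self >= 0 && self < MODEL_NCPUS);

	for (int cpu = 0; cpu < MODEL_NCPUS; cpu++)
		model_cpu_running[cpu] = true;
	atomic_flag_clear(&model_lock);
	if (lock_held_elsewhere)
		(void)atomic_flag_test_and_set(&model_lock);

	if (stop_first)
		model_stop_other_cpus(self);	/* proposed order */
	bool saved = model_dump(self);
	if (!stop_first)
		model_stop_other_cpus(self);	/* existing order */
	return saved;
}
```

With the lock held elsewhere, model_panic(0, true, false) reports that
nothing was saved, while model_panic(0, true, true) saves the record. The
model says nothing about how the stop request is delivered (IRQ vs NMI),
which is the separate problem raised above.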

[Side note] perhaps we should change the behaviour of smp_send_stop to use
NMI and create a blacklist of machines to use the IRQ line instead. I
assume the list of broken machines is small as Red Hat has been kdumping
kernels since 2.6.18 with little evidence that machines were failing
because NMI wasn't working properly.
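
A sketch of what that default-plus-blacklist selection could look like,
again as a user-space model only. The table, the machine strings, and the
model_* names are all placeholders invented for illustration; no real
machines are being named, and how a machine would actually be identified is
left open.

```c
#include <stddef.h>
#include <string.h>

enum model_stop_method { MODEL_STOP_NMI, MODEL_STOP_IRQ };

/* Placeholder entries: machines assumed to have unreliable NMI delivery. */
static const char *const model_nmi_blacklist[] = {
	"Example Board A",
	"Example Board B",
};

/* Default to NMI; fall back to the IRQ line only for blacklisted machines. */
static enum model_stop_method model_choose_stop_method(const char *machine)
{
	size_t n = sizeof(model_nmi_blacklist) / sizeof(model_nmi_blacklist[0]);

	if (machine != NULL)
		for (size_t i = 0; i < n; i++)
			if (strcmp(machine, model_nmi_blacklist[i]) == 0)
				return MODEL_STOP_IRQ;
	return MODEL_STOP_NMI;
}
```

The point of the shape is that the common case needs no entry at all: only
machines known to be broken are listed, which fits the assumption that the
list would be small.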


Cheers,
Don
