Re: [RFC/PATCH] Kdump: Disabling PCI interrupts in capture kernel

From: Vivek Goyal
Date: Sat Jun 04 2005 - 05:44:36 EST


Quoting Greg KH <greg@xxxxxxxxx>:

> On Fri, Jun 03, 2005 at 04:55:24PM +0530, Vivek Goyal wrote:
> > Hi,
> >
> > In kdump, sometimes, general driver initialization issues seems to be
> cropping
> > in second kernel due to devices not being shutdown during crash and these
>
> > devices are sending interrupts while second kernel is booting and drivers
> are
> > not expecting any interrupts yet.
>
> What are the errors you are seeing?

We have observed mainly two kind of problems.

1. Devices A and B are sharing the same irq line. Driver for device A is
initializing and opens the respective irq line(request_irq()).
Device B raises an interrupt. Interrupt handler for device A denies its
not my irq and there is no interrupt handler for device B yet. Device B
keeps the irq line asserted and kernel sees a flood of interrupts and
finally kernel disables the irq line. Now device A's interrupt also stop
coming and driver initialization for device A fails.

We have observed this especially in case of aic7xxx driver and one sample
log is attached with the mail.

2. Second problem is that driver is not expecting an interrupt the moment it
calls request_irq(). Otherwise initialization fails. We saw this problem in
IPS driver initialization failure. (Bugme 4573). Though this seenms to be
more of a driver bug and can be fixed separately.


> How would the drivers be able to be getting interrupts delivered to them
> if they haven't registered the irq handler yet?

True that drivers will not get interrupts until they register request_irq().
In the above mentioned aic7xxx dirver failure, this driver starts getting
interrupts the moment it calls request_irq(). And the fact is that these
interrupts are being generated from some other sharing the same irq line.


>
> > In some cases, we are observing a storm of interrupts while second kernel
>
> > is booting and kernel disables that irq line. May be the case of stuck
> irq
> > line because it is shared level triggered irq and there is no driver
> > loaded for the device.
> >
> > So, we need something generic which disables interrupt generation from
> device
> > until a driver has been registered for that device and driver is ready to
>
> > receive the interrupts. PCI specifications (ver 2.3 onwards), have
> introduced
> > interrupt disable bit in command register to disable interrupt generation
> > from the device. This can become handy here. In capture kernel, traverse
> all
> > the PCI devices, disable interrupt generation. Enable the interrupt
> generation
> > back once the driver for that device registers. May be after the probe
> handler
> > has run. In probe handler, driver can reset the device or register for irq
> so
> > that it can handle any interrupt from the device after that.
> >
> > Greg mentioned that there are reasons that we can not disable all pci
> > interrupts. Meanwhile I am going through archives to find more about it.
>
> You recalled the conversation in the next paragraph:
>
> > In previous conversations, Alan Stern had raised the issue of console
> also
> > not working if interrupts are disabled on all the devices. I am not sure
> > but this should be working at least for serial consoles and vga text
> consoles.
> > May be sufficient to capture the dump.
>
> That's the main objection a lot of people had to this kind of patch,
> last time it was discussed.


It can very well be a big concern when it comes to booting normal kernel but
I believe that for capture kernel (kernel booting after a crash), it might
not be that big an issue. Anyway, capture kernel is booting only to capture
and save the dump to some pre configured storage media. Most probably it
will be done from an initrd or an init script and system will boot into
production kernel back enabling all the consoles.

Alternatively, pci device id of console can be passed to capture kernel
through command line and disabling interrupts can be skipped on console.
This is little hackish though but crash dump is a peculiar case.


Thanks
Vivek

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/