Re: in 2.6.23-rc3-git7 in do_cciss_intr

From: Jens Axboe
Date: Wed Nov 19 2008 - 03:54:29 EST


On Tue, Nov 18 2008, Randy Dunlap wrote:
> Randy Dunlap wrote:
> > Randy Dunlap wrote:
> >> Miller, Mike (OS Dev) wrote:
> >>>> -----Original Message-----
> >>>> From: Randy Dunlap [mailto:randy.dunlap@xxxxxxxxxx]
> >>>> Sent: Thursday, September 25, 2008 3:40 PM
> >>>> To: scsi
> >>>> Cc: Jens Axboe; Miller, Mike (OS Dev); James Bottomley; lkml; akpm
> >>>> Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr
> >>>>
> >>>> On Thu, 25 Sep 2008 13:33:07 -0700 Randy Dunlap wrote:
> >>>>
> >>>>> Jens Axboe wrote:
> >>>>>> On Thu, Sep 04 2008, Miller, Mike (OS Dev) wrote:
> >>>>>>>>>>> 0x3bb2 <do_cciss_intr+1649>: mov 0x2(%r8),%dx
> >>>>>>>>>>> 0x3bb7 <do_cciss_intr+1654>: test %dx,%dx
> >>>>>>>>>>> 0x3bba <do_cciss_intr+1657>: je 0x3f0e
> >>>> <do_cciss_intr+2509>
> >>>>>>>>>>> $ addr2line -e cciss.o -f do_cciss_intr+0x627 SA5_fifo_full
> >>>>>>>>>>>
> >>>> /home/rdunlap/linsrc/linux-2.6.27-rc3-git7/drivers/block/cciss.h:2
> >>>>>>>> 06
> >>>>>>>>>> OK ...that's confusing. It seems to be saying that
> >>>> ctrlr_info_t
> >>>>>>>>>> * was NULL. However, I can't see a way of getting into the
> >>>>>>>> fifo_full
> >>>>>>>>>> callback from do_cciss_intr ..
> >>>>>>>>>> especially not with an NULL host.
> >>>>>>>>>>
> >>>>>>>>>> James
> >>>>>>>>> That is weird. Even if we could get there fifo_full doesn't
> >>>>>>>> do anything but wait for a bit.
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> This just happened again. This time it's on 2.6.27-rc5-git3.
> >>>>>>>>
> >>>>>>>> ~Randy
> >>>>>>> Thanks Randy. I think. :)
> >>>>>>>
> >>>>>>> I'll try to recreate in my lab.
> >>>>>> This looks somewhat strange, mostly like 'c' is NULL and it's
> >>>>>> oopsing in in removeQ (I don't think Randy's analysis is
> >>>> correct in
> >>>>>> assuming it's 'h' and it's in fifo_full). Given that 'c'
> >>>> cannot be
> >>>>>> NULL, it's c->prev or c->next that are NULL.
> >> This BUG: has happened (now) 5 times today. Higher frequency than usual for
> >> some reason.
> >>
> >> I enabled CCISS_DEBUG and added one printk in removeQ(). On the first call
> >
> > s/first/second/
> >
> >
> >> to removeQ(), both c->next and c->prev are NULL.
> >>
> >> Here's the kernel log output from cciss:
>
> I added a printk() in addQ() as well. Here's the new output:
>
> HP CISS Driver (v 3.6.20)
> ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 54
> cciss 0000:42:08.0: PCI INT A -> Link[LNKA] -> GSI 54 (level, high) -> IRQ 54
> command = 147
> irq = 36
> board_id = 3211103c
> cciss 0000:42:08.0: irq 87 for MSI/MSI-X
> address 0 = fdf80000
> cfg base address = 10
> cfg base address index = 0
> cfg offset = 400
> Controller Configuration information
> ------------------------------------
> Signature = CISS
> Spec Number = 1
> Transport methods supported = 0x6
> Transport methods active = 0x3
> Requested transport Method = 0x0
> Coalesce Interrupt Delay = 0x0
> Coalesce Interrupt Count = 0x1
> Max outstanding commands = 0x256
> Bus Types = 0x200000
> Server Name =
> Heartbeat Counter = 0x1672
>
>
> Trying to put board into Simple mode
> I counter got to 1 0
> Controller Configuration information
> ------------------------------------
> Signature = CISS
> Spec Number = 1
> Transport methods supported = 0x6
> Transport methods active = 0x3
> Requested transport Method = 0x0
> Coalesce Interrupt Delay = 0x0
> Coalesce Interrupt Count = 0x1
> Max outstanding commands = 0x256
> Bus Types = 0x200000
> Server Name =
> Heartbeat Counter = 0x1672
>
>
> cciss0: <0x3238> at PCI 0000:42:08.0 IRQ 87 using DAC
> cciss: intr_pending 8
> cciss: addQ: Qptr=ffff88027e0100b8, c=ffff88007f83e000
> cciss: removeQ: Qptr=ffff88027e0100b8, c=ffff88007f83e000, next=ffff88007f83e000, prev=ffff88007f83e000
> Sending 7f83e000 - down to controller
> cciss: addQ: Qptr=ffff88027e0100c0, c=ffff88007f83e000
> cciss: intr_pending 8
> cciss: Read 4 back from board
> cciss: removeQ: Qptr=ffff88027e0100c0, c=ffff88007f840000, next=0000000000000000, prev=0000000000000000
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000248

Randy, can you post the debug patch you used? The above goes boom when
it attempts to remove a command that isn't on the list, the Qptr in the
last example should be empty, hence the oops. So I'd be interested in
seeing what removeQ() calls this is, I'm assuming it's this bit in
do_cciss_intr():

...
while (c->busaddr != a) {
c = c->next;
if (c == h->cmpQ)
break;
}
}
/*
* If we've found the command, take it off the
* completion Q and free it
*/
if (c->busaddr == a) {
removeQ(&h->cmpQ, c);
if (c->cmd_type == CMD_RWREQ) {
complete_command(h, c, 0);
...

If so, what part of the c lookup are you hitting - the on that does:

c = h->cmd_pool + a2;

or the c->busaddr check that his shown above?

--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/