Re: 2.6.14-rc5-mm1

From: Andrew Morton
Date: Mon Oct 24 2005 - 16:17:19 EST


Badari Pulavarty <pbadari@xxxxxxxxx> wrote:
>
> On Mon, 2005-10-24 at 01:48 -0700, Andrew Morton wrote:
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.14-rc5/2.6.14-rc5-mm1/
> >
>
> I can't seem to keep my AMD64 machine up with 2.6.14-rc5-mm1.
> Keep running into following. qlogic driver problem ?

I don't know why the qlogic driver has suddenly started doing this - were
there any earlier messages which might tell us? Is it possible to increase
the debugging level?

I can spot one bug in there, but the lockup is just a symptom.

There are no qlogic changes in 2.6.14-rc5-mm1.

> Thanks,
> Badari
>
> NMI Watchdog detected LOCKUP on CPU 0
> CPU 0
> Modules linked in: qlogicfc qla2300 qla2200 qla2xxx firmware_class
> ohci_hcd hw_r andom usbcore dm_mod
> Pid: 4414, comm: modprobe Not tainted 2.6.14-rc5-mm1 #3
> RIP: 0010:[<ffffffff80401d94>] <ffffffff80401d94>{.text.lock.spinlock
> +34}

That's a corrupted spinlock.

> RSP: 0000:ffff81019da91c28 EFLAGS: 00000086
> RAX: 0000000000000000 RBX: ffff81017991f288 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000286 RDI: ffff81017991f290
> RBP: ffff81019da91c28 R08: 0000000000000034 R09: 0000000000000000
> R10: ffff81019da91c08 R11: 0000000000000000 R12: ffff81017991f290
> R13: 00000000fffffff4 R14: ffff81017991c000 R15: 0000000000000100
> FS: 00002aaaaade36e0(0000) GS:ffffffff80649800(0000)
> knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000836018 CR3: 00000001bdf2a000 CR4: 00000000000006e0
> Process modprobe (pid: 4414, threadinfo ffff81019da90000, task
> ffff81019e3bb500)
> Stack: ffff81019da91cb8 ffffffff80400dec ffff81019da91c78
> ffffffff80142ed0
> 0000000000000000 0000000000000000 0000000000000001
> 0000000000000001
> ffff81017991c000 0000000000000100
> Call Trace:<ffffffff80400dec>{wait_for_completion+28}
> <ffffffff80142ed0>{group_s end_sig_info+128}
> <ffffffff80142f89>{kill_proc_info+73}
> <ffffffff8805753c>{:qla2xxx:qla2x00 _free_device+188}
> <ffffffff8805880b>{:qla2xxx:qla2x00_probe_one+4187}
> <ffffffff80132ce0>{set_cpus_allowed+336}
> <ffffffff8012dca3>{__wake_up_com mon+67}
> <ffffffff8029741a>{kobject_get+26}
> <ffffffff8807c01d>{:qla2200:qla2200_pr obe_one+13}
> <ffffffff802a516c>{pci_device_probe+252}
> <ffffffff80308d29>{driver_probe_ device+89}
> <ffffffff80308da0>{__driver_attach+0}
> <ffffffff80308de2>{__driver_attach+ 66}
> <ffffffff803086af>{bus_for_each_dev+79}
> <ffffffff80308bbc>{driver_attach+ 28}
> <ffffffff80308108>{bus_add_driver+136}
> <ffffffff80308f0c>{driver_register +76}
> <ffffffff802a4ebe>{pci_register_driver+158}
> <ffffffff8804d010>{:qla2200:q la2200_init+16}
> <ffffffff80154072>{sys_init_module+274}
> <ffffffff8010dd2e>{system_call+12 6}

qla2x00_probe_one() has called qla2x00_free_device() and
qla2x00_free_device() has locked up in
wait_for_completion(&ha->dpc_exited);

Presumably, ha->dpc_exited is not initialised yet.

The first `goto probe_failed' in qla2x00_probe_one() will cause
qla2x00_free_device() to run wait_for_completion() against an uninitialised
completion struct. Because ha->dpc_pid will be >= 0.

This patch might fix the lockup, but if so, qla2x00_iospace_config()
failed. Please debug that a bit for us?

Andrew, this driver should be converted to use the kthread API - using
kill_proc() from within a driver to terminate a kernel thread is kinda
gross.


diff -puN drivers/scsi/qla2xxx/qla_os.c~qlogic-lockup-fix drivers/scsi/qla2xxx/qla_os.c
--- 25/drivers/scsi/qla2xxx/qla_os.c~qlogic-lockup-fix Mon Oct 24 14:14:20 2005
+++ 25-akpm/drivers/scsi/qla2xxx/qla_os.c Mon Oct 24 14:14:37 2005
@@ -1325,6 +1325,8 @@ int qla2x00_probe_one(struct pci_dev *pd
ha->brd_info = brd_info;
sprintf(ha->host_str, "%s_%ld", ha->brd_info->drv_name, ha->host_no);

+ ha->dpc_pid = -1;
+
/* Configure PCI I/O space */
ret = qla2x00_iospace_config(ha);
if (ret)
@@ -1448,7 +1450,6 @@ int qla2x00_probe_one(struct pci_dev *pd
*/
spin_lock_init(&ha->mbx_reg_lock);

- ha->dpc_pid = -1;
init_completion(&ha->dpc_inited);
init_completion(&ha->dpc_exited);

_

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/