Hi,
I am looking into silent crashes on some of our boxes running with
ICP vortex controllers. I have seen CPU lockups with the gdth driver
from the icp-vortex pages but also with the plain kernel driver. While
staring at the code, trying to explain a backtrace from a
NMI Watchdog detected LOCKUP on CPU0, eip c01c877d, registers:
[...]
>>EIP; c01c877d <.text.lock.gdth+e1/154> <=====
Trace; c01c7753 <gdth_interrupt+613/620>
Trace; c01c5564 <gdth_wait+5c/ac>
Trace; c01c5726 <gdth_internal_cmd+172/1b8>
Trace; c01c825e <gdth_eh_bus_reset+18e/304>
Trace; c01b4c33 <scsi_try_bus_reset+73/f4>
Trace; c01b53b7 <scsi_unjam_host+34f/724>
Trace; c01b58fb <scsi_error_handler+16f/1cc>
Trace; c01070d4 <kernel_thread+28/38>
I think I spotted a race:
gdth.c:gdth_eh_bus_reset
4478         if (ha->hdr[i].present) {
4479             GDTH_LOCK_HA(ha, flags);
4480             gdth_polling = TRUE;
4481             while (gdth_test_busy(hanum))
4482                 gdth_delay(0);
4483             if (gdth_internal_cmd(hanum, CACHESERVICE,
4484                                   GDT_CLUST_RESET, i, 0, 0))
4485                 ha->hdr[i].cluster_type &= ~CLUSTER_RESERVED;
4486             gdth_polling = FALSE;
4487             GDTH_UNLOCK_HA(ha, flags);
4488         }
If we catch an interrupt between acquiring GDTH_LOCK_HA and
setting gdth_polling = TRUE, we execute:
gdth.c:gdth_interrupt
3181     if (!gdth_polling)
3182         GDTH_LOCK_HA((gdth_ha_str *)dev_id,flags);
Now we should be stuck.
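
To make the suspected window concrete, here is a minimal user-space sketch
of the sequence described above. It is only a model under loud assumptions:
the names model_ha, model_interrupt, model_bus_reset and the irq_result
values are made up for illustration and do not exist in gdth.c, the lock is
a plain flag instead of a real spinlock, and the model simply assumes an
interrupt can be delivered while the lock is held. It does not look at
whether GDTH_LOCK_HA masks local interrupts.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-adapter state; not the real gdth_ha_str. */
struct model_ha {
    bool lock_held;   /* models the lock taken by GDTH_LOCK_HA */
    bool polling;     /* models the gdth_polling flag */
};

enum irq_result { IRQ_HANDLED, IRQ_WOULD_SPIN };

/* Models the start of gdth_interrupt: take the lock unless we are polling. */
enum irq_result model_interrupt(struct model_ha *ha)
{
    bool took_lock = false;

    if (!ha->polling) {
        if (ha->lock_held)
            return IRQ_WOULD_SPIN;  /* a real spinlock would spin here */
        ha->lock_held = true;
        took_lock = true;
    }
    /* ... the controller would be serviced here ... */
    if (took_lock)
        ha->lock_held = false;
    return IRQ_HANDLED;
}

/*
 * Models the reset path. If irq_in_window is set, an interrupt is injected
 * between taking the lock and setting the polling flag.
 */
enum irq_result model_bus_reset(struct model_ha *ha, bool irq_in_window)
{
    enum irq_result r;

    assert(!ha->lock_held);
    ha->lock_held = true;               /* GDTH_LOCK_HA */
    if (irq_in_window) {
        r = model_interrupt(ha);        /* arrives before polling is set */
        if (r == IRQ_WOULD_SPIN)
            return r;                   /* stuck: the lock is never released */
    }
    ha->polling = true;                 /* gdth_polling = TRUE */
    r = model_interrupt(ha);            /* polled call, skips the lock */
    ha->polling = false;                /* gdth_polling = FALSE */
    ha->lock_held = false;              /* GDTH_UNLOCK_HA */
    return r;
}
```

Calling model_bus_reset(&ha, false) on a zeroed model_ha returns IRQ_HANDLED
and leaves the lock released. Calling model_bus_reset(&ha, true) returns
IRQ_WOULD_SPIN with the lock still held, which is the stuck state I mean.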
Looking at the "vmstat 30" output running on the serial console,
I would guess that only one "Host Drive" on the controller got stuck
but not the other. The machine only has drives on the RAID controller:
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1  0  0  14008   4348   4144 802828   1   0  6390    13 4828  1765  14  22  64
 0  0  1  14008   4472   4148 802596   0   0  6254    23 4808  1717   8  22  70
 0  0  0  14008   4420   4128 802892   0   0  6186     8 4747  1721   7  21  72
 3  0  0  14008   4444   4148 803084   0   0  6195    13 4709  1717   6  20  74
 0  1  1  14008   6104   4072 803216   0   0  5759     9 4548  1690   5  18  77
 0  1  0  14008   4428   4080 804840   0   0    56     8  490   332   2   2  96
 0  1  0  14008   6676   4080 806172   0   0    51    13  726   472   6   3  91
 1  1  0  14008   4848   4084 807900   0   0    51     4  826   464   9   3  87
 0  1  0  14008   5700   4088 808860   0   0    51     8  758   417   9   3  87
 0  1  0  14008   4520   4108 810976   1   0    91     8  722   399   9   3  88
Adapter 0: Host Drive 0: resetted locally
NMI Watchdog detected LOCKUP on CPU0, eip c01c877d, registers:
CPU: 0
EIP: 0010:[<c01c877d>] Not tainted
[...]
I don't think that this specific race can be held responsible for
the lockup I saw, as it locked up in gdth_wait executing gdth_interrupt.
Am I overlooking something?
Flo
-- Florian Lohoff flo@rfc822.org +49-5201-669912 Heisenberg may have been here.