Re: no DRQ after issuing WRITE was Re: 2.4.23-uv3 patch set released

From: Linus Torvalds
Date: Sat Jan 03 2004 - 14:05:43 EST




On Sat, 3 Jan 2004, Daniel Tram Lux wrote:
>
> I tried setting the timeout up as a first fix, it also decreased the
> frequency of the error,
> however it did not get rid of the error.

That is scary beyond belief.

Basically you doubled your timeout, and you certainly _should_ have seen
at least one more status read from the extra 50 msec you waited. And since
your patch (which fixes it) only adds _one_ status read, that implies that
the extra 50 msec timeout didn't get a single time through the loop.

> The device the error occurs with is a cf card. The error also occurs
> much more frequently in 2.4.23 than in 2.4.20 (but it can be provoked in
> 2.4.20). Neither use the preemption patch and both are from kernel.org.
> The platform is based on an AMD Elan processor which is a 486 compatible
> processor, running at 133 Mhz. The IDE subsytem does not use any extra
> drivers and is not a PCI ide chipset.

Yes, that path is only used for PIO writes, so it's clearly not a
high-performance IDE setup. Adn yes, I guess with a slow enough CPU and
enough interrupt load, you literally could spend 50ms just handling irqs.
Still, that is pretty damn scary. But I guess the load:

> ... there is a flood ping running and the machine is being
> flood pinged + there is traffic on three serial ports (RS485).

is pretty extreme.

> I saw two possibilities, either disabeling the interrupts while first
> reading the status and then checking the timeout, after which the
> interrupts would be enabled again. Or to just make one extra check after
> the timout has expired because that is cheaper than returning, failing
> and then resetting the drive. After I applied my patch (using the
> 5*HZ/100 timeout) my test ran for a full weekend without giving the
> timeout error.

Ok, I think I'm convinced. That loop and the IDE usage of interrupt
enables is just crap. I don't think your addition is very pretty, but the
alternative is to rewrite the loop to be sane, which isn't going to happen
in a stable kernel.

How about a slightly more minimal patch, though? Ie does this work for
you?

Linus

----
===== drivers/ide/ide-iops.c 1.18 vs edited =====
--- 1.18/drivers/ide/ide-iops.c Wed Jun 11 18:23:09 2003
+++ edited/drivers/ide/ide-iops.c Sat Jan 3 10:54:21 2004
@@ -647,6 +647,15 @@
timeout += jiffies;
while ((stat = hwif->INB(IDE_STATUS_REG)) & BUSY_STAT) {
if (time_after(jiffies, timeout)) {
+ /*
+ * One last read after the timeout in case
+ * heavy interrupt load made us not make any
+ * progress during the timeout..
+ */
+ stat = hwif->INB(IDE_STATUS_REG);
+ if (!(stat & BUSY_STAT))
+ break;
+
local_irq_restore(flags);
*startstop = DRIVER(drive)->error(drive, "status timeout", stat);
return 1;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/