Re: no DRQ after issuing WRITE was Re: 2.4.23-uv3 patch set released

From: Daniel Tram Lux
Date: Sat Jan 03 2004 - 14:29:33 EST


Linus Torvalds wrote:

On Sat, 3 Jan 2004, Daniel Tram Lux wrote:


I tried setting the timeout up as a first fix, it also decreased the frequency of the error,
however it did not get rid of the error.



That is scary beyond belief.

Basically you doubled your timeout, and you certainly _should_ have seen at least one more status read from the extra 50 msec you waited. And since your patch (which fixes it) only adds _one_ status read, that implies that the extra 50 msec timeout didn't get a single time through the loop.



The device the error occurs with is a cf card. The error also occurs
much more frequently in 2.4.23 than in 2.4.20 (but it can be provoked in
2.4.20). Neither use the preemption patch and both are from kernel.org.
The platform is based on an AMD Elan processor which is a 486 compatible
processor, running at 133 Mhz. The IDE subsytem does not use any extra
drivers and is not a PCI ide chipset.



Yes, that path is only used for PIO writes, so it's clearly not a high-performance IDE setup. Adn yes, I guess with a slow enough CPU and enough interrupt load, you literally could spend 50ms just handling irqs. Still, that is pretty damn scary. But I guess the load:



... there is a flood ping running and the machine is being flood pinged + there is traffic on three serial ports (RS485).



is pretty extreme.

These systems are going to Texas and I am in Denmark, so I wanted to be sure that the
system also works under heavy load. This is not the typical use however ;-).




I saw two possibilities, either disabeling the interrupts while first
reading the status and then checking the timeout, after which the
interrupts would be enabled again. Or to just make one extra check after
the timout has expired because that is cheaper than returning, failing
and then resetting the drive. After I applied my patch (using the
5*HZ/100 timeout) my test ran for a full weekend without giving the
timeout error.



Ok, I think I'm convinced. That loop and the IDE usage of interrupt
enables is just crap. I don't think your addition is very pretty, but the alternative is to rewrite the loop to be sane, which isn't going to happen in a stable kernel.

How about a slightly more minimal patch, though? Ie does this work for you?

Linus

----
===== drivers/ide/ide-iops.c 1.18 vs edited =====
--- 1.18/drivers/ide/ide-iops.c Wed Jun 11 18:23:09 2003
+++ edited/drivers/ide/ide-iops.c Sat Jan 3 10:54:21 2004
@@ -647,6 +647,15 @@
timeout += jiffies;
while ((stat = hwif->INB(IDE_STATUS_REG)) & BUSY_STAT) {
if (time_after(jiffies, timeout)) {
+ /*
+ * One last read after the timeout in case
+ * heavy interrupt load made us not make any
+ * progress during the timeout..
+ */
+ stat = hwif->INB(IDE_STATUS_REG);
+ if (!(stat & BUSY_STAT))
+ break;
+
local_irq_restore(flags);
*startstop = DRIVER(drive)->error(drive, "status timeout", stat);
return 1;


The patch seems functionally to be equivalent, I can test it first thing on monday. I agree that the loop
should be rewritten too, it is neither clear for reading nor very accurate when it comes to measuring
the timeout.
I also was planning to test this patch with a reduced timeout of 3*HZ/100, do you think that is worth
testing?

Daniel
P.S.
I will try to make my patches more minimal in future ;-).

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/