On Sat, 3 Jan 2004, Daniel Tram Lux wrote:These systems are going to Texas and I am in Denmark, so I wanted to be sure that the
I tried setting the timeout up as a first fix, it also decreased the frequency of the error,
however it did not get rid of the error.
That is scary beyond belief.
Basically you doubled your timeout, and you certainly _should_ have seen at least one more status read from the extra 50 msec you waited. And since your patch (which fixes it) only adds _one_ status read, that implies that the extra 50 msec timeout didn't get a single time through the loop.
The device the error occurs with is a cf card. The error also occurs
much more frequently in 2.4.23 than in 2.4.20 (but it can be provoked in
2.4.20). Neither use the preemption patch and both are from kernel.org.
The platform is based on an AMD Elan processor which is a 486 compatible
processor, running at 133 Mhz. The IDE subsytem does not use any extra
drivers and is not a PCI ide chipset.
Yes, that path is only used for PIO writes, so it's clearly not a high-performance IDE setup. Adn yes, I guess with a slow enough CPU and enough interrupt load, you literally could spend 50ms just handling irqs. Still, that is pretty damn scary. But I guess the load:
... there is a flood ping running and the machine is being flood pinged + there is traffic on three serial ports (RS485).
is pretty extreme.
The patch seems functionally to be equivalent, I can test it first thing on monday. I agree that the loop
I saw two possibilities, either disabeling the interrupts while first
reading the status and then checking the timeout, after which the
interrupts would be enabled again. Or to just make one extra check after
the timout has expired because that is cheaper than returning, failing
and then resetting the drive. After I applied my patch (using the
5*HZ/100 timeout) my test ran for a full weekend without giving the
timeout error.
Ok, I think I'm convinced. That loop and the IDE usage of interrupt
enables is just crap. I don't think your addition is very pretty, but the alternative is to rewrite the loop to be sane, which isn't going to happen in a stable kernel.
How about a slightly more minimal patch, though? Ie does this work for you?
Linus
----
===== drivers/ide/ide-iops.c 1.18 vs edited =====
--- 1.18/drivers/ide/ide-iops.c Wed Jun 11 18:23:09 2003
+++ edited/drivers/ide/ide-iops.c Sat Jan 3 10:54:21 2004
@@ -647,6 +647,15 @@
timeout += jiffies;
while ((stat = hwif->INB(IDE_STATUS_REG)) & BUSY_STAT) {
if (time_after(jiffies, timeout)) {
+ /*
+ * One last read after the timeout in case
+ * heavy interrupt load made us not make any
+ * progress during the timeout..
+ */
+ stat = hwif->INB(IDE_STATUS_REG);
+ if (!(stat & BUSY_STAT))
+ break;
+
local_irq_restore(flags);
*startstop = DRIVER(drive)->error(drive, "status timeout", stat);
return 1;