Re: BUG: [2.4.18+] IDE Race Condition

From: Ross Biro (rossb@google.com)
Date: Tue Jan 28 2003 - 12:42:21 EST


Sorry, we are not using sub-microsecond timers; we are just using the
latest-greatest-buggiest hard drives, and lots of them. We can see
problems like this because we run on so many machines at once. Even
something that has only a small chance of showing up, we see
pretty quickly.

We do have a bus analyzer that can time things down to about 4ns. That
has been invaluable for tracking down some of these issues.

    Ross

Andre Hedrick wrote:

>Ross,
>
>How did you create the sub-microsecond timers to profile the device/driver
>behavior? I had been working on this for a while but with little success.
>This is one of the key methods to predict device failure.
>
>One of the goals of the prebuilder is to slam prebuild commands down the
>pipes to force breakages and races to show up.
>
>So you have a possible solution?
>
>Cheers,
>
>Andre Hedrick
>LAD Storage Consulting Group
>
>
>On Tue, 28 Jan 2003, Ross Biro wrote:
>
>>Easy, it happens all the time, there are just no tests in place to see it.
>>
>>We are keeping a histogram of how long every IDE DMA transfer takes.
>>In ide_intr we record the time and set start_time in
>>ide_drive_t to 0. In ide_dma_proc, ide_dma_begin, right AFTER activating
>>the DMA, we store the current value of jiffies in start_time in ide_drive_t.
>>
>>In both those places we check to make sure that the value of start_time
>>is sensible. In ide_dma_begin, we make sure it's 0, and in ide_dma_intr,
>>we make sure it's non-zero. Because of this race condition, we often saw
>>DMAs finish before they began.
>>
>>In the normal kernel, the only thing I can see that could go wrong would
>>be that the printk
>>
>>printk("%s: ide_intr: huh? expected NULL handler on exit\n", drive->name);
>>
>>in ide_intr could be triggered. I've never seen it happen, but I
>>believe with enough effort, it could be made to occur.
>>
>> Ross
>>
>>
>>Andre Hedrick wrote:
>>
>>>Okay, how do you reproduce it to see the effects?
>>>
>>>On Mon, 27 Jan 2003, Ross Biro wrote:
>>>
>>>>The net effect of this race condition and the other one I spotted is
>>>>that you may see some interesting messages in your log file and you can
>>>>detect the race condition if you look for it hard enough. I don't
>>>>currently see any bad effects.
>>>>
>>>> Ross
>>>>
>>>>Ross Biro wrote:
>>>>
>>>>>There is at least one more IDE race condition in 2.4.18 and
>>>>>2.4.21-pre3. Basically the interrupt for the controller being serviced
>>>>>is left on while setting up the next command. I'm not sure how much
>>>>>trouble it can cause but it does lead to some interesting stack traces.
>>>>>
>>>>>The condition
>>>>>if (masked_irq && hwif->irq != masked_irq)
>>>>>in ide_do_request should be replaced with
>>>>>if (!masked_irq || hwif->irq != masked_irq)
>>>>>in two places.
>>>>>
>>>>-
>>>>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>>>the body of a message to majordomo@vger.kernel.org
>>>>More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>Please read the FAQ at http://www.tux.org/lkml/
>>>>
>>>Andre Hedrick
>>>LAD Storage Consulting Group
>
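The start_time bookkeeping described in the quoted message can be
sketched in user space as follows. This is only an illustration under
stated assumptions, not the instrumentation actually used: struct
drive_sim, dma_begin_check and dma_intr_check are hypothetical names,
the tick value passed in stands in for jiffies, and the sketch assumes
a tick value of 0 never occurs so that 0 can mean "no DMA in flight".

```c
/* Hypothetical stand-in for the one ide_drive_t field discussed above. */
struct drive_sim {
    unsigned long start_time;   /* 0 means "no DMA in flight" (assumption) */
    unsigned int  anomalies;    /* number of failed sanity checks */
};

/* Called right after the DMA is activated: start_time should still be 0. */
static void dma_begin_check(struct drive_sim *d, unsigned long now)
{
    if (d->start_time != 0)
        d->anomalies++;         /* an earlier begin was never cleared */
    d->start_time = now;
}

/*
 * Called from the interrupt path: start_time should be non-zero.
 * Returns the elapsed ticks for the histogram, or -1 if the check failed.
 */
static long dma_intr_check(struct drive_sim *d, unsigned long now)
{
    long elapsed = -1;

    if (d->start_time == 0)
        d->anomalies++;         /* completion seen before begin was recorded */
    else
        elapsed = (long)(now - d->start_time);

    d->start_time = 0;
    return elapsed;
}
```

Calling dma_intr_check before dma_begin_check, which is the ordering
the race produces, bumps the anomaly counter and leaves a stale
start_time for the next begin to trip over. The elapsed values from the
normal ordering are what would feed the histogram.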

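To make the difference between the two ide_do_request conditions
concrete, here is a small sketch that evaluates both as plain
predicates. The function names cond_old and cond_new are made up for
illustration, the IRQ numbers used when calling them are arbitrary, and
the sketch assumes that a masked_irq of 0 means the caller has no
interrupt masked, which the thread implies but does not state.

```c
#include <stdbool.h>

/* Condition as quoted from 2.4.18 / 2.4.21-pre3 in the report. */
static bool cond_old(int masked_irq, int hwif_irq)
{
    return masked_irq && hwif_irq != masked_irq;
}

/* Replacement condition proposed in the report. */
static bool cond_new(int masked_irq, int hwif_irq)
{
    return !masked_irq || hwif_irq != masked_irq;
}
```

The two predicates agree whenever masked_irq is non-zero. They differ
only when masked_irq is 0, where the original is always false and the
replacement is always true.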

This archive was generated by hypermail 2b29 : Fri Jan 31 2003 - 22:00:20 EST