DMA cache consistency bug introduced in 2.6.28 (Was: Re: [Fdutils]Cannot format floppies under kernel 2.6.*?)

From: Alain Knaff
Date: Thu Dec 17 2009 - 12:01:09 EST


On 17/12/09 16:49, Alain Knaff wrote:
> On 17/12/09 16:43, Mark Hounschell wrote:
>> On 12/17/2009 10:35 AM, Alain Knaff wrote:
>>
>>>> Should I do more work in between?
>>>
>>> No, but make sure to look at track 0... Other tracks will still have the
>>> error, as there was nothing forcing a memory flush between track 0 and 1...
>>
>> Ok track 0
> [...]
>> 0: 0
>> 1: 0
>> 2: 0
>> 3: 4f <--
>> 4: 0
>> 5: 1
>> 6: 2
>> no disk change
>
> Yeah, that's what I meant... So the memory flusher program didn't manage to
> clear up the inconsistency...
>
> So either my theory is wrong, or the memory flusher program was not
> efficient enough.... hmmm, maybe doing some surfing in between the formats,
> or doing another kernel compilation might be a better test.
>
> Alain

Ok, so I had a look at the differences between 2.6.27.41 and 2.6.28, and
there have indeed been changes to the iommu and DMA handling code.

So I suspect that the problem may be lying here

Cc'ed Linus and kernel list on this. For Linux and the list, here's the
summary of what we are observing:

- A DMA transfer of a memory block transfers the wrong value for the first
byte of the block. All other bytes of the block are transferred correctly.
The value of the first byte turns out to be the value that this byte held
during the *previous* transfer. Just as if there was some kind of cache,
and the transfer started before that cache was refreshed with the new
values from main memory.

Example:

1. initial contents: 33 44 55 66
2. one DMA transfer is performed
3. program changes buffer to: 77 88 99 aa
4. new DMA transfer is performed => instead it transmits 33 88 99 aa
(i.e. first byte is from previous contents)

This used to work in 2.6.27.41, but broke in 2.6.28 . It doesn't happen on
all hardware though.

It does indeed seem to be related to a DMA-side cache (rather than the
processor's cache not being flushed to main memory), as doing lots of
memory intensive work (kernel compilation) between 2 and 3 doesn't fix the
problem.

In the diff between 2.6.27.41 and 2.6.28, I noticed a lot of changes in
arch/x86/kernel/amd_iommu.c and related files, could any of these have
triggered this behavior?

Any ideas, anybody?

Alain
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/