Re: framebuffer corruption due to overlapping stp instructions on arm64

From: Robin Murphy
Date: Mon Aug 06 2018 - 08:42:27 EST


On 06/08/18 11:25, Mikulas Patocka wrote:
[...]
> > None of this explains why some transactions fail to make it across
> > entirely. The overlapping writes in question write the same data to
> > the memory locations that are covered by both, and so the ordering in
> > which the transactions are received should not affect the outcome.
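
For readers following along, the point about ordering can be illustrated with a minimal sketch in plain C. It assumes the usual copy-routine pattern in which a 16..32 byte copy is done as two 16-byte blocks, the second anchored at the end of the buffer; each 16-byte block stands in for one stp of two 64-bit registers. The helper names copy_overlapping() and order_is_irrelevant() are made up for illustration and are not taken from the test program or from any libc.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy n bytes (16 <= n <= 32) as two 16-byte
 * blocks. The second block is anchored at the end of the buffer, so
 * the two blocks overlap by 32 - n bytes and carry identical data in
 * the overlapping region.
 */
static void copy_overlapping(unsigned char *dst, const unsigned char *src,
                             size_t n, int tail_first)
{
    unsigned char head[16], tail[16];

    /* Load both blocks before storing anything. */
    memcpy(head, src, 16);
    memcpy(tail, src + n - 16, 16);

    if (tail_first) {
        memcpy(dst + n - 16, tail, 16);
        memcpy(dst, head, 16);
    } else {
        memcpy(dst, head, 16);
        memcpy(dst + n - 16, tail, 16);
    }
}

/* Returns 1 if both store orders reproduce src for every n in 16..32. */
static int order_is_irrelevant(void)
{
    unsigned char src[32], a[32], b[32];

    for (size_t i = 0; i < 32; i++)
        src[i] = (unsigned char)(i * 7 + 1);

    for (size_t n = 16; n <= 32; n++) {
        memset(a, 0xAA, sizeof(a));
        memset(b, 0x55, sizeof(b));
        copy_overlapping(a, src, n, 0);   /* head store first */
        copy_overlapping(b, src, n, 1);   /* tail store first */
        if (memcmp(a, src, n) != 0 || memcmp(b, src, n) != 0)
            return 0;
    }
    return 1;
}
```

In ordinary memory both orders give the same bytes, which is why reordering alone would not be expected to produce visible corruption. The sketch says nothing about what a particular interconnect or root complex does with such stores.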

> You're right that the corruption couldn't be explained just by reordering
> writes. My hypothesis is that the PCIe controller tries to disambiguate
> the overlapping writes, but the disambiguation logic was not tested and it
> is buggy. If there's a barrier between the overlapping writes, the PCIe
> controller won't see any overlapping writes, so it won't trigger the
> faulty disambiguation logic and it works.
>
> Could the ARM engineers look if there's some chicken bit in Cortex-A72
> that could insert barriers between non-cached writes automatically?

I don't think there is, and even if there was I imagine it would have a pretty hideous effect on non-coherent DMA buffers and the various other places in which we have Normal-NC mappings of actual system RAM.

> I observe these kinds of corruptions:
> - failing to write a few bytes

That could potentially be explained by the reordering/atomicity issues Matt mentioned, i.e. the load is observing part of the store, before the store has fully completed.

> - writing a few bytes that were written 16 bytes before
> - writing a few bytes that were written 16 bytes after

Those sound more like the interconnect or root complex ignoring the byte strobes on an unaligned burst, of which I think the simplistic view would be "it's broken".
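
To make the byte-strobe idea concrete, here is a purely illustrative software model in C, not a description of any real interconnect or of the Armada 8k hardware. It assumes a 16-byte-wide write beat with one strobe bit per byte lane; struct beat, commit_with_strobes(), commit_ignoring_strobes() and corrupted_bytes() are hypothetical names invented for the sketch.

```c
#include <stdint.h>
#include <string.h>

#define BEAT 16  /* assumed beat width in bytes */

/* One write beat: 16 data lanes plus one strobe bit per lane. */
struct beat {
    uint8_t  data[BEAT];
    uint16_t strobe;   /* bit i set => lane i carries valid data */
};

/* Well-behaved target: commits only lanes whose strobe bit is set. */
static void commit_with_strobes(uint8_t *mem, const struct beat *b)
{
    for (int i = 0; i < BEAT; i++)
        if (b->strobe & (1u << i))
            mem[i] = b->data[i];
}

/* Hypothetical faulty target: ignores the strobes, commits every lane. */
static void commit_ignoring_strobes(uint8_t *mem, const struct beat *b)
{
    memcpy(mem, b->data, BEAT);
}

/*
 * Write 4 valid bytes into lanes 12..15 of one beat and count how many
 * bytes of the destination end up different from what was intended.
 */
static int corrupted_bytes(int ignore_strobes)
{
    uint8_t mem[BEAT], expect[BEAT];
    struct beat b;
    int bad = 0;

    memset(mem, 0x11, BEAT);      /* existing destination contents */
    memset(expect, 0x11, BEAT);
    memset(b.data, 0xEE, BEAT);   /* leftover data on the unused lanes */
    b.strobe = 0;

    for (int i = 12; i < BEAT; i++) {
        b.data[i] = 0x42;
        b.strobe |= (uint16_t)(1u << i);
        expect[i] = 0x42;
    }

    if (ignore_strobes)
        commit_ignoring_strobes(mem, &b);
    else
        commit_with_strobes(mem, &b);

    for (int i = 0; i < BEAT; i++)
        if (mem[i] != expect[i])
            bad++;
    return bad;
}
```

In this model the well-behaved target leaves the 12 unstrobed bytes alone, while the faulty one overwrites them with whatever happened to be on the data lanes, for instance bytes belonging to a neighbouring 16-byte block.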

FWIW I stuck my old Nvidia 7600GT card in my Arm Juno r2 board (2x Cortex-A72), built your test program natively with GCC 8.1.1 at -O2, and it's still happily flickering pixels in the corner of the console after nearly an hour (in parallel with some iperf3 just to ensure plenty of PCIe traffic). I would strongly suspect this issue is particular to Armada 8k, so it's probably one for the Marvell folks to take a closer look at - I believe some previous interconnect issues on those SoCs were actually fixable in firmware.

Robin.