Re: CONFIG_ARM_DMA_MEM_BUFFERABLE and readl/writel weirdness

From: Saravana Kannan
Date: Tue Mar 08 2011 - 23:37:54 EST


Discussed this further with our resident ARM expert. The rest of the email is me understanding his comments and rephrasing them to suit this thread.

On 03/03/2011 02:11 AM, Catalin Marinas wrote:
On Thu, 2011-03-03 at 07:49 +0000, Saravana Kannan wrote:
On 03/02/2011 12:39 AM, Russell King - ARM Linux wrote:
On Tue, Mar 01, 2011 at 05:23:15PM -0800, Saravana Kannan wrote:
If I'm not missing some magic, this would mean that
"CONFIG_ARM_DMA_MEM_BUFFERABLE" determines if readl(s)/writel(s) get to
have a built in mb() or not.
[...]
I think you misunderstand what's going on. IO accesses are always ordered
with respect to themselves. The barriers are there to ensure ordering
between DMA coherent memory (normal non-cached memory) and IO accesses
(device).

Unfortunately this is not correct. The ARM spec doesn't guarantee that
all IO accesses should be ordered with respect to themselves. It only
requires that the ordering should be guaranteed at least within a 1KB
region.

That's because the CPU does not have control of the delays on various
buses. But a device connected to the same bus receives the accesses in
order.

From the ARM spec, at the least, we have established that IO accesses are not guaranteed to be ordered with respect to all IO accesses. So, I kindly request that we no longer perpetuate the incorrect statement that IO accesses are ordered wrt to all IO accesses.

The only continuing disagreement is on whether the minimum limit is 1KB or "the entire address space of a given device". More on that below.

And the most critical point is hidden in a comment that goes:
"The size of a memory mapped peripheral, or a block of memory, is
IMPLEMENTATION DEFINED, but is not smaller than 1KByte."

I guess most of the confusion is due to the ARM spec not being very
obvious about the 1KB limitation.

What that means is that the hardware shouldn't have two different buses
(possibly with different delays) within a 1KB range.

Rephrasing comments from our resident expert:
This 1KB limit is not really tied to there being different "speeds" on the buses. Rather it is a fundamental topological issue. It has to do with how the "core" interleaves the interconnect. If the interleave is at the 1KB level, there are two different bus interfaces at the 1KB interleave. So even if the downstream "device" has a single *input* bus, the processor and the interconnect have two different paths to get to that device, if the address space spans 1KB. So there is no guaranteed order across the 1KB boundary, even if in concept the two sides of the boundary are talking to the same "device".

Even if accesses to all peripherals are ordered, you still cannot
guarantee that a writel() would change the state of a device (and that's
specific to all architectures). Sometimes if you want to make sure the
device state changed you need a readl() back.

Yes, a device might take much longer to change the state or process the command after the write completes. I realize that has nothing to do with memory barriers.

So, going back to my point, I think it's wrong for
CONFIG_ARM_DMA_MEM_BUFFERABLE to control how stuff unrelated to DMA behaves.

In an ideal world, all driver authors know what memory ordering is and
add the necessary barriers. But since that's not the case, the only way
to get ordering between Normal Non-cacheable access (DMA buffer) and
Device access (via writel) is to add the mb() in the I/O accessors.

This has been discussed at length on several occasions on linux-arch and
LKML. We don't have other solution since adding barriers to drivers
wasn't found feasible by others. Of course, if you can optimised your
driver to use the relaxed accessors and add explicit barriers.

I think my comment for clean up was misunderstood to be more that what I intend. I will address this at the end of this email.

I have also encountered a few people who kept went "but readl/writel was
recently changed to add mem barriers, so we can all remove the mb()s in
our driver (unrelated to DMA) code".

But why did they have those barriers around I/O accessors in the first
place? As you say, nothing related to DMA. If you access two different
devices and want to ensure an ordering of the state changes, I doubt a
barrier would help. Most likely you need a read back from the device.

These barriers were needed because the device spanned more than 1KB of address space. See below.

That would have made their code incorrect for two reasons:
1. readl/writel doesn't always have a mem barrier because of config that
can be turned off.
2. In cases where readl/writel didn't have mb(), there is not enough
ordering guarantee without an explicit mb().

See my comment above, what do they try to achieve by using mb() around
already ordered I/O accessors?

I think as a community, we should stop saying that readl/writel ensures
ordering with respect to all IO accesses. It doesn't even guarantee
ordering within the same device (when their register regions are> 1KB).
>
It definitely guarantees ordering to the same device. The 1KB is a
minimum limit but there is no upper bound (and it should definitely
cover a single device).

Rephrasing comments from our resident expert:
That is not true. It only guarantees ordering to the 1KB boundary. That is the implementation-defined limit of what constitutes a "device". Any "device" which crosses that boundary is viewed as two separate devices from the core's perspective, and there is no guaranteed order, absent an explicit barrier operation.

The 1KB is the minimum size that the architecture allows the hardware to establish as its limit for guaranteed ordering.

It is not true that the hardware is obligated to go *above* that limit just because a given "device" happens to have address space that crosses the boundary. As soon as you have two separate interconnect ports and thus two separate paths to get to the device, because it crossed the 1KB boundary, then you have the potential of being out-of-order. Software is obligated to use explicit barriers on accesses that cross that threshold, regardless of whether *physically* the accesses end up at the same downstream "device".

After reading the above, please let me know if a patch to decouple the
"readl/writel with builtin mb()" from CONFIG_ARM_DMA_MEM_BUFFERABLE
would be accepted. If so, I can go ahead and send it out soon.

No, I don't think this would be acceptable. I can point you to past
discussion where Linus stated that Normal Non-cacheable memory and I/O
accesses should be ordered.

What you can do actually is make sure that all architectures support the
relaxed accessors and change the drivers to use these instead.

There were a couple of reason for starting this thread:
* Check if my understanding of CONFIG_ARM_DMA_MEM_BUFFERABLE's effect on readl/writl were correct and if I was missing anything.
Result: Yes, I was missing some stuff about the config affecting DMA memory mapping, but I did understand the point about having mb()s in readl/writel correctly.

* Bring to attention that the ARM spec doesn't guarantee all IO accesses are ordered wrt each other and have the community update it's understanding of IO access ordering.
Result: It think we agree on this update. At least Russell seems to. The ARM spec is very clear on this. The only part we disagree on is the 1KB limit.

* If my understanding about how CONFIG_ARM_DMA_MEM_BUFFERABLE affects readl/writel was correct and we agree on the IO access ordering update, then it would show that mb()s matter for more than just DMA. In which case, I wanted to split that single config into two and clean up the names while still leaving CONFIG_ARM_DMA_MEM_BUFFERABLE functionally the same.
Result?: After the above discussion, I hope the community agrees that it's reasonable to reorganize and rename the config to make the config more understandable and less confusing to developers. Less confusion == less bugs in my opinion. I will send out a patch sometime this week to show what I mean.

But modifying the behaviour of DMA was not my intention. I will of course read the discussion you mention (will ping you if I can't find it) to get a better understanding of the DMA stuff.

Catalin,

To summarize,
Our internal ARM expert continues to state that the 1KB is the minimum limit even within a single device. Judging his expertise based on my previous interactions with him, I'm inclined to believe him. I think at the least we should give him the benefit of the doubt. Looks like a definitive way to settle this disagreement would be to check with the author of those updates/comment I referred to in the ARMv7 ARM.

If the 1KB limit is correct, it would certainly help improving the correctness of drivers for the ARM kernel. If the 1KB limit is wrong, we don't lose anything.

So, if you don't mind, could you please check with the author of those updates and comments in the ARMv7 ARM?

Thanks,
Saravana

--
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/