Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

From: Tom Talpey
Date: Fri Apr 09 2021 - 13:49:30 EST


On 4/9/2021 12:27 PM, Haakon Bugge wrote:


On 9 Apr 2021, at 17:32, Tom Talpey <tom@xxxxxxxxxx> wrote:

On 4/9/2021 10:45 AM, Chuck Lever III wrote:
On Apr 9, 2021, at 10:26 AM, Tom Talpey <tom@xxxxxxxxxx> wrote:

On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
We need to get a better idea what correctness testing has been done,
and whether positive correctness testing results can be replicated
on a variety of platforms.
RO has been rolling out slowly on mlx5 over a few years and storage
ULPs are the last to change. E.g., the mlx5 Ethernet driver has had RO
turned on for a long time, userspace HPC applications have been using
it for a while now too.

I'd love to see RO be used more, it was always something the RDMA
specs supported and carefully architected for. My only concern is
that it's difficult to get right, especially when the platforms
have been running strictly-ordered for so long. The ULPs need
testing, and a lot of it.

We know there are platforms with broken RO implementations (like
Haswell) but the kernel is supposed to globally turn off RO on all
those cases. I'd be a bit surprised if we discover any more from this
series.
On the other hand there are platforms that get huge speed ups from
turning this on, AMD is one example, there are a bunch in the ARM
world too.

My belief is that the biggest risk is from situations where completions
are batched, and therefore polling is used to detect them without
interrupts (which explicitly flush the pipeline). The RO pipeline will completely reorder
DMA writes, and consumers which infer ordering from memory contents may
break. This can even apply within the provider code, which may attempt
to poll WR and CQ structures, and be tripped up.
You are referring specifically to RPC/RDMA depending on Receive
completions to guarantee that previous RDMA Writes have been
retired? Or is there a particular implementation practice in
the Linux RPC/RDMA code that worries you?

Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
is hopefully unfounded, is that the RO pipeline might not have flushed
when a completion is posted *after* posting an interrupt.

Something like this...

RDMA Write arrives
PCIe RO Write for data
PCIe RO Write for data
...
RDMA Write arrives
PCIe RO Write for data
...
RDMA Send arrives
PCIe RO Write for receive data
PCIe RO Write for receive descriptor

Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then it will ensure that previously written RO data has global visibility when the CQE can be observed.

I wasn't aware that a strongly-ordered PCIe Write will ensure that
prior relaxed-ordered writes went first. If that's the case, I'm
fine with it - as long as the providers are correctly coded!!

PCIe interrupt (flushes RO pipeline for all three ops above)

Before the interrupt, the HCA will write the EQE (Event Queue Entry). This has to be a Strongly Ordered write to "push" prior written CQEs so that when the EQE is observed, the prior writes of CQEs have global visibility.

And the MSI-X write likewise, to avoid spurious interrupts.

Ok, and yes agreed the same principle would apply.

Is there any implication if a PCIe switch were present on the
motherboard? The switch is allowed to do some creative routing
if the operation is relaxed, correct?

Tom.

Thxs, Håkon


RPC/RDMA polls CQ
Reaps receive completion

RDMA Send arrives
PCIe RO Write for receive data
PCIe RO write for receive descriptor
Does *not* interrupt, since CQ not armed

RPC/RDMA continues to poll CQ
Reaps receive completion
PCIe RO writes not yet flushed
Processes incomplete in-memory data
Bzzzt

Hopefully, the adapter performs a PCIe flushing read, or something
to avoid this when an interrupt is not generated. Alternatively, I'm
overly paranoid.
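
To make the polled case concrete, here is a minimal sketch of the ordering rule discussed in this thread, written as a toy model rather than anything taken from a provider. It assumes the simplified PCIe rule that a posted write with the RO attribute set may pass earlier posted writes, while a posted write without it may not pass any earlier posted write. The names legal_orders and cqe_can_precede_data are hypothetical, and the model ignores switches, multiple traffic classes and CPU-side caching.

```python
from itertools import permutations

def legal_orders(writes):
    """Yield every visibility order allowed by the simplified rule.

    writes: list of (name, relaxed) tuples in the order the device issued them.
    A strongly ordered write (relaxed=False) must become visible after every
    write issued before it; a relaxed write is free to pass earlier writes.
    """
    n = len(writes)
    for perm in permutations(range(n)):
        pos = {idx: p for p, idx in enumerate(perm)}
        ok = all(
            pos[earlier] < pos[i]
            for i in range(n) if not writes[i][1]
            for earlier in range(i)
        )
        if ok:
            yield tuple(writes[i][0] for i in perm)

def cqe_can_precede_data(writes, cqe="cqe"):
    """True if some legal order exposes the CQE before a data write."""
    for order in legal_orders(writes):
        after_cqe = order[order.index(cqe) + 1:]
        if any(name.startswith("data") for name in after_cqe):
            return True
    return False

# Receive data written relaxed, CQE written strongly ordered.
strict_cqe = [("data0", True), ("data1", True), ("cqe", False)]
# Hypothetical broken provider: the CQE itself is written relaxed.
relaxed_cqe = [("data0", True), ("data1", True), ("cqe", True)]

print(cqe_can_precede_data(strict_cqe))   # False
print(cqe_can_precede_data(relaxed_cqe))  # True
```

Under this model a poller that reaps a strongly ordered CQE can never see it ahead of the data, with or without an interrupt. The "Bzzzt" case only appears if the CQE write itself is relaxed, which is the provider bug to test for.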

Tom.

The Mellanox adapter, itself, historically has strict in-order DMA
semantics, and while it's great to relax that, changing it by default
for all consumers is something to consider very cautiously.

Still, obviously people should test on the platforms they have.

Yes, and "test" should be taken seriously, with a focus on ULP data integrity.
Speedups will mean nothing if the data is ever damaged.
I agree that data integrity comes first.
Since I currently don't have facilities to test RO in my lab, the
community will have to agree on a set of tests and expected results
that specifically exercise the corner cases you are concerned about.
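
As one possible starting point for such tests, here is a rough sketch of a payload check a test harness might use. Nothing here comes from an existing test suite: make_pattern and first_mismatch are hypothetical helpers, and the choice of SHAKE-128 as the pattern generator is only an assumption made to get a deterministic, non-repeating payload per transfer.

```python
import hashlib

def make_pattern(seq, length):
    """Return a deterministic payload of `length` bytes for transfer `seq`.

    Keying the pattern on the sequence number means stale data left over
    from an earlier transfer will not match the expected bytes.
    """
    return hashlib.shake_128(seq.to_bytes(8, "little")).digest(length)

def first_mismatch(seq, received):
    """Return the offset of the first wrong byte, or None if all match."""
    expected = make_pattern(seq, len(received))
    for offset, (want, got) in enumerate(zip(expected, received)):
        if want != got:
            return offset
    return None

# Usage: the sender transmits make_pattern(seq, n); the receiver checks the
# buffer as soon as the completion is reaped by polling.
payload = make_pattern(7, 4096)
print(first_mismatch(7, payload))  # None
```

The receiver would need to overwrite the buffer with a known poison value before posting it, so that a region not yet written when the completion is polled shows up as a mismatch at a specific offset rather than passing by accident.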
--
Chuck Lever