Re: [PATCH v2] RDMA/cma: Make CM response timeout and # CM retries configurable

From: Doug Ledford
Date: Thu Jun 13 2019 - 12:37:16 EST


On Thu, 2019-06-13 at 08:28 -0700, Bart Van Assche wrote:
> On 6/13/19 7:25 AM, Doug Ledford wrote:
> > So, to revive this patch, what I'd like to see is some attempt to
> > actually quantify a reasonable timeout for the default backlog
> > depth,
> > then the patch should actually change the default to that
> > reasonable
> > timeout, and then put in the ability to adjust the timeout with
> > some
> > sort of doc guidance on how to calculate a reasonable timeout based
> > on
> > configured backlog depth.
>
> How about following the approach of the SRP initiator driver? It
> derives
> the CM timeout from the subnet manager timeout. The assumption
> behind
> this is that in large networks the subnet manager timeout has to be
> set
> higher than its default to make communication work. See also
> srp_get_subnet_timeout().

Theoretically, the subnet manager needs a longer timeout in a bigger
network because it's handling more data as a single point of lookup for
the entire subnet. Individual machines, on the other hand, have the
same backlog size (by default) regardless of the size of the network,
and there is no guarantee that if the admin increased the subnet
manager timeout, that they also increased the backlog queue depth size.
So, while I like things that auto-tune like you are suggesting, the
problem is that the one item does not directly correlate with the
other.

--
Doug Ledford <dledford@xxxxxxxxxx>
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57
2FDD

Attachment: signature.asc
Description: This is a digitally signed message part