Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message

From: Haakon Bugge

Date: Wed Oct 15 2025 - 07:38:25 EST


Hi Jason and Jake,

> On 13 Oct 2025, at 16:04, Haakon Bugge <haakon.bugge@xxxxxxxxxx> wrote:

[snip]

> My take is that the VF in question here gets whacked and that the MAD timeout handling does not resonate well with how CMA handles them.

I was able to simulate a whacked VF by setting the CMA max retries to one and, once in a while, skipping the sending of the MAD:
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 9b471548e7ae1..43eff54151830 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -45,7 +45,7 @@ MODULE_DESCRIPTION("Generic RDMA CM Agent");
 MODULE_LICENSE("Dual BSD/GPL");
 #define CMA_CM_RESPONSE_TIMEOUT 20
-#define CMA_MAX_CM_RETRIES 15
+#define CMA_MAX_CM_RETRIES 1
 #define CMA_IBOE_PACKET_LIFETIME 16
 #define CMA_PREFERRED_ROCE_GID_TYPE IB_GID_TYPE_ROCE_UDP_ENCAP
diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 21c8669dd1371..9c19333a507d8 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -1057,9 +1057,13 @@ int ib_send_mad(struct ib_mad_send_wr_private *mad_send_wr)
 	spin_lock_irqsave(&qp_info->send_queue.lock, flags);
 	if (qp_info->send_queue.count < qp_info->send_queue.max_active) {
-		trace_ib_mad_ib_send_mad(mad_send_wr, qp_info);
-		ret = ib_post_send(mad_agent->qp, &mad_send_wr->send_wr.wr,
-				   NULL);
+		if (!(jiffies % 10000)) {
+			pr_err("Skipping ib_post_send\n");
+		} else {
+			trace_ib_mad_ib_send_mad(mad_send_wr, qp_info);
+			ret = ib_post_send(mad_agent->qp, &mad_send_wr->send_wr.wr,
+					   NULL);
+		}
 		list = &qp_info->send_queue.list;
 	} else {
 		ret = 0;

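Note that the skip condition in the hack is time-based, not per-MAD: every send attempted during a jiffy that is a multiple of 10000 is dropped. A minimal userspace sketch of that behaviour follows. It is not kernel code; `skip_send()`, `count_skipped()` and the simulated tick counter standing in for jiffies are hypothetical names used only for illustration.

```c
#include <stdbool.h>

/*
 * Userspace stand-in for the predicate used in the hack: true when the
 * simulated jiffies value is a multiple of 10000.
 */
static bool skip_send(unsigned long tick)
{
	return !(tick % 10000);
}

/*
 * Hypothetical helper: count how many sends would be skipped if
 * sends_per_tick MADs were posted during every tick in [first, last].
 */
static unsigned long count_skipped(unsigned long first, unsigned long last,
				   unsigned long sends_per_tick)
{
	unsigned long tick, skipped = 0;

	for (tick = first; tick <= last; tick++)
		if (skip_send(tick))
			skipped += sends_per_tick;

	return skipped;
}
```

For example, `count_skipped(1, 100000, 3)` covers ten ticks that are multiples of 10000 and therefore counts 30 skipped sends. Assuming HZ=1000 (HZ is configuration dependent), that corresponds to a 1 ms window every 10 seconds during which all sends are skipped.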

With this hack, running cmtime with 10,000 connections in loopback does indeed produce the "cm_destroy_id_wait_timeout: cm_id=000000007ce44ace timed out. state 6 -> 0, refcnt=1" messages. I had to kill cmtime because it was hanging, after which it became defunct with the following stack:

# cat /proc/7977/task/7978/stack
[<0>] cm_destroy_id+0x23a/0x680 [ib_cm]
[<0>] _destroy_id+0xcf/0x330 [rdma_cm]
[<0>] ucma_destroy_private_ctx+0x379/0x390 [rdma_ucm]
[<0>] ucma_close+0x78/0xb0 [rdma_ucm]
[<0>] __fput+0xe3/0x2a0
[<0>] task_work_run+0x5c/0x90
[<0>] do_exit+0x1e3/0x447
[<0>] do_group_exit+0x30/0x80
[<0>] get_signal+0x88d/0x88d
[<0>] arch_do_signal_or_restart+0x34/0x110
[<0>] exit_to_user_mode_loop+0x4a/0x160
[<0>] do_syscall_64+0x1b8/0x940
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e



Thxs, Håkon