[5.5] new issue with RoCE

From: Ian Kumlien
Date: Mon Jan 27 2020 - 16:38:42 EST


Hi, since updating to 5.5 I've hit a new issue while testing mlx5 cards from work =)

On the client, multiple of these:
[ 1546.585378] ------------[ cut here ]------------
[ 1546.585386] WARNING: CPU: 3 PID: 4576 at drivers/iommu/dma-iommu.c:471 __iommu_dma_unmap+0x10a/0x120
[ 1546.585386] Modules linked in: rpcsec_gss_krb5 nfsv4 nfs mlx5_ib amdgpu chaoskey mfd_core gpu_sched ttm mlx5_core sp5100_tco ccp rpcrdma ib_ipoib
[ 1546.585394] CPU: 3 PID: 4576 Comm: kworker/3:1H Not tainted 5.5.0 #248
[ 1546.585395] Hardware name: System manufacturer System Product Name/Pro WS X570-ACE, BIOS 1201 11/18/2019
[ 1546.585398] Workqueue: ib-comp-wq ib_cq_poll_work
[ 1546.585401] RIP: 0010:__iommu_dma_unmap+0x10a/0x120
[ 1546.585402] Code: c0 74 0b 48 89 e6 4c 89 f7 e8 e2 b2 9b 00 48 c7 44 24 08 00 00 00 00 48 c7 44 24 10 00 00 00 00 48 c7 04 24 ff ff ff ff eb 90 <0f> 0b eb 82 e8 cd 3f 93 ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90
[ 1546.585403] RSP: 0018:ffffa1e188c97da8 EFLAGS: 00010206
[ 1546.585405] RAX: 0000000000004000 RBX: 0000000000003000 RCX: 0000000000000001
[ 1546.585405] RDX: 0000000000000002 RSI: ffffffffffffe000 RDI: ffffa1e188c97d18
[ 1546.585406] RBP: ffffffff00000000 R08: ffff8a0d8794e010 R09: 0000000000002000
[ 1546.585407] R10: 0000000000000001 R11: 000ffffffffff000 R12: 0000000000003000
[ 1546.585407] R13: ffff8a0ff3ce6000 R14: ffff8a0ff98a7620 R15: ffff8a0dcd11a000
[ 1546.585409] FS: 0000000000000000(0000) GS:ffff8a0ffeac0000(0000) knlGS:0000000000000000
[ 1546.585410] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1546.585410] CR2: 00007f7473740000 CR3: 000000074371a000 CR4: 0000000000340ee0
[ 1546.585411] Call Trace:
[ 1546.585426] rpcrdma_mr_put+0x11b/0x120 [rpcrdma]
[ 1546.585429] __ib_process_cq+0x76/0xd0
[ 1546.585430] ib_cq_poll_work+0x34/0xc0
[ 1546.585433] process_one_work+0x1e2/0x3c0
[ 1546.585436] worker_thread+0x4a/0x3d0
[ 1546.585438] kthread+0xfb/0x130
[ 1546.585440] ? process_one_work+0x3c0/0x3c0
[ 1546.585441] ? kthread_park+0x90/0x90
[ 1546.585443] ret_from_fork+0x22/0x40
[ 1546.585446] ---[ end trace edc64661ebf52144 ]---

On the server, multiple of these:
[ 1141.838449] infiniband mlx5_1: dump_cqe:270:(pid 715): dump error cqe
[ 1141.838463] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 1141.838469] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 1141.838475] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 1141.838481] 00000030: 00 00 00 00 00 00 88 13 08 00 09 57 2c 66 e0 d2
(the pid increases with each dump, and the last bytes seem to increase as well)
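
For anyone who wants to pick the dump apart, here is a minimal sketch that extracts the trailing fields of the 64-byte error CQE. It assumes the struct mlx5_err_cqe layout from include/linux/mlx5/device.h (vendor_err_synd at byte 54, then syndrome, s_wqe_opcode_qpn, wqe_counter, signature, op_own). The offsets are written from memory and should be checked against the header, and the helper names parse_dump and decode_err_cqe are made up for illustration.

```python
import re
import struct

# The four hex lines of the server-side dump above, copied verbatim.
DUMP = """\
[ 1141.838463] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 1141.838469] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 1141.838475] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 1141.838481] 00000030: 00 00 00 00 00 00 88 13 08 00 09 57 2c 66 e0 d2
"""


def parse_dump(text):
    """Collect the hex bytes of every 'OFFSET: xx xx ...' line into bytes."""
    data = bytearray()
    for line in text.splitlines():
        m = re.search(r"\b[0-9a-f]{8}:((?: [0-9a-f]{2})+)", line)
        if m:
            data += bytes.fromhex(m.group(1).replace(" ", ""))
    return bytes(data)


def decode_err_cqe(cqe):
    """Decode the last 10 bytes of a 64-byte CQE.

    ASSUMPTION: layout follows struct mlx5_err_cqe (big-endian fields
    starting at byte offset 54); verify against the kernel header.
    """
    if len(cqe) != 64:
        raise ValueError("expected a 64-byte CQE, got %d bytes" % len(cqe))
    (vendor_err_synd, syndrome, s_wqe_opcode_qpn,
     wqe_counter, signature, op_own) = struct.unpack_from(">BBIHBB", cqe, 54)
    return {
        "vendor_err_synd": vendor_err_synd,
        "syndrome": syndrome,
        "wqe_opcode": s_wqe_opcode_qpn >> 24,   # top byte
        "qpn": s_wqe_opcode_qpn & 0xFFFFFF,     # low 24 bits
        "wqe_counter": wqe_counter,
        "signature": signature,
        "cqe_opcode": op_own >> 4,              # high nibble of op_own
        "owner": op_own & 1,
    }


fields = decode_err_cqe(parse_dump(DUMP))
for name, value in fields.items():
    print("%-16s 0x%x" % (name, value))
```

If that layout holds, the bytes that change from dump to dump (2c 66 e0 d2 here) would be wqe_counter, signature and op_own, which fits a counter that advances with each failed work request.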

No errors visible from a user perspective... yet, though.