Re: [PATCH for-next] Revert "IB/mlx5: Don't return errors from poll_cq"

From: Leon Romanovsky
Date: Fri Mar 04 2022 - 12:24:10 EST


On Fri, Mar 04, 2022 at 10:53:34AM +0000, Haakon Bugge wrote:
>
>
> > On 3 Mar 2022, at 20:09, Leon Romanovsky <leon@xxxxxxxxxx> wrote:
> >
> > On Thu, Mar 03, 2022 at 02:50:17PM +0100, Håkon Bugge wrote:
> >> This reverts commit dbdf7d4e7f911f79ceb08365a756bbf6eecac81c.
> >>
> >> Commit dbdf7d4e7f91 ("IB/mlx5: Don't return errors from poll_cq") is
> >> needed, when driver/fw communication gets wedged.
> >>
> >> With a large fleet of systems equipped with CX-5, we have observed the
> >> following mlx5 error message:
> >>
> >> wait_func:945:(pid xxx): ACCESS_REG(0x805) timeout. Will cause a
> >> leak of a command resource
> >
> > It is arguably a FW issue. Please contact your Nvidia support representative.
>
> The RC for the whacked driver/fw communication has been raised with Nvidia support. This commit is meant to keep the kernel from crashing when this situation arises. And inevitably, it may happen.

I'm confident that the support team will find the best possible solution
to the raised issue.

Thanks

>
>
> Thxs, Håkon