Re: [PATCH v3 net 1/4] pds_core: Prevent possible adminq overflow/stuck condition
From: Jacob Keller
Date: Wed Apr 16 2025 - 19:35:23 EST

On 4/16/2025 1:49 PM, Nelson, Shannon wrote:
> On 4/16/2025 1:13 PM, Jacob Keller wrote:
>>
>> On 4/15/2025 4:29 PM, Shannon Nelson wrote:
>>> From: Brett Creeley <brett.creeley@xxxxxxx>
>>>
>>> The pds_core's adminq is protected by the adminq_lock, which prevents
>>> more than one command from being posted onto it at any one time. This
>>> means the client drivers cannot simultaneously post adminq commands.
>>> However, the completions happen in a different context, which means
>>> multiple adminq commands can be posted sequentially and all be waiting
>>> on completion.
>>>
>>> On the FW side, the backing adminq request queue is only 16 entries
>>> long and the retry mechanism and/or overflow/stuck prevention is
>>> lacking. This can cause the adminq to get stuck, so commands are no
>>> longer processed and completions are no longer sent by the FW.
>>>
>>> As an initial fix, prevent more than 16 outstanding adminq commands so
>>> there's no way to cause the adminq to get stuck. This works
>>> because the backing adminq request queue will never have more than 16
>>> pending adminq commands, so it will never overflow. This is done by
>>> reducing the adminq depth to 16.
>>>
>>
>> What happens if a client driver tries to enqueue a request when the
>> adminq is full? Does it just block until there is space, presumably
>> holding the adminq_lock the entire time to prevent someone else from
>> inserting?
>
> Right now we will return -ENOSPC and it is up to the client to decide
> whether or not it wants to do a retry.
>
> We have another patch that has pdsc_adminq_post() doing a limited retry
> loop which was part of the original posting [1], but Kuba suggested
> using a semaphore instead. That sent us down a redesign branch that we
> haven't been able to spend time on. We'd like to have kept the retry
> loop patch until then to at least mitigate the situation, but the
> discussion got dropped.

Sure. This fix makes sense in that context.

Reviewed-by: Jacob Keller <jacob.e.keller@xxxxxxxxx>

>
> sln
>
> [1]
> https://lore.kernel.org/netdev/20250129004337.36898-3-shannon.nelson@xxxxxxx/
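
A minimal user-space sketch of the behaviour discussed above, assuming a
simple counter of outstanding commands: a post fails with -ENOSPC once 16
commands are pending, a completion frees a slot, and the caller may do a
bounded retry. All names here (sketch_adminq, sketch_post, sketch_complete,
sketch_post_retry, POST_RETRIES) are hypothetical and are not the pds_core
code; locking (the adminq_lock) is left out for brevity.

```c
#include <errno.h>
#include <stddef.h>

#define ADMINQ_DEPTH 16 /* matches the 16-entry FW request queue */
#define POST_RETRIES 3  /* hypothetical retry budget, not from the patch */

/* Hypothetical stand-in for the adminq; not the pds_core structure. */
struct sketch_adminq {
	unsigned int outstanding; /* posted but not yet completed */
};

/* Post one command; refuse with -ENOSPC rather than exceed the depth. */
static int sketch_post(struct sketch_adminq *q)
{
	if (q->outstanding >= ADMINQ_DEPTH)
		return -ENOSPC;
	q->outstanding++;
	return 0;
}

/* A completion frees one slot. */
static void sketch_complete(struct sketch_adminq *q)
{
	if (q->outstanding > 0)
		q->outstanding--;
}

/*
 * Client-side bounded retry: attempt the post up to POST_RETRIES times,
 * calling wait_fn (if given) between attempts so a slot can free up.
 * Returns 0 on success or the last error, typically -ENOSPC.
 */
static int sketch_post_retry(struct sketch_adminq *q,
			     void (*wait_fn)(struct sketch_adminq *))
{
	int err = -ENOSPC;

	for (int i = 0; i < POST_RETRIES; i++) {
		err = sketch_post(q);
		if (err != -ENOSPC)
			return err;
		if (wait_fn)
			wait_fn(q);
	}
	return err;
}
```

With the queue full, sketch_post_retry(&q, NULL) gives up with -ENOSPC after
the retry budget, while passing sketch_complete as wait_fn lets the second
attempt succeed. A semaphore-based design, as suggested in the earlier
review, would replace the retry loop with a blocking wait for a free slot.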