Re: [v3 PATCH net] net: enetc: fix the deadlock of enetc_mdio_lock

From: Jianpeng Chang

Date: Mon Oct 13 2025 - 23:07:46 EST



On 2025/10/10 19:08, Vladimir Oltean wrote:

On Fri, Oct 10, 2025 at 01:51:38PM +0300, Vladimir Oltean wrote:
On Fri, Oct 10, 2025 at 12:31:37PM +0300, Wei Fang wrote:
After applying the workaround for err050089, the LS1028A platform
experiences RCU stalls on RT kernel. This issue is caused by the
recursive acquisition of the read lock enetc_mdio_lock. Here list some
of the call stacks identified under the enetc_poll path that may lead to
a deadlock:

enetc_poll
-> enetc_lock_mdio
-> enetc_clean_rx_ring OR napi_complete_done
   -> napi_gro_receive
      -> enetc_start_xmit
         -> enetc_lock_mdio
         -> enetc_map_tx_buffs
         -> enetc_unlock_mdio
-> enetc_unlock_mdio

After enetc_poll acquires the read lock, a higher-priority writer attempts
to acquire the lock, causing preemption. The writer detects that a
read lock is already held and is scheduled out. However, readers under
enetc_poll cannot acquire the read lock again because a writer is already
waiting, leading to a thread hang.

Currently, the deadlock is avoided by adjusting enetc_lock_mdio to prevent
recursive lock acquisition.

Fixes: 6d36ecdbc441 ("net: enetc: take the MDIO lock only once per NAPI poll cycle")
Signed-off-by: Jianpeng Chang <jianpeng.chang.cn@xxxxxxxxxxxxx>
Acked-by: Wei Fang <wei.fang@xxxxxxx>

Hi Vladimir,

Do you have any comments? This patch causes a performance regression, but
the RCU stalls are more severe.

I'm fine with the change in principle. It's my fault because I didn't
understand how rwlock writer starvation prevention is implemented; I
thought there would be no problem with reentrant readers.

But I wonder if xdp_do_flush() shouldn't also be outside the enetc_lock_mdio()
section. Flushing XDP buffs with XDP_REDIRECT action might lead to
enetc_xdp_xmit() being called, which also takes the lock...
And I think the same concern exists for the xdp_do_redirect() calls.
Most of the time it will be fine, but when the batch fills up it will be
auto-flushed by bq_enqueue():

	if (unlikely(bq->count == DEV_MAP_BULK_SIZE))
		bq_xmit_all(bq, 0);

Hi Vladimir, Wei,

If xdp_do_flush() and xdp_do_redirect() can potentially call enetc_xdp_xmit(),
we should move them outside the enetc_lock_mdio() section.

If there are no further comments, I will repost the patch with fixes for xdp_do_flush and xdp_do_redirect.


Thanks,

Jianpeng