RE: [PATCH iwl-net] e1000: Move cancel_work_sync to avoid deadlock

From: Keller, Jacob E
Date: Mon Jun 02 2025 - 17:33:10 EST




> -----Original Message-----
> From: Joe Damato <jdamato@xxxxxxxxxx>
> Sent: Monday, June 2, 2025 1:32 PM
> To: Jakub Kicinski <kuba@xxxxxxxxxx>
> Cc: Stanislav Fomichev <stfomichev@xxxxxxxxx>; netdev@xxxxxxxxxxxxxxx;
> john.cs.hey@xxxxxxxxx; Keller, Jacob E <jacob.e.keller@xxxxxxxxx>;
> syzbot+846bb38dc67fe62cc733@xxxxxxxxxxxxxxxxxxxxxxxxx; Nguyen, Anthony L
> <anthony.l.nguyen@xxxxxxxxx>; Kitszel, Przemyslaw
> <przemyslaw.kitszel@xxxxxxxxx>; Andrew Lunn <andrew+netdev@xxxxxxx>; David
> S. Miller <davem@xxxxxxxxxxxxx>; Eric Dumazet <edumazet@xxxxxxxxxx>;
> Paolo Abeni <pabeni@xxxxxxxxxx>; moderated list:INTEL ETHERNET DRIVERS
> <intel-wired-lan@xxxxxxxxxxxxxxxx>; open list <linux-kernel@xxxxxxxxxxxxxxx>
> Subject: Re: [PATCH iwl-net] e1000: Move cancel_work_sync to avoid deadlock
>
> On Fri, May 30, 2025 at 06:31:40PM -0700, Jakub Kicinski wrote:
> > On Fri, 30 May 2025 12:45:13 -0700 Joe Damato wrote:
> > > > nit: as Jakub mentioned in another thread, it seems more about the
> > > > flush_work waiting for the reset_task to complete rather than
> > > > wq mutexes (which are fake)?
> > >
> > > Hm, I probably misunderstood something. Also, not sure what you
> > > meant by the wq mutexes being fake?
> > >
> > > My understanding (which is prob wrong) from the syzbot and user
> > > report was that the order of wq mutex and rtnl are inverted in the
> > > two paths, which can cause a deadlock if both paths run.
> >
> > Take a look at touch_work_lockdep_map(), theres nosaj thing as wq mutex.
> > It's just a lockdep "annotation" that helps lockdep connect the dots
> > between waiting thread and the work item, not a real mutex. So the
> > commit msg may be better phrased like this (modulo the lines in front):
> >
> > CPU 0:
> > , - RTNL is held
> > / - e1000_close
> > | - e1000_down
> > +- - cancel_work_sync (cancel / wait for e1000_reset_task())
> > |
> > | CPU 1:
> > | - process_one_work
> > \ - e1000_reset_task
> > `- take RTNL
>
> OK, I'll resubmit shortly with the following commit message:
>
> e1000: Move cancel_work_sync to avoid deadlock
>
> Previously, e1000_down called cancel_work_sync for the e1000 reset task
> (via e1000_down_and_stop), which takes RTNL.
>
> As reported by users and syzbot, a deadlock is possible in the following
> scenario:
>
> CPU 0:
> - RTNL is held
> - e1000_close
> - e1000_down
> - cancel_work_sync (cancel / wait for e1000_reset_task())
>
> CPU 1:
> - process_one_work
> - e1000_reset_task
> - take RTNL
>
> To remedy this, avoid calling cancel_work_sync from e1000_down
> (e1000_reset_task does nothing if the device is down anyway). Instead,
> call cancel_work_sync for e1000_reset_task when the device is being
> removed.

Acked-by: Jacob Keller <jacob.e.keller@xxxxxxxxx>