Re: [Intel-wired-lan] [PATCH v2] igb: Fix watchdog_task race with shutdown

From: Ian Ray
Date: Fri Jun 27 2025 - 09:29:20 EST


On Mon, Jun 16, 2025 at 02:47:29PM -0700, Jacob Keller wrote:
> On 6/10/2025 5:44 AM, Ian Ray wrote:
> > On Mon, Jun 09, 2025 at 04:10:39PM -0700, Jakub Kicinski wrote:
:
> > IIUC set_bit() is an atomic operation (via bitops.h), and so
> > my previous comment still stands.
> >
> > (Sorry if I have misunderstood your question.)
> >
> > Either watchdog_task runs just before __IGB_DOWN is set (and
> > the timer is stopped by this patch) -- or watchdog_task runs
> > just after __IGB_DOWN is set (and thus the timer will not be
> > restarted).
> >
> > In both cases, the final cancel_work_sync ensures that the
> > watchdog_task completes before igb_down() continues.
> >
> > Regards,
> > Ian
>
> Hmm. Well set_bit is atomic, but I don't think it has ordering
> guarantees on its own. Wouldn't we need a barrier here to
> guarantee ordering?
>
> Perhaps cancel_work_sync has barriers implied and that makes this work
> properly?

Ah, I see. I checked the cancel_work_sync() documentation and
implementation, and I am not sure we can make any assumptions about
barriers.

Would two additional calls to smp_mb__after_atomic() be acceptable?
Something like this (on top of v2 of this series):

-- >8 --
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index a65ae7925ae8..9b63dc594454 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2179,6 +2179,7 @@ void igb_down(struct igb_adapter *adapter)
 	 * disable watchdog from being rescheduled.
 	 */
 	set_bit(__IGB_DOWN, &adapter->state);
+	smp_mb__after_atomic();
 	timer_delete_sync(&adapter->watchdog_timer);
 	timer_delete_sync(&adapter->phy_info_timer);

@@ -3886,6 +3887,7 @@ static void igb_remove(struct pci_dev *pdev)
 	 * disable watchdog from being rescheduled.
 	 */
 	set_bit(__IGB_DOWN, &adapter->state);
+	smp_mb__after_atomic();
 	timer_delete_sync(&adapter->watchdog_timer);
 	timer_delete_sync(&adapter->phy_info_timer);
-- >8 --
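
To make the intended sequence easier to discuss, here is a rough
userspace model in C11 atomics. It is only a sketch: model_adapter,
MODEL_DOWN_BIT and the model_*() helpers are hypothetical names, not
driver code. A relaxed fetch-or stands in for set_bit(), a seq_cst
fence stands in for smp_mb__after_atomic(), and a plain flag stands in
for the timer. It shows the order of operations only; it does not
prove the race is closed, and the reader side may well need a pairing
barrier of its own.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; these are not the igb driver's types. */
#define MODEL_DOWN_BIT 0x1u

struct model_adapter {
	atomic_uint state;       /* models adapter->state */
	atomic_bool timer_armed; /* models a pending watchdog_timer */
};

static void model_init(struct model_adapter *a)
{
	atomic_init(&a->state, 0u);
	atomic_init(&a->timer_armed, false);
}

/* Models set_bit() + smp_mb__after_atomic() + timer_delete_sync(). */
static void model_down(struct model_adapter *a)
{
	/* Relaxed RMW with the result ignored: atomic but unordered. */
	atomic_fetch_or_explicit(&a->state, MODEL_DOWN_BIT,
				 memory_order_relaxed);
	/* Full fence: assumed rough analog of smp_mb__after_atomic(). */
	atomic_thread_fence(memory_order_seq_cst);
	/* Stand-in for timer_delete_sync(): the timer is no longer pending. */
	atomic_store_explicit(&a->timer_armed, false, memory_order_relaxed);
}

/*
 * Models the tail of watchdog_task: re-arm the timer only if the
 * adapter is not going down. Returns true if the timer was re-armed.
 */
static bool model_watchdog_rearm(struct model_adapter *a)
{
	if (atomic_load_explicit(&a->state, memory_order_relaxed) &
	    MODEL_DOWN_BIT)
		return false;
	atomic_store_explicit(&a->timer_armed, true, memory_order_relaxed);
	return true;
}

/* Single-threaded walk through the two cases discussed in the thread. */
static void model_sequence_check(void)
{
	struct model_adapter a;

	model_init(&a);
	assert(model_watchdog_rearm(&a));   /* runs before the down bit */
	model_down(&a);                     /* flag, fence, stop timer */
	assert(!model_watchdog_rearm(&a));  /* runs after the down bit */
	assert(!atomic_load(&a.timer_armed));
}
```

The single-threaded check only exercises the "before" and "after"
cases; the interesting interleavings would need a real concurrent
test or a memory-model tool.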

Thanks,
Ian

>
> > ORDERING
> > --------
> >
> > Like with atomic_t, the rule of thumb is:
> >
> > - non-RMW operations are unordered;
> >
> > - RMW operations that have no return value are unordered;
> >
> > - RMW operations that have a return value are fully ordered.
> >
> > - RMW operations that are conditional are fully ordered.
> >
> > Except for a successful test_and_set_bit_lock() which has ACQUIRE semantics,
> > clear_bit_unlock() which has RELEASE semantics and test_bit_acquire which has
> > ACQUIRE semantics.
> >
>
> set_bit is listed as a RMW without a return value, so it's unordered.
> That makes me think we'd want clear_bit_unlock() if the cancel_work_sync
> itself doesn't provide the barriers we need.
>
> Thanks,
> Jake