Re: [PATCH net-next 0/5] Expose grace period delay for devlink health reporter

From: Jakub Kicinski
Date: Fri Jul 18 2025 - 20:47:54 EST

Next message: Jakub Kicinski: "Re: [PATCH net-next 4/5] devlink: Make health reporter grace period delay configurable"
Previous message: kernel test robot: "Re: [PATCH] media: b2c2: flexcop-eeprom: Fix assignment in if condition"
In reply to: Tariq Toukan: "[PATCH net-next 5/5] net/mlx5e: Set default grace period delay for TX and RX reporters"

On Thu, 17 Jul 2025 19:07:17 +0300 Tariq Toukan wrote:
> Currently, the devlink health reporter initiates the grace period
> immediately after recovering an error, which blocks further recovery
> attempts until the grace period concludes. Since additional errors
> are not generally expected during this short interval, any new error
> reported during the grace period is not only rejected but also causes
> the reporter to enter an error state that requires manual intervention.
>
> This approach poses a problem in scenarios where a single root cause
> triggers multiple related errors in quick succession - for example,
> a PCI issue affecting multiple hardware queues. Because these errors
> are closely related and occur rapidly, it is more effective to handle
> them together rather than handling only the first one reported and
> blocking any subsequent recovery attempts. Furthermore, setting the
> reporter to an error state in this context can be misleading, as these
> multiple errors are manifestations of a single underlying issue, making
> it unlike the general case where additional errors are not expected
> during the grace period.
>
> To resolve this, introduce a configurable grace period delay attribute
> to the devlink health reporter. This delay starts when the first error
> is recovered and lasts for a user-defined duration. Once this grace
> period delay expires, the actual grace period begins. After the grace
> period ends, a new reported error will start the same flow again.
>
> Timeline summary:
>
> ----|--------|------------------------------/----------------------/--
> error is  error is  grace period delay            grace period
> reported  recovered (recoveries allowed)      (recoveries blocked)
>
> With grace period delay, create a time window during which recovery
> attempts are permitted, allowing all reported errors to be handled
> sequentially before the grace period starts. Once the grace period
> begins, it prevents any further error recoveries until it ends.
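
For readers following along, the quoted timeline can be sketched as a small
state check. This is only an illustration under stated assumptions: every
name below (reporter_window, reporter_phase_at, reporter_try_recover) is
hypothetical and not taken from the patches, timestamps are plain
milliseconds from a monotonic clock, and the delay window is measured from
the first recovery as the cover letter describes.

```c
#include <stdbool.h>
#include <stdint.h>

/* All names are hypothetical; this is not the devlink implementation. */
enum reporter_phase {
	PHASE_IDLE,	/* no window active: the next error starts a new flow */
	PHASE_DELAY,	/* grace period delay: recoveries allowed */
	PHASE_GRACE,	/* grace period: recoveries blocked */
};

struct reporter_window {
	uint64_t delay_ms;	/* length of the grace period delay */
	uint64_t grace_ms;	/* length of the grace period proper */
	bool started;		/* true once a first error was recovered */
	uint64_t start_ms;	/* time the first error was recovered */
};

/* Classify now_ms against the timeline; assumes now_ms never goes backwards. */
static enum reporter_phase
reporter_phase_at(const struct reporter_window *w, uint64_t now_ms)
{
	uint64_t elapsed;

	if (!w->started)
		return PHASE_IDLE;
	elapsed = now_ms - w->start_ms;
	if (elapsed < w->delay_ms)
		return PHASE_DELAY;
	if (elapsed < w->delay_ms + w->grace_ms)
		return PHASE_GRACE;
	return PHASE_IDLE;
}

/* Returns true if recovering an error at now_ms is permitted. */
static bool reporter_try_recover(struct reporter_window *w, uint64_t now_ms)
{
	switch (reporter_phase_at(w, now_ms)) {
	case PHASE_GRACE:
		return false;		/* blocked until the grace period ends */
	case PHASE_IDLE:
		w->started = true;	/* first recovery opens the delay window */
		w->start_ms = now_ms;
		return true;
	case PHASE_DELAY:
		return true;		/* still inside the delay window */
	}
	return false;
}
```

With delay_ms = 100 and grace_ms = 1000, recoveries at t = 0, 50 and 99 are
permitted, anything from t = 100 up to t = 1099 is refused, and an error at
t = 1100 starts the flow again.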

We are rate limiting recoveries, so the "networking solution" to the
problem you're describing would be to introduce a burst size: some
kind of poor man's token bucket filter.
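
To illustrate the burst-size idea, here is a rough sketch of such a token
bucket. It is not a proposal for the actual devlink code: the names
(recover_bucket, recover_bucket_init, recover_bucket_take) and the
millisecond timestamps are assumptions made for the example. Up to "burst"
recoveries may run back to back, and one more is earned every refill_ms.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; nothing here is existing devlink API. */
struct recover_bucket {
	uint32_t burst;			/* max recoveries allowed back to back */
	uint64_t refill_ms;		/* time needed to earn back one token */
	uint32_t tokens;		/* recoveries currently available */
	uint64_t last_refill_ms;	/* time up to which refills are accounted */
};

static void recover_bucket_init(struct recover_bucket *b, uint32_t burst,
				uint64_t refill_ms, uint64_t now_ms)
{
	assert(burst > 0 && refill_ms > 0);
	b->burst = burst;
	b->refill_ms = refill_ms;
	b->tokens = burst;		/* start with a full bucket */
	b->last_refill_ms = now_ms;
}

/* Credit tokens earned since the last accounting; assumes a monotonic clock. */
static void recover_bucket_refill(struct recover_bucket *b, uint64_t now_ms)
{
	uint64_t earned = (now_ms - b->last_refill_ms) / b->refill_ms;

	if (earned >= b->burst - b->tokens) {
		/* Bucket is (or becomes) full: surplus credit is discarded. */
		b->tokens = b->burst;
		b->last_refill_ms = now_ms;
	} else {
		/* Keep the fractional remainder for the next refill. */
		b->tokens += (uint32_t)earned;
		b->last_refill_ms += earned * b->refill_ms;
	}
}

/* Returns true if a recovery may run at now_ms, consuming one token. */
static bool recover_bucket_take(struct recover_bucket *b, uint64_t now_ms)
{
	recover_bucket_refill(b, now_ms);
	if (b->tokens == 0)
		return false;
	b->tokens--;
	return true;
}
```

With burst = 3 and refill_ms = 1000, errors at t = 0, 1 and 2 are all
recovered, a fourth at t = 3 is refused, and one more recovery becomes
available at t = 1000. A burst of 1 degenerates to a plain rate limit of
one recovery per refill_ms.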

Could you say more about what designs were considered and why this
one was chosen?
