On Thu, 24 Jul 2025 13:46:08 +0300 Tariq Toukan wrote:
Design alternatives considered:
1. Recover all queues upon any error:
A brute-force approach that recovers all queues on any error.
While simple, it is overly aggressive and disrupts unaffected queues
unnecessarily. Also, because this is handled entirely within the
driver, it leads to a driver-specific implementation rather than a
generic one.
2. Per-queue reporter:
This design would isolate recovery handling per SQ or RQ, effectively
removing interdependencies between queues. While conceptually clean,
it introduces significant scalability challenges as the number of
queues grows, as well as synchronization challenges across multiple
reporters.
3. Error aggregation with delayed handling:
Errors arriving during the grace period are saved and processed after
it ends. While addressing the issue of related errors whose recovery
is aborted as grace period started, this adds complexity due to
synchronization needs and contradicts the assumption that no errors
should occur during a healthy system’s grace period. Also, this
breaks the important role of grace period in preventing an infinite
loop of immediate error detection following recovery. In such cases
we want to stop.
4. Allowing a fixed burst of errors before starting grace period:
Allows a set number of recoveries before the grace period begins.
However, it also requires limiting the error reporting window.
To keep the design simple, the burst threshold becomes redundant.
We're talking about burst on order of 100s, right?
The implementation
is quite simple, store an array the size of burst in which you can
save recovery timestamps (in a circular fashion). On error, count
how many entries are in the past N msecs.
It's a clear generalization of current scheme which can be thought of
as having an array of size 1 (only one most recent recovery time is
saved).
The grace period delay design was chosen for its simplicity and
precision in addressing the problem at hand. It effectively captures
the temporal correlation of related errors and aligns with the original
intent of the grace period as a stabilization window where further
errors are unexpected, and if they do occur, they indicate an abnormal
system state.
Admittedly part of what I find extremely confusing when thinking about
this API is that the period when recovery is **not** allowed is called
"grace period".
Now we add something called "grace period delay" inIt can be named after what it achieves (allows handling of more errors) or what it is (a shift of the grace_period). I'm fine with both, don't have strong preference.
some places in the code referred to as "reporter_delay"..
It may be more palatable if we named the first period "error burst
period" and, well, the later I suppose it's too late to rename..