Re: [PATCH 2/3] softirq: avoid spurious stalls due to need_resched()

From: Jakub Kicinski
Date: Tue Jan 31 2023 - 17:32:19 EST


On Thu, 22 Dec 2022 14:12:43 -0800 Jakub Kicinski wrote:
> need_resched() added in commit c10d73671ad3 ("softirq: reduce latencies")
> does improve latency for real workloads (for example memcache).
> Unfortunately it triggers quite often even for non-network-heavy apps
> (~900 times a second on a loaded webserver), and in a small fraction of
> cases whatever the scheduler decided to run will hold onto the CPU
> for the entire time slice.
>
> 10ms+ stalls on a machine which is not actually under overload cause
> erratic network behavior and spurious TCP retransmits. Typical end-to-end
> latency in a datacenter is < 200us so it's common to set TCP timeout
> to 10ms or less.
>
> The intent of the need_resched() is to let a low latency application
> respond quickly and yield (to ksoftirqd). Put a time limit on this dance.
> Ignore the fact that ksoftirqd is RUNNING if we were trying to be nice
> and the application did not yield quickly.
>
> On a webserver loaded at 90% CPU this change reduces the number of 8ms+
> stalls the network softirq processing sees by around 10x (2/sec -> 0.2/sec).
> It also seems to reduce retransmissions by ~10% but the data is quite
> noisy.
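
For anyone reading along without the patch handy, the "time limit on
this dance" idea can be sketched roughly as below. This is a userspace
illustration only, not the actual diff: the struct, both helper names
and the DEFER_LIMIT_US value are hypothetical, and the real code would
work on per-CPU state and jiffies rather than plain microsecond counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical budget for how long softirq work waits on the task. */
#define DEFER_LIMIT_US 2000u

struct softirq_state {
	bool     ksoftirqd_running; /* work was handed to the softirq thread */
	bool     deferred_for_task; /* handoff happened only to be nice to a task */
	uint64_t defer_deadline_us; /* point at which the niceness runs out */
};

/* Record a handoff to the softirq thread and the reason for it. */
static void defer_to_ksoftirqd(struct softirq_state *s, uint64_t now_us,
			       bool resched_pending)
{
	assert(s);
	s->ksoftirqd_running = true;
	s->deferred_for_task = resched_pending;
	s->defer_deadline_us = now_us + DEFER_LIMIT_US;
}

/* Decide whether pending softirqs may be processed inline right now. */
static bool may_run_inline(const struct softirq_state *s, uint64_t now_us)
{
	assert(s);
	if (!s->ksoftirqd_running)
		return true;
	/* Handoff was due to genuine overload: keep deferring. */
	if (!s->deferred_for_task)
		return false;
	/* We were being nice; stop once the task has had its chance. */
	return now_us >= s->defer_deadline_us;
}
```

In this sketch a handoff made because of need_resched() stops blocking
inline processing once the deadline passes, even though the thread is
still marked as running. A handoff made for any other reason keeps
deferring as before.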

Peter, is there a chance you could fold this patch into your ongoing
softirq rework? We can't both work on softirq in parallel, unfortunately,
and this improvement is really key to counterbalancing whatever
heuristics CFS accumulated between 5.12 and 5.19 :(
Not to use the "r-word".

I can spin a version of this on top of your core/softirq branch, would
that work?