Re: commit 16ecba59 breaks 82574L under heavy load.

From: Lennart Sorensen
Date: Fri Jul 21 2017 - 12:09:44 EST


On Fri, Jul 21, 2017 at 11:27:09AM -0400, wrote:
> On Thu, Jul 20, 2017 at 04:44:55PM -0700, Benjamin Poirier wrote:
> > Could you please test the following patch and let me know if it:
> > 1) reduces the interrupt rate of the Other msi-x vector
> > 2) avoids the link flaps
> > or
> > 3) logs some dmesg warnings of the form "Other interrupt with unhandled [...]"
> > In this case, please paste icr values printed.
>
> I will give it a try.

So the test looks excellent. It now seems to get interrupts only when
the link state actually changes.

> Another odd behaviour I see is that the driver will hang in
> napi_synchronize on shutdown if there is traffic at the time (at least
> I think that's the trigger, maybe the trigger is if there has been an
> overload of traffic and the backlog in napi was used).
>
> From doing some searching, this seems to be a problem that has plagued
> some people for years with this driver.
>
> I am having trouble figuring out exactly what napi_synchronize is waiting
> for and who is supposed to toggle the flag it is waiting on. The flag
> appears to work backwards from what I would have expected it to do.
> I see lots of places that can set the bit, but only napi_enable seems
> to clear it again, and I don't see how that would get called for all
> the places that potentially set the bit.

I just realized NAPI_STATE_SCHED and NAPIF_STATE_SCHED refer to the same
bit (the NAPIF_ form is the mask, BIT(NAPI_STATE_SCHED)), so I need to
look at uses of both names when tracing who sets and clears it.
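To make the "backwards" flag easier to reason about, here is a minimal userspace model of the SCHED bit treated as an ownership token. It is a sketch under assumptions, not kernel code: the MODEL_* names, struct model_napi and all the helpers are hypothetical, bit 0 is an assumed position, and the kernel's real napi_schedule_prep() and napi_complete_done() also deal with other state bits. The idea is that whoever sets SCHED owns the NAPI context. A scheduled poll gives it back on completion, while the disable path keeps it until napi_enable() clears it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: bit position 0 is an assumption. The mask is the
 * same bit as the index, like NAPIF_STATE_SCHED vs NAPI_STATE_SCHED. */
#define MODEL_NAPI_STATE_SCHED  0
#define MODEL_NAPIF_STATE_SCHED (1UL << MODEL_NAPI_STATE_SCHED)

struct model_napi {
    atomic_ulong state;
};

/* Roughly napi_schedule_prep(): succeed only if SCHED was clear, so
 * exactly one caller becomes the owner of the next poll. */
static bool model_schedule_prep(struct model_napi *n)
{
    unsigned long old = atomic_fetch_or(&n->state, MODEL_NAPIF_STATE_SCHED);
    return (old & MODEL_NAPIF_STATE_SCHED) == 0;
}

/* Roughly napi_complete_done(): the current owner gives SCHED back. */
static void model_complete(struct model_napi *n)
{
    assert(atomic_load(&n->state) & MODEL_NAPIF_STATE_SCHED); /* must own it */
    atomic_fetch_and(&n->state, ~MODEL_NAPIF_STATE_SCHED);
}

/* One attempt of napi_disable()'s loop: take SCHED so no new poll can be
 * scheduled. Returns false while a poll still owns it (the kernel retries). */
static bool model_try_disable(struct model_napi *n)
{
    return model_schedule_prep(n);
}

/* Roughly napi_enable(): release the SCHED bit that disable left set. */
static void model_enable(struct model_napi *n)
{
    atomic_fetch_and(&n->state, ~MODEL_NAPIF_STATE_SCHED);
}

/* The condition napi_synchronize() keeps waiting on. */
static bool model_sched_set(struct model_napi *n)
{
    return (atomic_load(&n->state) & MODEL_NAPIF_STATE_SCHED) != 0;
}
```

In this model, only two things clear the bit: completing a poll that was scheduled, and re-enabling after a disable. If a poll is scheduled but never runs or never completes, nothing clears SCHED.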

Still, something seems wrong in some corner case where NAPI gets stuck
and you can't close the port anymore, because napi_synchronize() never
finishes. Some traffic pattern leaves the SCHED state bit set when it
should be clear, and nothing ever clears it. I even managed to see it get
stuck so that the port never passed traffic again and then hung on
shutdown. The NAPI poll function was never called again.

--
Len Sorensen