Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

From: Josh Hunt
Date: Thu Aug 20 2020 - 15:06:16 EST


Hi Jike

On 8/20/20 12:43 AM, Jike Song wrote:
> Hi Josh,
>
> We met possibly the same problem when testing nvidia/mellanox's
> GPUDirect RDMA product, we found that changing NET_SCH_DEFAULT to
> DEFAULT_FQ_CODEL mitigated the problem, having no idea why. Maybe you
> can also have a try?

We did something similar: we've switched over to using the fq scheduler everywhere for now. We believe the bug is in the NOLOCK code, which only pfifo_fast uses atm, but we've been unable to come up with a satisfactory solution.


> Besides, our testing is pretty complex, do you have a quick test to
> reproduce it?


Unfortunately we don't have a simple test case either. Our current reproducer is complex as well, although it seems like we should be able to come up with something simpler: maybe 2 threads trying to send on the same tx queue running pfifo_fast every few hundred milliseconds, with little or no other tx traffic on that queue.

IIRC we believe the scenario is that one thread is in the process of dequeuing a packet while another is enqueuing. The enqueue-er (word? :)) sees that a dequeue is in progress and so does not xmit the packet, assuming the dequeue operation will take care of it. However, b/c the dequeue is already in the process of completing, it doesn't, and the newly enqueued packet stays in the qdisc until another packet is enqueued, pushing both out.
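To make the suspected interleaving concrete, here is a toy model of it. This is only a sketch of the scenario as described above, not kernel code: ToyQdisc, its running flag, and replay_race are hypothetical names made up for illustration, and the two "threads" are replayed as one fixed sequence of steps so the outcome is deterministic.

```python
from collections import deque


class ToyQdisc:
    """Toy stand-in for a lockless qdisc. Hypothetical; not the kernel's code."""

    def __init__(self):
        self.queue = deque()
        self.running = False  # models "a dequeue is in progress"
        self.sent = []        # packets that actually made it out

    def try_begin_run(self):
        # Only one thread at a time may dequeue and transmit.
        if self.running:
            return False
        self.running = True
        return True

    def end_run(self):
        self.running = False

    def drain(self):
        # Transmit everything currently queued.
        while self.queue:
            self.sent.append(self.queue.popleft())

    def enqueue_and_xmit(self, pkt):
        self.queue.append(pkt)
        if not self.try_begin_run():
            # Someone else is dequeuing; assume they will send our packet.
            return
        self.drain()
        self.end_run()


def replay_race():
    q = ToyQdisc()

    # Thread A: enqueues p1, becomes the runner, drains the queue.
    q.queue.append("p1")
    q.try_begin_run()
    q.drain()
    # A has seen an empty queue but has not yet cleared the running flag.

    # Thread B: enqueues p2, sees a run in progress, and backs off.
    q.enqueue_and_xmit("p2")

    # Thread A: finishes its run without looking at the queue again.
    q.end_run()

    sent_before = list(q.sent)
    stuck = list(q.queue)

    # Some time later another packet arrives and pushes both out.
    q.enqueue_and_xmit("p3")
    return sent_before, stuck, list(q.sent)
```

In this model replay_race() leaves p2 sitting in the queue after both threads have returned, and p2 only goes out once p3 is enqueued, which matches the symptom. Whether the real window is exactly this one is still a guess on our side.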

Given that we have a workaround in using fq, or any other qdisc not named pfifo_fast, this has gotten bumped down in priority for us. I would like to work on a reproducer at some point, but it likely won't be for a few weeks :(

Josh