Re: Extend irq_set_affinity_notifier() to use a call chain

From: Amir Vadai
Date: Mon May 26 2014 - 08:04:17 EST


On 5/26/2014 2:34 PM, Thomas Gleixner wrote:
> On Mon, 26 May 2014, Amir Vadai wrote:
>
>> On 5/26/2014 2:15 PM, Thomas Gleixner wrote:
>>> On Sun, 25 May 2014, Amir Vadai wrote:
>>>> In order to do that, I need to add a new irq affinity notification
>>>> callback (In addition to the existing cpu_rmap notification). For
>>>> that I would like to extend irq_set_affinity_notifier() to have a
>>>> notifier call-chain instead of a single notifier callback.
>>>
>>> Why? "I would like" is a non argument.

>> The current implementation enables only one callback to be registered for
>> irq affinity change notifications.

> I'm well aware of that.

>> cpu_rmap is registered to be notified - for RFS purposes. mlx4_en (and
>> probably other network drivers) needs to be notified too, in order
>> to stop the napi polling on the old cpu and move to the new one. To
>> enable more than one notification callback, I suggest using a
>> notifier call chain.

> You are not describing what needs to be notified and why. Please
> explain the details of that and how the RFS (whatever that is) and the
> network driver are connected
The goal of RFS (Receive Flow Steering) is to increase the data cache hit
rate by steering kernel processing of packets in multi-queue devices to the
CPU where the application thread consuming the packet is running.

In order to select the right queue, the networking stack needs a reverse map
of the IRQ affinity. This is the rmap that was added by Ben Hutchings [1].
To keep the rmap updated, cpu_rmap registers for the affinity notification.

This is the first affinity callback - it lives in a general library
(lib/cpu_rmap.c) and not under net/...
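
For reference, the driver-side wiring is roughly like this (simplified
sketch, not taken from any particular driver; alloc_irq_cpu_rmap(),
irq_cpu_rmap_add() and free_irq_cpu_rmap() are the existing lib/cpu_rmap.c
API, everything else is made up):

#ifdef CONFIG_RFS_ACCEL
#include <linux/cpu_rmap.h>
#include <linux/netdevice.h>

/* Hypothetical setup: one rx queue per IRQ vector. */
static int my_setup_rx_cpu_rmap(struct net_device *dev,
				const int *irqs, unsigned int nvec)
{
	struct cpu_rmap *rmap;
	unsigned int i;
	int err;

	rmap = alloc_irq_cpu_rmap(nvec);
	if (!rmap)
		return -ENOMEM;

	for (i = 0; i < nvec; i++) {
		/*
		 * irq_cpu_rmap_add() installs a glue irq_affinity_notify
		 * for this irq, i.e. it already occupies the single
		 * notifier slot that irq_set_affinity_notifier() provides.
		 */
		err = irq_cpu_rmap_add(rmap, irqs[i]);
		if (err) {
			free_irq_cpu_rmap(rmap);
			return err;
		}
	}

	dev->rx_cpu_rmap = rmap;
	return 0;
}
#endif /* CONFIG_RFS_ACCEL */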

The motivation for the second irq affinity callback is:
When traffic starts, the first packet fires an interrupt which starts napi
polling on the cpu selected according to the irq affinity.
As long as there are packets for the napi poll to consume, no further
interrupts are fired, and napi keeps consuming packets on the cpu where it
was started.
If the user changes the irq affinity, napi polling will continue to run on
the original cpu.
Only when the traffic pauses does the napi session finish; when traffic
resumes, the new napi session runs on the new cpu.
This is problematic behavior, because from the user's point of view, cpu
affinity can't be changed under non-stop traffic.

To solve this, the network driver should be notified of the irq affinity
change event and restart the napi session. This can be done by closing the
napi session and re-arming the interrupts. The next packet to arrive will
then trigger an interrupt, and the napi session will start again, this time
on the new CPU.
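
On the driver side I have something along these lines in mind (untested
sketch; the ring structure, flag and helper names are invented, only
struct irq_affinity_notify and irq_set_affinity_notifier() are the
existing API):

#include <linux/interrupt.h>
#include <linux/netdevice.h>

/* Hypothetical per-rx-ring context; names are made up for illustration. */
enum {
	MY_RING_RESTART_NAPI,	/* napi poll should stop and re-arm the irq */
};

struct my_rx_ring {
	struct napi_struct napi;
	unsigned long state;
	struct irq_affinity_notify affinity_notify;
};

static void my_irq_affinity_notify(struct irq_affinity_notify *notify,
				   const cpumask_t *mask)
{
	struct my_rx_ring *ring =
		container_of(notify, struct my_rx_ring, affinity_notify);

	/*
	 * Runs from the irq core's affinity-notify work. Ask the napi
	 * poller to finish the current session and re-arm the interrupt;
	 * the next packet then fires an interrupt on a CPU in the new
	 * mask and napi restarts there.
	 */
	set_bit(MY_RING_RESTART_NAPI, &ring->state);
}

static void my_irq_affinity_release(struct kref *ref)
{
	/* The notify struct is embedded in the ring, nothing to free. */
}

static int my_set_affinity_notifier(struct my_rx_ring *ring, unsigned int irq)
{
	ring->affinity_notify.notify = my_irq_affinity_notify;
	ring->affinity_notify.release = my_irq_affinity_release;

	/*
	 * Problem: with the current API this call replaces the cpu_rmap
	 * notifier installed by irq_cpu_rmap_add() - only one notifier
	 * per irq is possible, hence the wish for a call chain.
	 */
	return irq_set_affinity_notifier(irq, &ring->affinity_notify);
}

The poll routine would then see the flag, do napi_complete() and re-arm the
interrupt instead of rescheduling itself.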

> and why this notification cannot be
> propagated inside the network stack itself.

To my understanding, those are two different consumers of the same event:
one is a general library that maintains a reverse irq affinity map, and the
other is networking specific, and maybe even specific to a single network
driver.
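
To make the call chain idea concrete, the direction I am thinking of is
something like this (very rough sketch, not a patch; the two functions are
invented names, only the blocking notifier helpers are existing kernel API):

#include <linux/notifier.h>
#include <linux/cpumask.h>

/* Stand-in for a per-irq chain that would live in irq_desc. */
static BLOCKING_NOTIFIER_HEAD(irq_affinity_chain);

/* Both cpu_rmap and the network driver would register a notifier_block. */
int irq_affinity_notify_register(struct notifier_block *nb)
{
	return blocking_notifier_chain_register(&irq_affinity_chain, nb);
}

/* Called from the existing affinity-notify work item with the new mask. */
void irq_affinity_notify_all(struct cpumask *mask)
{
	blocking_notifier_call_chain(&irq_affinity_chain, 0, mask);
}

A blocking chain should be fine here, since the existing notification
already runs from a work item and not from hard irq context.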

[1] commit c39649c ("lib: cpu_rmap: CPU affinity reverse-mapping")

Thanks,
Amir


> notifier chains are almost always a clear sign for a design disaster
> and I'm not going to even think about it before I have a concise
> explanation of the problem at hand and why a notifier chain is a good
> solution.
>
> Thanks,
>
> 	tglx


