Re: [PATCH] rps: process the skb directly if rps cpu not changed

From: Yunsheng Lin
Date: Thu Mar 23 2023 - 06:49:17 EST


On 2023/3/23 18:04, xu xin wrote:
>> On 2023/3/22 15:24, xu xin wrote:
>>> [So sorry, I made a mistake in the reply title]
>>>
>>> On 2023/3/21 20:12, yang.yang29@xxxxxxxxxx wrote:
>>>>> From: xu xin <xu.xin16@xxxxxxxxxx>
>>>>>
>>>>> In the RPS procedure of NAPI receiving, regardless of whether the
>>>>> rps-calculated CPU of the skb equals to the currently processing CPU, RPS
>>>>> will always use enqueue_to_backlog to enqueue the skb to per-cpu backlog,
>>>>> which will trigger a new NET_RX softirq.
>>>>
>>>> Does bypassing the backlog cause out of order problem for packet handling?
>>>> It seems currently the RPS/RFS will ensure order delivery,such as:
>>>> https://elixir.bootlin.com/linux/v6.3-rc3/source/net/core/dev.c#L4485
>>>>
>>>> Also, this is an optimization, it should target the net-next branch:
>>>> [PATCH net-next] rps: process the skb directly if rps cpu not changed
>>>>
>>>
>>> Well, I thought the patch would't break the effort RFS tried to avoid "Out of
>>> Order" packets. But thanks for your reminder, I rethink it again, bypassing the
>>> backlog from "netif_receive_skb_list" will mislead RFS's judging if all
>>> previous packets for the flow have been dequeued, where RFS thought all packets
>>> have been dealed with, but actually they are still in skb lists. Fortunately,
>>> bypassing the backlog from "netif_receive_skb" for a single skb is okay and won't
>>> cause OOO packets because every skb is processed serially by RPS and sent to the
>>> protocol stack as soon as possible.
>>
>> Suppose a lot of skbs have been queued to the backlog waiting to
>> processed and passed to the stack when current_cpu is not the same
>> as the target cpu,
>
> Well. I'm afraid that what we mean by current_cpu may be different. The
> "current_cpu" in my patch refers to the cpu NAPI poll is running on (Or
> the cpu that the skb origins from).

That's what I meant too. current_cpu refers to the cpu on which the driver's
NAPI poll is running.

>
>> then current_cpu is changed to be the same as the
>> target cpu, with your patch, new skb will be processed and passed to
>> the stack immediately, which may bypass the old skb in the backlog.
>>
> I think Nop, RFS procedure won't let target cpu switch into a new cpu
> if there are still old skbs in the backlog of last recorded cpu. So the
> target cpu of the new skb will never equal to current_cpu if old skb in the
> backlog.
> ==========================================================================
> Let me draw the situation you described: At the time of T1, the app runs
> on cpu-0, so there are many packets queueing into the rxqueue-0 by RFS from
> CPU-1(suppose NAPI poll processing on cpu-1). Then, suddenly at the time of
> T2, the app tranfers to cpu-1, RFS know there are still old skb in rxqueue-0,
> so get_rps_cpu will not return a value of cpu-1, but cpu-0 instead.
>
> ========================================================
> When T1, app runs on cpu-0:
> APP
> -----------------------------
> | | | |
> |cpu-0 | |cpu-1 |
> |stack | |stack |
> | | | |
> ^
> |=|
> |=| | |
> |=| | |
> (rxqueue-0) (rxqueue-1,empty)
> ^<--
> <--
> <--
> <-- packet(poll on cpu1)
>
> ===========================================================
> When T2, app tranfer to cpu-1, target cpu is still on cpu-0:
> APP
> ----------------------------
> | | | |
> |cpu-0 | |cpu-1 |
> |stack | |stack |
> | | | |
> ^
> |
> |=| | |
> |=| | |
> (rxqueue-0) (rxqueue-2,empty)
> ^<--
> <--
> <--
> <-- packet(poll on cpu1)

what about the NAPI poll changing between cpu0 and cpu1 while
APP stays on the same cpu?

Note that backlog queue is per cpu.

>
> ===================================
>
> Thanks for your reply.
>
>>>
>>> If I'm correct, the code as follws can fix this.
>>>
>>> --- a/net/core/dev.c
>>> +++ b/net/core/dev.c
>>> @@ -5666,8 +5666,9 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
>>> if (static_branch_unlikely(&rps_needed)) {
>>> struct rps_dev_flow voidflow, *rflow = &voidflow;
>>> int cpu = get_rps_cpu(skb->dev, skb, &rflow);
>>> + int current_cpu = smp_processor_id();
>>>
>>> - if (cpu >= 0) {
>>> + if (cpu >= 0 && cpu != current_cpu) {
>>> ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
>>> rcu_read_unlock();
>>> return ret;
>>> @@ -5699,11 +5700,15 @@ void netif_receive_skb_list_internal(struct list_head *head)
>>> list_for_each_entry_safe(skb, next, head, list) {
>>> struct rps_dev_flow voidflow, *rflow = &voidflow;
>>> int cpu = get_rps_cpu(skb->dev, skb, &rflow);
>>> + int current_cpu = smp_processor_id();
>>>
>>> if (cpu >= 0) {
>>> /* Will be handled, remove from list */
>>> skb_list_del_init(skb);
>>> - enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
>>> + if (cpu != current_cpu)
>>> + enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
>>> + else
>>> + __netif_receive_skb(skb);
>>> }
>>> }
>>>
>>>
>>> Thanks.
>>>
>>>>>
>>>>> Actually, it's not necessary to enqueue it to backlog when rps-calculated
>>>>> CPU id equals to the current processing CPU, and we can call
>>>>> __netif_receive_skb or __netif_receive_skb_list to process the skb directly.
>>>>> The benefit is that it can reduce the number of softirqs of NET_RX and reduce
>>>>> the processing delay of skb.
>>>>>
>>>>> The measured result shows the patch brings 50% reduction of NET_RX softirqs.
>>>>> The test was done on the QEMU environment with two-core CPU by iperf3.
>>>>> taskset 01 iperf3 -c 192.168.2.250 -t 3 -u -R;
>>>>> taskset 02 iperf3 -c 192.168.2.250 -t 3 -u -R;
>>>>>
>>>>> Previous RPS:
>>>>> CPU0 CPU1
>>>>> NET_RX: 45 0 (before iperf3 testing)
>>>>> NET_RX: 1095 241 (after iperf3 testing)
>>>>>
>>>>> Patched RPS:
>>>>> CPU0 CPU1
>>>>> NET_RX: 28 4 (before iperf3 testing)
>>>>> NET_RX: 573 32 (after iperf3 testing)
>>>>
>>>> Sincerely.
>>>> Xu Xin
>>> .
>>>
> .
>