Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support

From: Tobias Waldekranz
Date: Wed Apr 14 2021 - 14:40:00 EST


On Wed, Apr 14, 2021 at 17:14, Marek Behun <marek.behun@xxxxxx> wrote:
> On Tue, 13 Apr 2021 20:16:24 +0200
> Tobias Waldekranz <tobias@xxxxxxxxxxxxxx> wrote:
>
>> You could imagine a different mode in which the DSA driver would receive
>> the bucket allocation from the bond/team driver (which in turn could
>> come all the way from userspace). Userspace could then implement
>> whatever strategy it wants to maximize utilization, though still bound
>> by the limitations of the hardware in terms of fields considered during
>> hashing of course.
>
> The problem is that even with the ability to change the bucket
> configuration however we want it still can happen with non-trivial
> probability that all (src,dst) pairs on the network will hash to one
> bucket.
>
> The probability of that happening is 1/(8^(n-1)) for n (src,dst) pairs.

Yes, I understand all that, hence "though still bound by the limitations
of the hardware in terms of fields considered during hashing of course."
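
For reference, the quoted figure is what you get if you assume that each
(src,dst) pair hashes independently and uniformly into one of 8 buckets:
the first pair may land anywhere, and each of the remaining n-1 pairs has
to hit the same bucket. A minimal sketch of that arithmetic follows. The
function name and the uniform-hash assumption are mine, not a property of
the hardware hash:

```python
from fractions import Fraction


def p_all_same_bucket(n_pairs: int, n_buckets: int = 8) -> Fraction:
    """Probability that n_pairs (src,dst) pairs all land in one bucket.

    Assumes every pair is hashed independently and uniformly, which a
    real hardware hash over a handful of header fields may not satisfy.
    """
    if n_pairs < 1:
        raise ValueError("need at least one (src,dst) pair")
    # The first pair can pick any bucket; each later pair must match it.
    return Fraction(1, n_buckets) ** (n_pairs - 1)


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        p = p_all_same_bucket(n)
        print(f"n={n}: {p} ({float(p):.4%})")
```

For n = 2 this gives 1/8, and for n = 3 it gives 1/64.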

> On Turris Omnia the most common configuration is that the switch ports
> are bridged.
>
> If the user plugs only two devices into the lan ports, one would expect
> that both devices could utilize 1 gbps each. In this case there is
> 1/8 probability that both devices would hash to the same bucket. It is
> quite bad if multi-CPU upload won't work for 12.5% of our customers that
> are using our device in this way.

Agreed, but it is a category error to talk in terms of expectations and
desires here. I am pretty sure the silicon just does not have the gates
required to do per-port steering in combination with bridging (except
by using the TCAM).

> So if there is some reasonable solution how to implement multi-CPU via
> the port vlan mask, I will try to pursue this.

I hope whatever solution you come up with does not depend on the
destination being unknown. If the current patch works for the reason I
suspect, you will effectively limit the downstream bandwidth of all
connected stations to 1G minus the aggregated upstream rate. Example:

     .------.
 A --+ lan0 |
 B --+ lan1 |
 C --+ lan2 |
 D --+ lan3 |
     |      |
     + wan  |
     '------'

If you run with this series applied, in this setup, and have A,B,C each
send a 10 kpps flow to the CPU, what is the observed rate on D? My
guess would be 30 kpps, as all traffic is being flooded as unknown
unicast. This is true also for net-next at the moment. To solve that you
have to load the CPU address in the ATU, at which point you have to
decide between cpu0 and cpu1.
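
To make the guess concrete, here is a toy model of the forwarding
decision: a destination that is missing from the ATU is flooded to every
port except the ingress port, while a destination that is present is sent
to exactly one port. This is only a sketch. The port names, the CPU MAC
address and the egress_rates() helper are made up for illustration, and
the port VLAN mask is not modelled, so only the lan3 figure in the
flooded case is meaningful:

```python
def egress_rates(flows, ports, atu):
    """Sum the per-port egress rate (kpps) for a set of flows.

    flows: list of (ingress_port, dst_mac, rate_kpps)
    atu:   dict mapping dst_mac -> egress port; a miss means the frame
           is flooded to all ports except the one it arrived on.
    """
    rates = {p: 0 for p in ports}
    for ingress, dst, rate in flows:
        if dst in atu:
            targets = [atu[dst]]
        else:
            targets = [p for p in ports if p != ingress]
        for p in targets:
            rates[p] += rate
    return rates


PORTS = ["lan0", "lan1", "lan2", "lan3", "wan", "cpu0", "cpu1"]
CPU_MAC = "02:00:00:00:00:01"  # hypothetical address of the CPU interface

# A, B and C each send 10 kpps towards the CPU.
flows = [(p, CPU_MAC, 10) for p in ("lan0", "lan1", "lan2")]

flooded = egress_rates(flows, PORTS, atu={})
pinned = egress_rates(flows, PORTS, atu={CPU_MAC: "cpu0"})

print(flooded["lan3"])               # 30: D receives all three flows
print(pinned["lan3"])                # 0: nothing leaks to D
print(pinned["cpu0"], pinned["cpu1"])  # 30 0: one entry, one CPU port
```

With the entry loaded, D is quiet again, but a single ATU entry can only
point at one of the two CPU ports.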

In order to have two entries for the same destination, they must belong
to different FIDs. But that FID is also used for automatic learning. So
if all ports use their own FID, all the switched traffic will have to be
flooded instead, since any address learned on lan0 will be invisible to
lan1,2,3 and vice versa.
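
The trade-off can be sketched with an address table keyed on (FID, MAC).
Again, this is a toy model under my own assumptions and not the
mv88e6xxx driver: the Atu class, forward() and the station addresses are
hypothetical, and learning is reduced to "record the source on ingress":

```python
class Atu:
    """Toy address table keyed on (fid, mac)."""

    def __init__(self):
        self.entries = {}

    def learn(self, fid, mac, port):
        self.entries[(fid, mac)] = port

    def lookup(self, fid, mac):
        # None means "unknown in this FID", i.e. the frame is flooded.
        return self.entries.get((fid, mac))


def forward(atu, port_fid, ingress, src, dst):
    """Learn src in the ingress port's FID, then look up dst there."""
    fid = port_fid[ingress]
    atu.learn(fid, src, ingress)
    hit = atu.lookup(fid, dst)
    return hit if hit is not None else "flood"


MAC_A, MAC_B = "02:00:00:00:00:0a", "02:00:00:00:00:0b"  # hypothetical

# Shared FID: B's reply finds the entry learned from A on lan0.
shared = Atu()
shared_fid = {"lan0": 1, "lan1": 1}
first_shared = forward(shared, shared_fid, "lan0", MAC_A, MAC_B)
reply_shared = forward(shared, shared_fid, "lan1", MAC_B, MAC_A)

# One FID per port: A was learned in FID 1, B looks in FID 2 and misses.
split = Atu()
split_fid = {"lan0": 1, "lan1": 2}
first_split = forward(split, split_fid, "lan0", MAC_A, MAC_B)
reply_split = forward(split, split_fid, "lan1", MAC_B, MAC_A)

print(first_shared, reply_shared)  # flood lan0
print(first_split, reply_split)    # flood flood
```

Separate FIDs are what would let two entries for the CPU address
coexist, and in this model they are also exactly what keeps lan1 from
seeing what was learned on lan0.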