Re: [PATCH 2/3] net: dsa: b53: mmap: register MDIO Mux bus controller

From: Álvaro Fernández Rojas
Date: Fri Mar 17 2023 - 10:17:37 EST


Hi Vladimir,

On Fri, 17 Mar 2023 at 14:04, Vladimir Oltean (<olteanv@xxxxxxxxx>) wrote:
>
> On Fri, Mar 17, 2023 at 01:06:43PM +0100, Álvaro Fernández Rojas wrote:
> > Hi Vladimir,
> >
> > On Fri, 17 Mar 2023 at 12:51, Vladimir Oltean (<olteanv@xxxxxxxxx>) wrote:
> > >
> > > On Fri, Mar 17, 2023 at 12:34:26PM +0100, Álvaro Fernández Rojas wrote:
> > > > b53 MMAP devices have an MDIO mux bus controller that must be registered
> > > > after the switch has been properly initialized. If the MDIO mux controller
> > > > is registered from a separate driver and the device has an external switch
> > > > present, this causes a race condition that hangs the device.
> > >
> > > Could you describe the race in more detail? Why does it hang the device?
> >
> > I didn't perform a full analysis of the problem, but I think what is
> > going on is that both b53 switches are probed and both fail because
> > the Ethernet device has not been probed yet.
> > At some point the internal switch is reset and left not fully
> > configured, and then the external switch is probed again. Since the
> > external switch is accessed through the internal switch's registers,
> > its MDIO accesses fail while the internal switch is not ready, and
> > this hangs the device.
>
> The proposed solution is too radical for a problem that was not properly
> characterized yet, so this patch set has my temporary NACK.

Forgive me, but why do you consider this solution too radical?

>
> > But maybe Florian or Jonas can give some more details about the issue...
>
> I think you also have the tools necessary to investigate this further.
> We need to know which resource belonging to the switch the MDIO mux
> needs. What is the earliest place you can add the call to
> b53_mmap_mdiomux_init() such that your board works reliably? Note that
> b53_switch_register() indirectly calls b53_setup(). By placing this
> function where you did, the entirety of b53_setup() has finished
> executing, and we don't know exactly which part of it is needed.

The following link contains boot logs for several scenarios, all with
the same result: any attempt to call b53_mmap_mdiomux_init() earlier
than b53_switch_register() results in either a kernel panic or a
device hang:
https://gist.github.com/Noltari/b0bd6d5211160ac7bf349d998d21e7f7

1. before b53_switch_register():
[ 1.756010] bcm53xx 0.1:1e: found switch: BCM53125, rev 4
[ 1.761917] bcm53xx 0.1:1e: failed to register switch: -517
[ 1.767759] b53-switch 10e00000.switch: MDIO mux bus init
[ 1.774237] b53-switch 10e00000.switch: found switch: BCM63xx, rev 0
[ 1.785673] bcm6368-enetsw 1000d800.ethernet: IRQ tx not found
[ 1.795932] bcm6368-enetsw 1000d800.ethernet: mtd mac 4c:60:de:86:52:12
[ 1.884320] bcm7038-wdt 1000005c.watchdog: Registered BCM7038 Watchdog
[ 1.901957] NET: Registered PF_INET6 protocol family
[ 1.935223] Segment Routing with IPv6
[ 1.939160] In-situ OAM (IOAM) with IPv6
[ 1.943514] NET: Registered PF_PACKET protocol family
[ 1.949564] 8021q: 802.1Q VLAN Support v1.8
[ 1.987591] CPU 1 Unable to handle kernel paging request at virtual address 00000000, epc == 804be000, ra == 804bbf3c
[ 1.998697] Oops[#1]:
[ 2.000995] CPU: 1 PID: 91 Comm: kworker/u4:3 Not tainted 5.15.98 #0
[ 2.007533] Workqueue: events_unbound deferred_probe_work_func
[ 2.013541] $ 0 : 00000000 00000001 804bdfd4 81ee6800
[ 2.018916] $ 4 : 834c7000 00000000 00000002 00000001
[ 2.024291] $ 8 : c0000000 00000110 00000114 00000000
[ 2.029668] $12 : 00000001 81cf2f8a fffffffc 00000000
[ 2.035043] $16 : 00000000 00000000 00000002 834bc680
[ 2.040420] $20 : 00000000 00000080 81c0700d 81f37a40
[ 2.045796] $24 : 00000018 00000000
[ 2.051171] $28 : 81f58000 81f59c80 80870000 804bbf3c
[ 2.056547] Hi : e6545baf
[ 2.059505] Lo : a4644567
[ 2.062462] epc : 804be000 mdio_mux_read+0x2c/0xd4
[ 2.067569] ra : 804bbf3c __mdiobus_read+0x20/0xc4
[ 2.072766] Status: 10008b03 KERNEL EXL IE
[ 2.077066] Cause : 00800008 (ExcCode 02)
[ 2.081187] BadVA : 00000000
[ 2.084145] PrId : 0002a070 (Broadcom BMIPS4350)
[ 2.088983] Modules linked in:
[ 2.092119] Process kworker/u4:3 (pid: 91, threadinfo=(ptrval), task=(ptrval), tls=00000000)
[ 2.100812] Stack : 00000080 80255cfc 81c0700d 81f37a40 834c7000 00000000 00000002 834c7558
[ 2.109438] 00000002 804bbf3c 00000000 83501f78 834bb0b0 834df478 8194eae0 834c7000
[ 2.118058] 00000000 804bc020 ffffffed 83508780 00000000 00000004 834bb0b0 81f5b800
[ 2.126677] 808eb104 808eb104 81950000 804c48cc 00000003 81f5b800 81f5b800 00000000
[ 2.135297] 808eb104 81f5b800 808eb104 804bc6c0 834c7570 10008b01 81f5b800 81f5b8e0
[ 2.143925] ...
[ 2.146435] Call Trace:
[ 2.148943] [<804be000>] mdio_mux_read+0x2c/0xd4
[ 2.153697] [<804bbf3c>] __mdiobus_read+0x20/0xc4
[ 2.158533] [<804bc020>] mdiobus_read+0x40/0x6c
[ 2.163193] [<804c48cc>] b53_mdio_probe+0x38/0x16c
[ 2.168120] [<804bc6c0>] mdio_probe+0x34/0x7c
[ 2.172600] [<80437930>] really_probe.part.0+0xac/0x35c
[ 2.177976] [<80437c8c>] __driver_probe_device+0xac/0x164
[ 2.183531] [<80437d90>] driver_probe_device+0x4c/0x158
[ 2.188907] [<80438444>] __device_attach_driver+0xd0/0x15c
[ 2.194552] [<804353a0>] bus_for_each_drv+0x70/0xb0
[ 2.199569] [<804380f0>] __device_attach+0xc0/0x1d8
[ 2.204588] [<804367f4>] bus_probe_device+0x9c/0xb8
[ 2.209604] [<80436d58>] deferred_probe_work_func+0x94/0xd4
[ 2.215339] [<80058314>] process_one_work+0x290/0x4d0
[ 2.220536] [<800588ac>] worker_thread+0x358/0x614
[ 2.225464] [<80061064>] kthread+0x148/0x16c
[ 2.229854] [<80013848>] ret_from_kernel_thread+0x14/0x1c
[ 2.235413]
[ 2.236931] Code: 00a0a025 8e700004 00c09025 <8e040000> 0c1ba5d8 24840558 8e020010 8e06000c 8e65000c
[ 2.247011]
[ 2.248726] ---[ end trace 9e5942a13795eb30 ]---
[ 2.253490] Kernel panic - not syncing: Fatal exception
[ 2.258831] Rebooting in 1 seconds..

2. before dsa_register_switch():
[ 1.759901] bcm53xx 0.1:1e: failed to register switch: -19
[ 1.765837] b53-switch 10e00000.switch: MDIO mux bus init
[ 1.771412] b53-switch 10e00000.switch: found switch: BCM63xx, rev 0
[ 1.782683] bcm6368-enetsw 1000d800.ethernet: IRQ tx not found
[ 1.793149] bcm6368-enetsw 1000d800.ethernet: mtd mac 4c:60:de:86:52:12
[ 1.875791] bcm7038-wdt 1000005c.watchdog: Registered BCM7038 Watchdog
[ 1.893480] NET: Registered PF_INET6 protocol family
[ 1.922283] Segment Routing with IPv6
[ 1.926192] In-situ OAM (IOAM) with IPv6
[ 1.930392] NET: Registered PF_PACKET protocol family
[ 1.936526] 8021q: 802.1Q VLAN Support v1.8
[ 2.245288] bcm53xx 1.1:1e: failed to register switch: -19
[ 2.251210] b53-switch 10e00000.switch: MDIO mux bus init
[ 2.256761] b53-switch 10e00000.switch: found switch: BCM63xx, rev 0
*** Device hangs ***

3. before b53_switch_init():
[ 1.757728] bcm53xx 0.1:1e: failed to register switch: -19
[ 1.763689] b53-switch 10e00000.switch: MDIO mux bus init
[ 1.769780] b53-switch 10e00000.switch: found switch: BCM63xx, rev 0
[ 1.781130] bcm6368-enetsw 1000d800.ethernet: IRQ tx not found
[ 1.790996] bcm6368-enetsw 1000d800.ethernet: mtd mac 4c:60:de:86:52:12
[ 1.875775] bcm7038-wdt 1000005c.watchdog: Registered BCM7038 Watchdog
[ 1.893523] NET: Registered PF_INET6 protocol family
[ 1.921605] Segment Routing with IPv6
[ 1.925513] In-situ OAM (IOAM) with IPv6
[ 1.929695] NET: Registered PF_PACKET protocol family
[ 1.935809] 8021q: 802.1Q VLAN Support v1.8
[ 2.244702] bcm53xx 1.1:1e: failed to register switch: -19
[ 2.250653] b53-switch 10e00000.switch: MDIO mux bus init
[ 2.256751] b53-switch 10e00000.switch: found switch: BCM63xx, rev 0
*** Device hangs ***

I'm happy to run more tests if needed.

Best regards,
Álvaro.