[net v2] net: dsa: microchip: fix race condition

From: Christian Eggers
Date: Tue Oct 06 2020 - 11:57:48 EST

Next message: muhammad . husaini . zulkifli: "[PATCH v3 0/2] mmc: sdhci-of-arasan: Enable UHS-1 support for Keem Bay SOC"
Previous message: Jarkko Sakkinen: "Re: [PATCH v39 21/24] x86/vdso: Implement a vDSO for Intel SGX enclave call"
Next in thread: Vladimir Oltean: "Re: [net v2] net: dsa: microchip: fix race condition"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Between queuing the delayed work and finishing the setup of the dsa
ports, the process may sleep in request_module() (via
phy_device_create()) and the queued work may be executed prior to the
switch net devices being registered. In ksz_mib_read_work(), a NULL
dereference will happen within netof_carrier_ok(dp->slave).

Not queuing the delayed work in ksz_init_mib_timer() makes things even
worse because the work will now be queued for immediate execution
(instead of 2000 ms) in ksz_mac_link_down() via
dsa_port_link_register_of().

Call tree:
ksz9477_i2c_probe()
\--ksz9477_switch_register()
\--ksz_switch_register()
+--dsa_register_switch()
| \--dsa_switch_probe()
| \--dsa_tree_setup()
| \--dsa_tree_setup_switches()
| +--dsa_switch_setup()
| | +--ksz9477_setup()
| | | \--ksz_init_mib_timer()
| | | |--/* Start the timer 2 seconds later. */
| | | \--schedule_delayed_work(&dev->mib_read, msecs_to_jiffies(2000));
| | \--__mdiobus_register()
| | \--mdiobus_scan()
| | \--get_phy_device()
| | +--get_phy_id()
| | \--phy_device_create()
| | |--/* sleeping, ksz_mib_read_work() can be called meanwhile */
| | \--request_module()
| |
| \--dsa_port_setup()
| +--/* Called for non-CPU ports */
| +--dsa_slave_create()
| | +--/* Too late, ksz_mib_read_work() may be called beforehand */
| | \--port->slave = ...
| ...
| +--Called for CPU port */
| \--dsa_port_link_register_of()
| \--ksz_mac_link_down()
| +--/* mib_read must be initialized here */
| +--/* work is already scheduled, so it will be executed after 2000 ms */
| \--schedule_delayed_work(&dev->mib_read, 0);
\-- /* here port->slave is setup properly, scheduling the delayed work should be safe */

Solution:
1. Do not queue (only initialize) delayed work in ksz_init_mib_timer().
2. Only queue delayed work in ksz_mac_link_down() if init is completed.
3. Queue work once in ksz_switch_register(), after dsa_register_switch()
has completed.

Fixes: 7c6ff470aa ("net: dsa: microchip: add MIB counter reading support")
Signed-off-by: Christian Eggers <ceggers@xxxxxxx>
---
v2:
---------
- no changes in the patch itself
- use correct subject-prefix
- changed wording of commit description
- added call tree to commit description
- added "Fixes:" tag

On Monday, 5 October 2020, 22:55:44 CEST, Vladimir Oltean wrote:
> On Mon, Oct 05, 2020 at 06:08:29PM +0200, Christian Eggers wrote:
> [...]
> Read this for more details
> https://www.kernel.org/doc/html/latest/networking/netdev-FAQ.html
Thank for pointing this out.

> FYI, you haven't even addressed the root cause of the problem, which is
> ksz_mib_read_work sticking its nose where it's not supposed to:
>
> /* Only read MIB counters when the port is told to do.
> * If not, read only dropped counters when link is not up.
> */
> if (!p->read) {
> const struct dsa_port *dp = dsa_to_port(dev->ds, i);
>
> if (!netif_carrier_ok(dp->slave))
> mib->cnt_ptr = dev->reg_mib_cnt;
> }
>
> This is simply Not Ok.
> Not only the dp->slave is on purpose registered outside of the driver's
> control (as you came to find out yourself), but not even all ports are
> user ports. For example, the CPU port doesn't have a valid struct
> net_device *slave pointer. You are just lucky that it's defined like
> this:
>
> struct dsa_port {
> /* A CPU port is physically connected to a master device.
> * A user port exposed to userspace has a slave device.
> */
> union {
> struct net_device *master;
> struct net_device *slave;
> };
>
> so the code is in fact checking the status of the master interface's link.
> But DSA doesn't assume that the *master and *slave pointers are under a
> union. That can change any day, and when it changes, the KSZ driver will
> break.
Calling dsa_is_user_port() before calling netif_carrier_ok() could improve the
situation a little bit...

/* Only read MIB counters when the port is told to do.
* If not, read only dropped counters when link is not up.
*/
port_r_cnt() is called independently of p->read and netif_carrier_ok()... What
is correct here (comment or code)?

> My personal feeling is that this driver hides a landmine beneath every
> line of code, and it isn't getting better.
https://github.com/Microchip-Ethernet/EVB-KSZ9477 ...

> Sure, you should absolutely add the call stack to the commit message,
> but how many people are going to git blame so they can see it. The code
> needs to be obviously correct.
>
> Things like needing to check dev->mib_read_interval as an indication
> whether the race between ksz_mac_link_down and ksz_switch_register is
> over are exactly the type of things that make it not fun to follow.
I needed some amount of time to understand the segfault and to draw the
call stack...

> If reading MIB counters for ports that are down is such a "waste of time"
> as per commit 7c6ff470aa867f53b8522a3a5c84c36ac7a20090, then how about
> scheduling the delayed work from .phylink_mac_link_up, and canceling it
> from .phylink_mac_link_down? Either that, or set a boolean variable to
> struct ksz_port p->link_up, to true or false respectively from the
> phylink callbacks, and using that as an indication whether to read the
> MIB counters or not, instead of accessing the potentially invalid
> dp->slave pointer? Would that work?
I am definitely not an expert for this driver. For starting/stopping the
delayed work on demand, a separate work struct for each port could be useful.
In this case, struct ksz_port would need a pointer to the ksz_device struct,
as the ports are allocated seperately and container_of() cannot be used.

Using a bool variable has the property, that reading the MIB will not be
performed "immediately" after phylink_mac_down(). But if I am correct, this
is also not the case today as the work is typically already queued when
ksz_mac_link_down() is executed.

- First call of ksz_mac_link_down:
Work is already queued (prior this patch) or will not be queued (after this
patch).

- Further calls:
Work is already queued (it requeues itself).

Result (please verify):
- Not scheduling the work in ksz_mac_link_down() won't change anything.
- Checking for mib_read_interval in ksz_switch_remove() can be obmitted,
as the condition is always true when ksz_switch_remove() is called.

> Sorry for rambling. I realize that there aren't probably a lot of things
> you can do better to fix this problem for stable, but maybe you could
> take some time and clean it up a little bit?
My original intention is to upstream the PTP functionality I have developed. I
hope that I can get this into without cleaning up the whole driver...

If you can confirm that checking mib_read_interval is not necessary in
ksz_switch_remove() and that scheduling the work in ksz_mac_link_down() makes
no sense, I can either extend this patch or create a follup up patch.

Best regards
Christian

drivers/net/dsa/microchip/ksz_common.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/net/dsa/microchip/ksz_common.c b/drivers/net/dsa/microchip/ksz_common.c
index 8e755b50c9c1..a94d2278b95c 100644
--- a/drivers/net/dsa/microchip/ksz_common.c
+++ b/drivers/net/dsa/microchip/ksz_common.c
@@ -103,14 +103,8 @@ void ksz_init_mib_timer(struct ksz_device *dev)

INIT_DELAYED_WORK(&dev->mib_read, ksz_mib_read_work);

- /* Read MIB counters every 30 seconds to avoid overflow. */
- dev->mib_read_interval = msecs_to_jiffies(30000);
-
for (i = 0; i < dev->mib_port_cnt; i++)
dev->dev_ops->port_init_cnt(dev, i);
-
- /* Start the timer 2 seconds later. */
- schedule_delayed_work(&dev->mib_read, msecs_to_jiffies(2000));
}
EXPORT_SYMBOL_GPL(ksz_init_mib_timer);

@@ -143,7 +137,9 @@ void ksz_mac_link_down(struct dsa_switch *ds, int port, unsigned int mode,

/* Read all MIB counters when the link is going down. */
p->read = true;
- schedule_delayed_work(&dev->mib_read, 0);
+ /* timer started */
+ if (dev->mib_read_interval)
+ schedule_delayed_work(&dev->mib_read, 0);
}
EXPORT_SYMBOL_GPL(ksz_mac_link_down);

@@ -446,6 +442,12 @@ int ksz_switch_register(struct ksz_device *dev,
return ret;
}

+ /* Read MIB counters every 30 seconds to avoid overflow. */
+ dev->mib_read_interval = msecs_to_jiffies(30000);
+
+ /* Start the MIB timer. */
+ schedule_delayed_work(&dev->mib_read, 0);
+
return 0;
}
EXPORT_SYMBOL(ksz_switch_register);
--
Christian Eggers
Embedded software developer

Arnold & Richter Cine Technik GmbH & Co. Betriebs KG
Sitz: Muenchen - Registergericht: Amtsgericht Muenchen - Handelsregisternummer: HRA 57918
Persoenlich haftender Gesellschafter: Arnold & Richter Cine Technik GmbH
Sitz: Muenchen - Registergericht: Amtsgericht Muenchen - Handelsregisternummer: HRB 54477
Geschaeftsfuehrer: Dr. Michael Neuhaeuser; Stephan Schenk; Walter Trauninger; Markus Zeiler

Next message: muhammad . husaini . zulkifli: "[PATCH v3 0/2] mmc: sdhci-of-arasan: Enable UHS-1 support for Keem Bay SOC"
Previous message: Jarkko Sakkinen: "Re: [PATCH v39 21/24] x86/vdso: Implement a vDSO for Intel SGX enclave call"
Next in thread: Vladimir Oltean: "Re: [net v2] net: dsa: microchip: fix race condition"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]