Re: [PATCH net-next v3] wireless-drivers: rtnetlink wifi simulation device

From: Cody Schuffelen
Date: Tue Nov 20 2018 - 22:21:21 EST


On Tue, Oct 9, 2018 at 1:25 AM Johannes Berg <johannes@xxxxxxxxxxxxxxxx> wrote:
>
> On Thu, 2018-10-04 at 12:59 -0700, Cody Schuffelen wrote:
> >
> > I wasn't completely clear on whether I should change the title (net-next
> > to mac80211-next) so I left it as is for v3 to try to keep the patchwork
> > series together.
>
> You can/should change it - patchwork doesn't really track this at all
> anyway.

Got it, thanks.

>
> > The driver is also now a bool instead of a tristate to use __ro_after_init.
>
> Hmm? Why would that be required? __ro_after_init works fine in modules?

My mistake, you're right. Tested and this does work with a module.

> > +static struct ieee80211_rate bitrates_2ghz[] = {
> > + { .bitrate = 10 },
> > + { .bitrate = 20 },
> > + { .bitrate = 55 },
> > + { .bitrate = 60 },
> > + { .bitrate = 110 },
> > + { .bitrate = 120 },
> > + { .bitrate = 240 },
> > +};
>
> Come to think of it, the typical order here would be 1,2,5.5,11,6,12,24
> (6<->11), due to the ordering in the probe request frame I guess.
>
> I'm not sure it matters though.

Thanks, changed.

>
> > +static struct ieee80211_supported_band band_2ghz = {
>
> These can be const, I think?

Unfortunately, the arrays they're assigned to in the wiphy is non-const:
https://github.com/torvalds/linux/blob/master/include/net/cfg80211.h#L4055
https://github.com/torvalds/linux/blob/master/include/net/cfg80211.h#L346

struct ieee80211_supported_band *bands[IEEE80211_NUM_BANDS];

> > +/** Assigned at module init. Guaranteed locally-administered and unicast. */
>
> I think you should avoid ** - it's the kernel-doc marker.

Good catch. Sorry about that, you pointed it out on an earlier version
as well on a different line.

>
> > +static u8 fake_router_bssid[ETH_ALEN] __ro_after_init = {};
>
> If this is the reason for not allowing it to be a module then you don't
> need to disallow the module case.

Good point, fixed.

> > +
> > +static void virt_wifi_scan_result(struct work_struct *work)
> > +{
> > + const union {
> > + struct {
> > + u8 tag;
> > + u8 len;
> > + u8 ssid[8];
> > + } __packed parts;
> > + u8 data[10];
> > + } ssid = { .parts = {
> > + .tag = WLAN_EID_SSID, .len = 8, .ssid = "VirtWifi" }
> > + };
>
> Not sure I see much value in the union, but I don't think it matters
> much.
> (You could just cast below - (void *)&ssid, sizeof(ssid))

Good point, done.

>
> > + rtnl_lock();
> > + if (priv->scan_request) {
> > + cfg80211_scan_done(priv->scan_request, &scan_info);
> > + priv->scan_request = NULL;
> > + }
> > + rtnl_unlock();
>
> Do you need the rtnl for the priv->scan_request locking?

I've redone the structure a bit to clean this up, and don't use the
lock here anymore.

> > +static int virt_wifi_get_station(struct wiphy *wiphy,
> > + struct net_device *dev,
> > + const u8 *mac,
> > + struct station_info *sinfo)
> > +{
> > [...]
> > + sinfo->tx_packets = 1;
> > + sinfo->tx_failed = 0;
> > + sinfo->signal = -60;
>
> I think Sergey pointed out the -60 elsewhere - need to check here too,
> and maybe set in some place that you actually report dBm/mBm.

Added a comment here, it looks like even with CFG80211_SIGNAL_TYPE_MBM
this should be in dBm.
https://github.com/torvalds/linux/blob/master/include/net/cfg80211.h#L1264

> Also, I think you should only report something here if actually
> connected - Sergey pointed this out on the dump station but it applies
> here too.

Thanks, get_station and dump_station now both check if the device is connected.

> > +static void free_netdev_and_wiphy(struct net_device *dev)
> > +{
> > + struct virt_wifi_netdev_priv *priv = netdev_priv(dev);
> > + struct virt_wifi_priv *w_priv;
> > +
> > + flush_work(&priv->register_wiphy_work);
> > + if (dev->ieee80211_ptr && !IS_ERR(dev->ieee80211_ptr)) {
> > + w_priv = wiphy_priv(dev->ieee80211_ptr->wiphy);
> > + w_priv->being_deleted = true;
> > + flush_delayed_work(&w_priv->scan_result);
> > + flush_delayed_work(&w_priv->connect);
> > + flush_delayed_work(&w_priv->disconnect);
>
> this is called from
>
> > +static void virt_wifi_setup(struct net_device *dev)
> > +{
> > + ether_setup(dev);
> > + dev->netdev_ops = &virt_wifi_ops;
> > + dev->priv_destructor = free_netdev_and_wiphy;
> > +}
>
> the destructor, but I believe the destructor is invoked with the RTNL
> held. As such, this will deadlock (and lockdep should complain - at
> least in current kernel versions) when it's actually running when you
> flush it, since you flush something that's (potentially) waiting to
> acquire the RTNL, but you are holding the RTNL here and so neither side
> can make progress.

Good catch, this was not a good strategy. I think it was stable for me
only because RTNL is technically not held when the destructor is
invoked by netdev_run_todo, though I see now that's only because
netdev_run_todo is a special case with respect to the RTNL mutex so
interacting with it there seems like a bad idea.
https://github.com/torvalds/linux/blob/master/net/core/dev.c#L8735

I've changed the asynchronous strategy: at a high level, it now sets
the device in a "down"/"being deleted" state to block any further
asynchronous requests, cancels pending delayed work operations, and
cleans up dangling callbacks the delayed work operations had committed
to doing. More functions are also now annotated with which locks
should already be held and which will be acquired and released. This
relies heavily on cancel_delayed_work_sync. My assumption here is that
cancel_delayed_work_sync puts the delayed work operations in a
definite "not running" state, and uses the return value to determine
if it needs to clean up any callbacks.

> > + /* The newlink callback is invoked while holding the rtnl lock, but
> > + * register_wiphy wants to claim the rtnl lock itself.
> > + */
> > + schedule_work(&priv->register_wiphy_work);
>
> Maybe we should fix/change that?

Worked around it for this implementation by sharing one wiphy across
all net_device instances. The module init and destroy now acquire the
rtnl mutex to init/destroy the wiphy as well as registering the rtnl
device type. The netdev setup is now more sane as the wiphy setup is
done ahead of time. The only downside is that multiple wifi devices
cannot scan simultaneously, so no great loss.

I've uploaded the new version here: https://lkml.org/lkml/2018/11/21/251