Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

From: John Fastabend
Date: Thu Sep 03 2020 - 21:30:35 EST


Cong Wang wrote:
> On Wed, Sep 2, 2020 at 7:22 PM Yunsheng Lin <linyunsheng@xxxxxxxxxx> wrote:
> >
> > On 2020/9/3 9:48, Cong Wang wrote:
> > > On Wed, Sep 2, 2020 at 6:22 PM Yunsheng Lin <linyunsheng@xxxxxxxxxx> wrote:
> > >>
> > >> On 2020/9/3 8:35, Cong Wang wrote:
> > >>> On Tue, Sep 1, 2020 at 11:35 PM Yunsheng Lin <linyunsheng@xxxxxxxxxx> wrote:
> > >>>>
> > >>>> On 2020/9/2 12:41, Cong Wang wrote:
> > >>>>> On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin <linyunsheng@xxxxxxxxxx> wrote:
> > >>>>>>
> > >>>>>> On 2020/9/2 2:24, Cong Wang wrote:
> > >>>>>>> On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin <linyunsheng@xxxxxxxxxx> wrote:
> > >>>>>>>>
> > >>>>>>>> Currently there is concurrent reset and enqueue operation for the
> > >>>>>>>> same lockless qdisc when there is no lock to synchronize the
> > >>>>>>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> > >>>>>>>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> > >>>>>>>> out-of-bounds access for priv->ring[] in hns3 driver if user has
> > >>>>>>>> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> > >>>>>>>> skb with a larger queue_mapping after the corresponding qdisc is
> > >>>>>>>> reset, and call hns3_nic_net_xmit() with that skb later.
> > >>>>>>>
> > >>>>>>> Can you be more specific here? Which call path requests a smaller
> > >>>>>>> tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
> > >>>>>>> we already have a synchronize_net() there.
> > >>>>>>
> > >>>>>> When the netdevice is in active state, the synchronize_net() seems to
> > >>>>>> do the correct work, as below:
> > >>>>>>
> > >>>>>> CPU 0: CPU1:
> > >>>>>> __dev_queue_xmit() netif_set_real_num_tx_queues()
> > >>>>>> rcu_read_lock_bh();
> > >>>>>> netdev_core_pick_tx(dev, skb, sb_dev);
> > >>>>>> .
> > >>>>>> . dev->real_num_tx_queues = txq;
> > >>>>>> . .
> > >>>>>> . .
> > >>>>>> . synchronize_net();
> > >>>>>> . .
> > >>>>>> q->enqueue() .
> > >>>>>> . .
> > >>>>>> rcu_read_unlock_bh() .
> > >>>>>> qdisc_reset_all_tx_gt
> > >>>>>>
> > >>>>>>
> > >>>>>
> > >>>>> Right.
> > >>>>>
> > >>>>>
> > >>>>>> but dev->real_num_tx_queues is not RCU-protected, maybe that is a problem
> > >>>>>> too.
> > >>>>>>
> > >>>>>> The problem we hit is as below:
> > >>>>>> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is called
> > >>>>>> to deactive the netdevice when user requested a smaller queue num, and
> > >>>>>> txq->qdisc is already changed to noop_qdisc when calling
> > >>>>>> netif_set_real_num_tx_queues(), so the synchronize_net() in the function
> > >>>>>> netif_set_real_num_tx_queues() does not help here.
> > >>>>>
> > >>>>> How could qdisc still be running after deactivating the device?
> > >>>>
> > >>>> qdisc could be running during the device deactivating process.
> > >>>>
> > >>>> The main process of changing channel number is as below:
> > >>>>
> > >>>> 1. dev_deactivate()
> > >>>> 2. hns3 handware related setup
> > >>>> 3. netif_set_real_num_tx_queues()
> > >>>> 4. netif_tx_wake_all_queues()
> > >>>> 5. dev_activate()
> > >>>>
> > >>>> During step 1, qdisc could be running while qdisc is resetting, so
> > >>>> there could be skb left in the old qdisc(which will be restored back to
> > >>>> txq->qdisc during dev_activate()), as below:
> > >>>>
> > >>>> CPU 0: CPU1:
> > >>>> __dev_queue_xmit(): dev_deactivate_many():
> > >>>> rcu_read_lock_bh(); qdisc_deactivate(qdisc);
> > >>>> q = rcu_dereference_bh(txq->qdisc); .
> > >>>> netdev_core_pick_tx(dev, skb, sb_dev); .
> > >>>> .
> > >>>> . rcu_assign_pointer(dev_queue->qdisc, qdisc_default);
> > >>>> . .
> > >>>> . .
> > >>>> . .
> > >>>> . .
> > >>>> q->enqueue() .
> > >>>
> > >>>
> > >>> Well, like I said, if the deactivated bit were tested before ->enqueue(),
> > >>> there would be no packet queued after qdisc_deactivate().

Trying to unwind this through git history :/

Original code had a test_bit in dev_xmit_skb(),

if (q->flags & TCQ_F_NOLOCK) {
if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
__qdisc_drop(skb, &to_free);
rc = NET_XMIT_DROP;
} else {
rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
__qdisc_run(q);
}

if (unlikely(to_free))
kfree_skb_list(to_free);
return rc;
}

So we would never enqueue something on top of a deactivated qdisc. And to ensure
we don't have any in-flight enqueues we have to swap qdiscs, wait a grace
period, and then reset the qdisc. That _should_ be OK.

But, I'm still not entirely sure how you got here. In the drivers I did I always
stop the queue before messing with these things with netif_tx_stop_queue(). Then
we really should not get these packets into the driver.

In sch_direct_xmit():

if (likely(skb)) {
HARD_TX_LOCK(dev, txq, smp_processor_id());
if (!netif_xmit_frozen_or_stopped(txq))
skb = dev_hard_start_xmit(skb, dev, txq, &ret);

HARD_TX_UNLOCK(dev, txq);
} else {
if (root_lock)
spin_lock(root_lock);
return true;
}

Maybe I missed something? Does your driver use the netif_tx_stop/start APIs?

> > >>
> > >> Only if the deactivated bit testing is also protected by qdisc->seqlock?
> > >> otherwise there is still window between setting and testing the deactivated bit.
> > >
> > > Can you be more specific here? Why testing or setting a bit is not atomic?
> >
> > testing a bit or setting a bit separately is atomic.
> > But testing a bit and setting a bit is not atomic, right?
> >
> > cpu0: cpu1:
> > testing A bit
> > setting A bit .
> > . .
> > . qdisc enqueuing
> > qdisc reset
> >
>
> Well, this was not a problem until commit d518d2ed8640c1cbbb.
> Prior to that commit, qdsic can still be scheduled even with this
> race condition, that is, the packet just enqueued after resetting can
> still be dequeued with qdisc_run().
>
> It is amazing to see how messy the lockless qdisc is now.
>
> Thanks.