Re: iptables very slow after commit 784544739a25c30637397ace5489eeb6e15d7d49

From: Stephen Hemminger
Date: Sun Apr 12 2009 - 12:06:26 EST


On Sat, 11 Apr 2009 17:34:45 -0700
"Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx> wrote:

> On Sat, Apr 11, 2009 at 11:57:16AM -0700, Linus Torvalds wrote:
> >
> >
> > On Fri, 10 Apr 2009, Paul E. McKenney wrote:
> > >
> > > 1. Assuming that the synchronize_net() is intended to guarantee
> > > that the new rules will be in effect before returning to
> > > user space:
> >
> > Btw, I think that's a bad assumption.
>
> It does indeed appear to be!
>
> > The thing is, nobody can really care if the new rules are in effect or
> > not, because the thing you race with is not the "return to user space"
> > part, but the incoming packets.
> >
> > And those incoming packets might have been incoming before the rules were
> > set up too.
> >
> > So I seriously doubt you need to synchronize with any returning to user
> > space. What you want to synchronize with are the later actions that do
> > things like turning on the interface that the rules are attached to, and so on!
> >
> > So I would suggest:
> >
> > - remove the synchronize_net() entirely. Replace it with just freeing the
> > old rules using RCU.
> >
> > - new packets will always end up seeing the new rules. That includes the
> > case of somebody doing "ifconfig eth0 up" that enables a new source of
> > packets, so there are no real security issues.
> >
> > - if you enabled your network interfaces before you updated your packet
> > filtering rules, you already had a window where packets would come in
> > with the old rules, so doing a "synchronize_net()" in no way protects
> > against any race conditions anyway.
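> >
> > Something along these lines, say (sketch only: the struct and the
> > helper are made up for illustration, not the actual netfilter code):
> >
> > struct ruleset {
> >         struct rcu_head rcu;
> >         /* ... the rules themselves ... */
> > };
> >
> > static void free_ruleset_rcu(struct rcu_head *head)
> > {
> >         kfree(container_of(head, struct ruleset, rcu));
> > }
> >
> > /* replace path: publish the new rules, then defer the free
> >  * instead of blocking in synchronize_net() */
> > old = table->rules;
> > rcu_assign_pointer(table->rules, new_rules);
> > call_rcu(&old->rcu, free_ruleset_rcu);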
> >
> > Am I missing something?
>
> The issue at this point seems to be the need to get accurate snapshots
> of various counters -- there are a number of Linux networking users who
> need to account for every byte flowing through their systems. However,
> it is also necessary to update these counters very efficiently, given
> that they are updated on a per-packet basis. The current approach is
> as follows:
>
> 1. Install a new set of counters.
>
> 2. Wait for a grace period to elapse.
>
> 3. At this point, we know that all subsequent counting will happen
> on the new set of counters.
>
> 4. Add the value of the old set of counters to the new set of
> counters.
>
> 5. Copy the old set of counters up to user space.
>
> So we get a good snapshot in #5, while #4 ensures that we don't lose
> any counts when taking future snapshots. Unfortunately, #2 hits us
> with grace-period latencies on the critical path.
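>
> As a sketch of those steps (every name here is illustrative, not
> the actual ip_tables.c code):
>
> struct ctr_set *old = t->counters;
>
> rcu_assign_pointer(t->counters, new_set); /* 1. install new set */
> synchronize_rcu();                        /* 2. wait for a grace period */
> /* 3. all subsequent counting now hits new_set */
> add_counters(new_set, old);               /* 4. fold old counts into new */
> err = copy_to_user(ubuf, old, len);       /* 5. snapshot old to user space */
> kfree(old);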
>
> We are going through the following possibilities:
>
> o Stick with the current approach, and ask people to move to
> new batch-oriented interfaces. However, a 30x decrease in
> performance is pretty grim, even for an old-style interface.
>
> o Use various atomic tricks to get an immediate snapshot of the
> old counters after step 1. Make step 2 use call_rcu() instead
> of synchronize_rcu(), so that step 4 happens off the
> critical path.
>
> This approach moves the RCU grace period off of the critical
> path, but the atomic tricks are extremely ugly on 32-bit SMP
> machines. 32-bit UP machines and 64-bit machines are not
> too bad, though the 32-bit UP case does add preemption-disable
> overhead on the counter-update fastpath.
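>
> A sketch of that second option (names again illustrative), with the
> fold-in running from an RCU callback rather than blocking:
>
> struct retired_ctrs {
>         struct rcu_head rcu;
>         struct ctr_set *old;  /* retired counter set */
>         struct ctr_set *live; /* new set to fold them into */
> };
>
> static void fold_retired(struct rcu_head *head)
> {
>         struct retired_ctrs *r = container_of(head, struct retired_ctrs, rcu);
>
>         add_counters(r->live, r->old); /* old step 4, now deferred */
>         kfree(r->old);
>         kfree(r);
> }
>
> /* replace path, after installing the new set (error handling omitted): */
> r = kmalloc(sizeof(*r), GFP_KERNEL);
> r->old = old_set;
> r->live = new_set;
> snapshot_counters(ubuf, r->old); /* the ugly atomic part */
> call_rcu(&r->rcu, fold_retired); /* grace period now off the critical path */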
>
> o Provide some sort of expedited synchronize_rcu(). This might
> be able to decrease the hit from 30x down to maybe 5x.
> But I might need to do this for the fast-boot folks anyway,
> though I am first trying to get away with just speeding
> up synchronize_rcu(). Though I was not thinking in terms
> of 6x, let alone 30x.
>
> Please note that this would not be a drop-in replacement for
> synchronize_rcu(). One would use synchronize_rcu_expedited()
> (or whatever) only when the system really could not get any
> useful work done while the grace period was in progress.
> The general approach would be to keep the whole machine busy
> trying to get the grace period done as soon as possible.
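>
> For non-preemptible RCU, one way to picture such a primitive
> (sketch only, and only valid for !CONFIG_PREEMPT):
>
> /* Running a function in process context on every CPU forces a
>  * context switch on each, and for classic RCU a context switch
>  * is a quiescent state, so all pre-existing readers are done. */
> static void force_qs(struct work_struct *unused)
> {
>         /* nothing to do: arriving here is the quiescent state */
> }
>
> void synchronize_rcu_expedited(void)
> {
>         schedule_on_each_cpu(force_qs); /* run and wait on each CPU */
> }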
>
> Thanx, Paul

We could also try:
* Per-cpu spinlock on the counters (instead of synchronize_net).
  When doing an update, just acquire the lock on that cpu and futz
  with the counters under it. Overhead should still be less than the
  global rwlock of 2.6.29 and earlier (rough sketch after this list).

* Is synchronize_rcu()/synchronize_net() a stronger guarantee than needed?

* Use on_each_cpu() somehow to drive the grace period?

* Add a cond_resched() into net_rx_action(), which might get rx
  processing out of its RCU read-side critical section sooner? Also
  in the transmit packet scheduler.
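
Rough sketch of the per-cpu spinlock idea (all names illustrative;
assumes the counter fastpath runs in softirq context):

struct ctr_set { u64 pkts, bytes; }; /* simplified per-rule counters */

static DEFINE_PER_CPU(spinlock_t, ctr_lock); /* spin_lock_init()ed at boot */

/* packet fastpath: touch only this cpu's lock, so it is never
 * contended except against a reader summing the counters */
static void count_packet(struct ctr_set *ctrs, unsigned int len)
{
        spin_lock(&__get_cpu_var(ctr_lock));
        ctrs->pkts++;
        ctrs->bytes += len;
        spin_unlock(&__get_cpu_var(ctr_lock));
}

/* reader: take each cpu's lock in turn for a stable view of that
 * cpu's counters; no global rwlock and no grace period needed */
static void sum_counters(struct ctr_set *sum)
{
        int cpu;

        for_each_possible_cpu(cpu) {
                spin_lock_bh(&per_cpu(ctr_lock, cpu));
                add_cpu_counters(sum, cpu); /* fold that cpu's set into *sum */
                spin_unlock_bh(&per_cpu(ctr_lock, cpu));
        }
}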