[PATCH 0/3] netfilter : 3 patches to boost ip_tables performance

From: Eric Dumazet
Date: Wed Sep 21 2005 - 16:25:36 EST


Hi

I have reworked net/ipv4/ip_tables.c to boost its performance, and am posting three patches.

In oprofile results, ipt_do_table() used to rank first.
It now ranks 6th, using 1/3 of the CPU it was using before.
(Tests were done on a dual Xeon i386 and a dual Opteron x86_64.)

Short description:

1) No more central rwlock protecting each table (filter, nat, mangle, raw);
instead, one lock per CPU. This avoids cache line ping-pong on each packet.

2) Loop unrolling and various code optimizations to reduce branch
mispredictions.

3) NUMA-aware memory allocation (this part was posted earlier, but needed some polishing)


Long description:

1) No single rwlock_t protecting the whole table

One major bottleneck on SMP machines is that a single rwlock is taken
at ipt_do_table() entry and exit. Those two atomic operations are the
killer: even though multiple readers can work on the same table
concurrently (using read_lock_bh()), the cache line containing the rwlock
must still be acquired exclusively by each CPU at entry/exit of ipt_do_table().

As the existing code already uses a separate copy of the data for each CPU,
it is easy to convert the central rwlock into separate rwlocks, allocated
per CPU (and NUMA aware).

When a CPU enters ipt_do_table(), it locks its local copy, touching a
cache line that no other CPU uses. All further operations are done on the
'local' copy of the data.
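
As a minimal sketch of the hot-path side (not the actual patch: the lock
name below is illustrative, and the real code embeds the locks in the
per-CPU table data):

#include <linux/percpu.h>
#include <linux/spinlock.h>
#include <linux/interrupt.h>

/* One rwlock per CPU instead of one rwlock for the whole table.
 * Each lock is rwlock_init()'ed when the table is registered. */
static DEFINE_PER_CPU(rwlock_t, ipt_local_lock);

static void example_do_table(void)
{
        rwlock_t *lock;

        local_bh_disable();
        /* This CPU's lock lives in a cache line no other CPU touches. */
        lock = &__get_cpu_var(ipt_local_lock);
        read_lock(lock);

        /* ... walk this CPU's private copy of the rules ... */

        read_unlock(lock);
        local_bh_enable();
}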

When a sum of all counters must be computed, we write_lock one per-CPU part
at a time, instead of locking all parts at once, which reduces lock contention.
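
Continuing the same sketch, the counter summation side could look like
this; example_cpu_counters() is a hypothetical accessor for one CPU's
private counter array:

/* Control path: fold the per-CPU counters together.  Only one CPU's
 * lock is write-held at a time, so packet processing on the other
 * CPUs is barely disturbed. */
static void example_sum_counters(u64 *total, unsigned int nrules)
{
        unsigned int i;
        int cpu;

        for_each_possible_cpu(cpu) {
                const u64 *priv;

                write_lock_bh(&per_cpu(ipt_local_lock, cpu));
                priv = example_cpu_counters(cpu);       /* hypothetical */
                for (i = 0; i < nrules; i++)
                        total[i] += priv[i];
                write_unlock_bh(&per_cpu(ipt_local_lock, cpu));
        }
}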

2) Loop unrolling

It seems that with current compilers and CFLAGS, the code generated for
ip_packet_match() is quite poor, with a lot of mispredicted conditional
branches. With my patches, the generated code on i386 and x86_64
is much better.
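
For illustration, the flavor of the change is something like the sketch
below (not the actual patch: it only shows the incoming-interface check,
and the real ip_packet_match() also handles the IPT_INV_* inversion flags,
the address, protocol and fragment checks):

#include <linux/if.h>                           /* IFNAMSIZ */
#include <linux/netfilter_ipv4/ip_tables.h>     /* struct ipt_ip */

/* Compare the interface name against the rule in unsigned-long-sized
 * chunks, accumulating differences with bitwise ops, so there is a
 * single predictable test at the end instead of a conditional branch
 * per byte. */
static inline unsigned long ifname_mismatch(const char *indev,
                                            const struct ipt_ip *ipinfo)
{
        const unsigned long *dev  = (const unsigned long *)indev;
        const unsigned long *name = (const unsigned long *)ipinfo->iniface;
        const unsigned long *mask = (const unsigned long *)ipinfo->iniface_mask;
        unsigned long ret = 0;
        unsigned int i;

        for (i = 0; i < IFNAMSIZ / sizeof(unsigned long); i++)
                ret |= (dev[i] ^ name[i]) & mask[i];

        return ret;     /* zero means the interface matches the rule */
}

On x86_64, IFNAMSIZ / sizeof(unsigned long) is 2, so the compiler can
unroll the loop completely and the final test is the only remaining branch.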

3) NUMA allocation.

Part of the performance problem we have with netfilter is that memory
allocation is not NUMA aware, but 'only' SMP aware (i.e. each CPU normally
touches separate cache lines).

Even with small iptables rule sets, the cost of this misplacement can be high
on common workloads.

Instead of using one vmalloc() area (located on the node of the iptables
process), we now allocate an area for each possible CPU, using a NUMA policy
(MPOL_PREFERRED) so that memory is allocated on the CPU's node
if possible.

If the size of the ipt_table is small enough (less than one page), we use
kmalloc_node() instead of vmalloc(), to use less memory and fewer TLB entries
in small setups.
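
A rough sketch of that allocation policy, assuming a hypothetical helper
name (not the function used in the patch):

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/topology.h>

/* Allocate one private copy of the table for a given CPU, preferring
 * memory on that CPU's NUMA node.  Small tables use kmalloc_node();
 * the real patch wraps the vmalloc() fallback in an MPOL_PREFERRED
 * mempolicy so large tables also end up on the right node when possible. */
static void *alloc_table_for_cpu(size_t size, int cpu)
{
        if (size <= PAGE_SIZE)
                return kmalloc_node(size, GFP_KERNEL, cpu_to_node(cpu));
        return vmalloc(size);
}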

Note 1: I also optimized get_counters(): SET_COUNTER() is used for the first CPU (avoiding a memset()), and on SMP, ADD_COUNTER() is used for the other CPUs.
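
Roughly, the idea is the following sketch; the macros mirror ADD_COUNTER()
in ip_tables.h, while sum_counters() and the per-CPU array layout here are
illustrative:

#include <linux/types.h>

struct counter { u_int64_t pcnt, bcnt; };       /* packet and byte counts */

#define SET_COUNTER(c, b, p) do { (c).bcnt = (b); (c).pcnt = (p); } while (0)
#define ADD_COUNTER(c, b, p) do { (c).bcnt += (b); (c).pcnt += (p); } while (0)

/* First CPU: plain assignment, so 'counters' needs no memset() to zero.
 * Remaining CPUs (SMP only): accumulate on top. */
static void sum_counters(struct counter *counters,
                         struct counter **percpu,       /* [cpu][rule] */
                         unsigned int ncpus, unsigned int nrules)
{
        unsigned int cpu, i;

        for (i = 0; i < nrules; i++)
                SET_COUNTER(counters[i], percpu[0][i].bcnt, percpu[0][i].pcnt);

        for (cpu = 1; cpu < ncpus; cpu++)
                for (i = 0; i < nrules; i++)
                        ADD_COUNTER(counters[i], percpu[cpu][i].bcnt,
                                    percpu[cpu][i].pcnt);
}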

Note 2: This patch depends on another patch that declares sys_set_mempolicy()
in include/linux/syscalls.h
( http://marc.theaimsgroup.com/?l=linux-kernel&m=112725288622984&w=2 )


Thank you
Eric Dumazet