Re: High contention on the sk_buff_head.lock

From: Eric Dumazet
Date: Wed Mar 18 2009 - 15:07:39 EST

Next message: Oren Laadan: "Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZdo?"
Previous message: Jaswinder Singh Rajput: "Re: [tip:tracing/ftrace] tracing: fix oops intracepoint_update_probe_range()"
In reply to: Vernon Mauery: "High contention on the sk_buff_head.lock"
Next in thread: Vernon Mauery: "Re: High contention on the sk_buff_head.lock"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Vernon Mauery a écrit :
> I have been beating on network throughput in the -rt kernel for some time
> now. After digging down through the send path of UDP packets, I found
> that the sk_buff_head.lock is under some very high contention. This lock
> is acquired each time a packet is enqueued on a qdisc and then acquired
> again to dequeue the packet. Under high networking loads, the enqueueing
> processes are not only contending among each other for the lock, but also
> with the net-tx soft irq. This makes for some very high contention on this
> one lock. My testcase is running varying numbers of concurrent netperf
> instances pushing UDP traffic to another machine. As the count goes from
> 1 to 2, the network performance increases. But from 2 to 4 and from 4
> to 8,
> we see a big decline, with 8 instances pushing about half of what a single
> thread can do.
>
> Running 2.6.29-rc6-rt3 on an 8-way machine with a 10GbE card (I have tried
> both NetXen and Broadcom, with very similar results), I can only push about
> 1200 Mb/s. Whereas with the mainline 2.6.29-rc8 kernel, I can push nearly
> 6000 Mb/s. But still not as much as I think is possible. I was curious and
> decided to see if the mainline kernel was hitting the same lock, and using
> /proc/lock_stat, it is hitting the sk_buff_head.lock as well (it was the
> number one contended lock).
>
> So while this issue really hits -rt kernels hard, it has a real effect on
> mainline kernels as well. The contention of the spinlocks is amplified
> when they get turned into rt-mutexes, which causes a double context switch.
>
> Below is the top of the lock_stat for 2.6.29-rc8. This was captured from
> a 1 minute network stress test. The next high contender had 2 orders of
> magnitude fewer contentions. Think of the throughput increase if we could
> ease this contention a bit. We might even be able to saturate a 10GbE
> link.
>
> lock_stat version 0.3
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> class name con-bounces contentions waittime-min
> waittime-max waittime-total acq-bounces acquisitions
> holdtime-min holdtime-max holdtime-total
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>
> &list->lock#3: 24517307 24643791 0.71
> 1286.62 56516392.42 34834296 44904018
> 0.60 164.79 31314786.02
> -------------
> &list->lock#3 15596927 [<ffffffff812474da>]
> dev_queue_xmit+0x2ea/0x468
> &list->lock#3 9046864 [<ffffffff812546e9>]
> __qdisc_run+0x11b/0x1ef
> -------------
> &list->lock#3 6525300 [<ffffffff812546e9>]
> __qdisc_run+0x11b/0x1ef
> &list->lock#3 18118491 [<ffffffff812474da>]
> dev_queue_xmit+0x2ea/0x468
>
>
> The story is the same for -rt kernels, only the waittime and holdtime
> are both
> orders of magnitude greater.
>
> I am not exactly clear on the solution, but if I understand correctly,
> in the
> past there has been some discussion of batched enqueueing and
> dequeueing. Is
> anyone else working on this problem right now who has just not yet posted
> anything for review? Questions, comments, flames?
>

Yes we have a known contention point here, but before adding more complex code,
could you try following patch please ?

[PATCH] net: Reorder fields of struct Qdisc

dev_queue_xmit() needs to dirty fields "state" and "q"

On x86_64 arch, they currently span two cache lines, involving more
cache line ping pongs than necessary.

Before patch :

offsetof(struct Qdisc, state)=0x38
offsetof(struct Qdisc, q)=0x48
offsetof(struct Qdisc, dev_queue)=0x60

After patch :

offsetof(struct Qdisc, dev_queue)=0x38
offsetof(struct Qdisc, state)=0x48
offsetof(struct Qdisc, q)=0x50

Signed-off-by: Eric Dumazet <dada1@xxxxxxxxxxxxx>

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f8c4742..e24feeb 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -51,10 +51,11 @@ struct Qdisc
u32 handle;
u32 parent;
atomic_t refcnt;
- unsigned long state;
+ struct netdev_queue *dev_queue;
+
struct sk_buff *gso_skb;
+ unsigned long state;
struct sk_buff_head q;
- struct netdev_queue *dev_queue;
struct Qdisc *next_sched;
struct list_head list;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Oren Laadan: "Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZdo?"
Previous message: Jaswinder Singh Rajput: "Re: [tip:tracing/ftrace] tracing: fix oops intracepoint_update_probe_range()"
In reply to: Vernon Mauery: "High contention on the sk_buff_head.lock"
Next in thread: Vernon Mauery: "Re: High contention on the sk_buff_head.lock"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]