Re: [Bug #11308] tbench regression on each kernel release from 2.6.22-> 2.6.28

From: Eric Dumazet
Date: Mon Nov 17 2008 - 06:22:25 EST


Ingo Molnar wrote:
* David Miller <davem@xxxxxxxxxxxxx> wrote:

From: Ingo Molnar <mingo@xxxxxxx>
Date: Mon, 17 Nov 2008 10:06:48 +0100

* Rafael J. Wysocki <rjw@xxxxxxx> wrote:

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27. Please verify if it still should
be listed and let me know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11308
Subject : tbench regression on each kernel release from 2.6.22 -> 2.6.28
Submitter : Christoph Lameter <cl@xxxxxxxxxxxxxxxxxxxx>
Date : 2008-08-11 18:36 (98 days old)
References : http://marc.info/?l=linux-kernel&m=121847986119495&w=4
http://marc.info/?l=linux-kernel&m=122125737421332&w=4
Christoph, as per the recent analysis of Mike:

http://fixunix.com/kernel/556867-regression-benchmark-throughput-loss-a622cf6-f7160c7-pull.html

all scheduler components of this regression have been eliminated.

In fact, his numbers show that scheduler speedups since 2.6.22 have offset and hidden most other sources of tbench regression: the scheduler portion got 5% faster, so it was able to offset a roughly 5% slowdown in the other areas of the kernel that tbench exercises.
Although I respect the improvements, wake_up() is still several orders of magnitude slower than it was in 2.6.22, and wake_up() is at the top of the profiles in tbench runs.

hm, several orders of magnitude slower? That contradicts Mike's numbers and my own numbers and profiles as well: see below.

The scheduler's overhead barely even registers on the 16-way x86 system I'm running tbench on. Here's the NMI profile during a 64-thread tbench run on that 16-way x86 box with a v2.6.28-rc5 kernel [config attached]:

Throughput 3437.65 MB/sec 64 procs
==================================
 21570252 total
  1494803 copy_user_generic_string
   998232 sock_rfree
   491471 tcp_ack
   482405 ip_dont_fragment
   470685 ip_local_deliver
   436325 constant_test_bit          [ called by napi_disable_pending() ]
   375469 avc_has_perm_noaudit
   347663 tcp_sendmsg
   310383 tcp_recvmsg
   300412 __inet_lookup_established
   294377 system_call
   286603 tcp_transmit_skb
   251782 selinux_ip_postroute
   236028 tcp_current_mss
   235631 schedule
   234013 netif_rx
   229854 _local_bh_enable_ip
   219501 tcp_v4_rcv

[ etc. - see full profile attached further below ]

Note that the scheduler does not even show up in the profile until entry #15!

I've also summarized NMI profiler output by major subsystems:

NET overhead        (12603450/21570252): 58.43%
security overhead   ( 1903598/21570252):  8.83%
usercopy overhead   ( 1753617/21570252):  8.13%
sched overhead      ( 1599406/21570252):  7.41%
syscall overhead    (  560487/21570252):  2.60%
IRQ overhead        (  555439/21570252):  2.58%
slab overhead       (  492421/21570252):  2.28%
timer overhead      (  226573/21570252):  1.05%
pagealloc overhead  (  192681/21570252):  0.89%
PID overhead        (  115123/21570252):  0.53%
VFS overhead        (  107926/21570252):  0.50%
pagecache overhead  (   62552/21570252):  0.29%
gtod overhead       (   38651/21570252):  0.18%
IDLE overhead       (       0/21570252):  0.00%
---------------------------------------------------------
left                ( 1349494/21570252):  6.26%

The scheduler's functions are absolutely flat, and consistent with an extreme context-switching rate of 1.35 million per second. The scheduler can go up to about 20 million context switches per second on this system:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
32 0 0 32229696 29308 649880 0 0 0 0 164135 20026853 24 76 0 0 0
32 0 0 32229752 29308 649880 0 0 0 0 164203 20032770 24 76 0 0 0
32 0 0 32229752 29308 649880 0 0 0 0 164201 20036492 25 75 0 0 0

... and 7% scheduling overhead is roughly consistent with 1.35/20.0.
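
Spelled out (just the arithmetic behind that sanity check):

\[
  \frac{1.35 \times 10^{6}}{20.0 \times 10^{6}} \approx 6.75\%
  \qquad \text{vs.} \qquad
  \frac{1599406}{21570252} \approx 7.41\%
\]

i.e. the scheduler's share of the profile hits matches, to within about a
percentage point, the fraction of its peak context-switch throughput that
this workload actually drives.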

Wake-up affinities and data-flow caching are just fine in this workload - we've got scheduler statistics for that and they look good too.

It all looks like pure old-fashioned straight overhead in the networking layer to me. Do we still touch the same global cacheline for every localhost packet we process? Anything like that would show up big time.

Yes we do; I find it strange that we don't see dst_release() in your NMI profile.

I posted a patch (commit 5635c10d976716ef47ae441998aeae144c7e7387,
"net: make sure struct dst_entry refcount is aligned on 64 bytes",
in the net-next-2.6 tree)
to properly align the struct dst_entry refcount, and got a 4% speedup on tbench on my machine.
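
To illustrate the idea (a minimal userspace sketch of the layout pattern,
with made-up field names -- not the actual net-next commit): keep the hot,
per-packet atomic refcount on its own 64-byte cache line, so its writes do
not keep invalidating the line that holds the read-mostly routing fields.

#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

#define CACHELINE 64

struct fake_dst_entry {
        /* read-mostly fields, consulted for every packet */
        void *dev;
        void *ops;
        unsigned long expires;

        /* hot refcount, modified per packet: give it its own cache line */
        alignas(CACHELINE) atomic_long refcnt;
        int use_count;
};

int main(void)
{
        /* check that the refcount really starts on a fresh cache line */
        size_t off = offsetof(struct fake_dst_entry, refcnt);

        printf("offsetof(refcnt) = %zu, cacheline-aligned: %s\n",
               off, (off % CACHELINE == 0) ? "yes" : "no");
        return 0;
}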

There are small speedups too from commit ef711cf1d156428d4c2911b8c86c6ce90519dc45
("net: speedup dst_release()").

Also in net-next-2.6 there are patches that avoid dirtying last_rx on net devices (loopback, for example);
this helps tbench a lot too.
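
The pattern behind those last_rx changes, roughly (again only an
illustrative sketch with made-up names, not the actual patches): a field
like dev->last_rx is read rarely but was written on every received packet,
so every CPU handling loopback traffic kept bouncing that cache line. One
way to avoid dirtying it is to store only when the value would actually
change, so the common path leaves the line clean:

#include <stdio.h>

struct fake_netdev {
        unsigned long last_rx;   /* jiffies of last receive */
        /* ... other fields ... */
};

static void note_rx(struct fake_netdev *dev, unsigned long jiffies_now)
{
        /* skip the store (and the resulting cacheline dirtying)
         * if nothing would change */
        if (dev->last_rx != jiffies_now)
                dev->last_rx = jiffies_now;
}

int main(void)
{
        struct fake_netdev lo = { .last_rx = 0 };
        unsigned long jiffies_now = 1000;

        /* many packets within the same jiffy: only the first one writes */
        for (int i = 0; i < 5; i++)
                note_rx(&lo, jiffies_now);

        printf("last_rx = %lu\n", lo.last_rx);
        return 0;
}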

