[sh4][2.6.17] latency peaks with unix sockets on heavy loads

From: guillaume ranquet
Date: Mon Sep 22 2008 - 12:03:54 EST



I'm experiencing little glitches when I try to send/receive data
over local (AF_UNIX) sockets under heavy load.
under normal load it behaves normally, but as load increases, I
get latency peaks once every second.
an image is worth a thousand words:
http://img255.imageshack.us/img255/3700/capplottsy7.th.png
X: time elapsed since the beginning of execution
Y: call latency
red: under heavy load
green: no load at all
those peaks of 200ms really disturb me, as I'm using the sockets for
RPC calls and 200ms (and far more as load increases) is really too
much.
I've been testing various things:
-enabling/disabling kernel preemption :: no effect
-active waiting (doing some cpu-consuming work) between RPC calls ::
no effect (even worse)
-non-blocking sockets :: no improvement, and they never return
EWOULDBLOCK
setting the policy to SCHED_FIFO solves the problem:
http://img47.imageshack.us/img47/4449/capplotschedfifoxh5.th.png
also, adding a usleep(0) between each call (still with the
SCHED_NORMAL policy) removes the peaks.
from my understanding, usleep(0) puts the task to sleep until the
next tick is emitted and may cause a context switch if there's
another runnable task.
calling sched_yield() once every 1000 calls also helps greatly (some
peaks still appear here and there, though).

upgrading to 2.6.23:
h00rray, it solves everything:
http://img371.imageshack.us/img371/7028/capplotkernel2623lldnl6.th.png
still, the mean time is a bit higher, which adds about 30% overhead
when running the test.
but my problem is that I can't upgrade my kernel (yet) and need to
find a solution for 2.6.17.
I couldn't reproduce the 2.6.17 behavior on 2.6.23, no matter the
kernel config.
what has changed between the 2 kernels that could impact that
'glitch':
-the lock classes of the AF_UNIX domain became bh-unsafe :: seems out
of suspicion, since the peaks haven't shown up with AF_INET sockets
-the scheduler for SCHED_NORMAL tasks has been completely rewritten ::
seems to be the culprit behind the new (improved?) behavior

is this a known bug of the pre-CFS scheduler?
or am I totally wrong to blame the scheduler?

is there a solution with 2.6.17 and SCHED_NORMAL?

ps: since I'm not subscribed (my e-mail account can't handle the traffic),
would you please CC me?
