Re: Low Latency

From: Robert Dinse (nanook@eskimo.com)
Date: Sun Jul 02 2000 - 23:40:54 EST


On Sun, 2 Jul 2000, Andrew Morton wrote:

> Date: Sun, 02 Jul 2000 23:24:32 +1000
> From: Andrew Morton <andrewm@uow.edu.au>
> To: Robert Dinse <nanook@eskimo.com>, Paul Barton-Davis <pbd@Op.Net>
> Cc: linux-kernel@vger.rutgers.edu
> Subject: Re: Low Latency
>
> Kai Henningsen wrote:
> >
> > So: IMAO, the right way is (1) fix those areas where you can see that
> > there is obviously a local problem
>
> Yes.
>
> Linus is okay with the addition of a few tasteful and carefully designed
> conditional_reschedule() calls to work around this issue in 2.4.
>
> Robert & co are okay with an _occasional_ scheduling delay - with a
> sufficiently powerful and correctly set up machine, it could well be
> that a tiny subset of the low-latency patch will fix their problems.
>
> So the way forward is to identify the worst
> areas where these stalls are occurring and try to cleanly fix them.
> Surely?
>
> I've done a few measurements on a 2x500MHz machine. Both of the
> attached files show results across a kernel build. All times (min, max,
> avg, total) are in microseconds.
>
> syscalls.txt shows the amount of wall-clock time a CPU spent in a
> particular system call.
>
> copy_usr.txt shows the elapsed time which the system spent in
> copy_to_user and copy_from_user. Again, this is wall-clock time, so any
> interrupt which occurs will add to these times.
>
> These numbers aren't particularly useful yet, but they can be improved
> when I understand the problem better.
>
> They do show that the longest time that was ever spent in copy_*_user()
> was 570 uSecs, which would suggest that this was simply a convenient
> place for the conditional_reschedule rather than being a CPU pig in
> itself.
>
> They also show that sys_brk can sometimes take a very long time. I
> really don't see why this should be, as at no time during the build did
> I have less than 50Mbytes of free memory.
>
> I'll try stopping & restarting the instrumentation within
> schedule(). This will give the _real_ scheduling latency figures.
> (eek. That's going to take a lot of coding. Several days worth).
>
> Does anyone actually _know_ where the problems are occurring?
>
> Has anyone incrementally applied the low-latency patch to see which
> fixes are the most effective?
>
> Robert, Paul: test cases, please. Can you please tell me what system
> activity causes the most problems? Is there a particular system
> activity or application which will reliably kill scheduling latency? If
> I can't reproduce it here will you be able to apply a truckload of
> patches (against 2.4.0) so we can get to the bottom of it?
>
> Generally speaking, can we please all stop shouting at each other and,
> umm, y'know, do some technical stuff?

     Andrew,

     You are the first person thus far who seems willing to approach this
from a reasonable perspective.

     I thought I made it clear in my original post that I had tried these
patches out of curiosity, not because I had a hard real-time need, and found
that they considerably improved the responsiveness of systems under load.

     For me it's the difference between typing an 'ls -l' on a system that is
serving 1 million web hits/day and 30-40 concurrent ftp sessions and getting
damned near instantaneous output with the prompt back immediately, versus
typing the same 'ls -l' under the same conditions and having the system pause
about five seconds, spew everything to my screen, and then take another 3 or 4
seconds to come back with a prompt.

     Ingo's patch DOES make this system less stable, but UP systems seem to
remain stable with it in place.

     Maybe it's just me, but I think the way a system responds subjectively
plays a significant role in whether someone is going to use it or not.

     I don't have any hard measurements to provide; I know the audio people did
provide some a while back. If you have suggestions for how to make such
measurements and it's something doable, then I'm open to spending a little time
to do so.
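
     One thing I could do, if it would be of any use: a crude user-space probe
that just sleeps for a fixed interval and records how much later than
requested it actually wakes up while the box is under load.  This is only a
sketch of the idea, not a tool anyone here has posted, and on a stock kernel
the 10 ms timer tick will dominate the absolute numbers, so it would mainly
show relative differences between kernels:

        #include <stdio.h>
        #include <sys/time.h>
        #include <time.h>

        /* Microseconds between two gettimeofday() samples. */
        static long elapsed_us(struct timeval *a, struct timeval *b)
        {
                return (b->tv_sec - a->tv_sec) * 1000000L +
                       (b->tv_usec - a->tv_usec);
        }

        int main(void)
        {
                struct timespec req = { 0, 10000000 }; /* request 10 ms */
                long worst = 0;
                int i;

                for (i = 0; i < 1000; i++) {
                        struct timeval before, after;
                        long over;

                        gettimeofday(&before, NULL);
                        nanosleep(&req, NULL);
                        gettimeofday(&after, NULL);

                        /* How far past the requested 10 ms did we wake up? */
                        over = elapsed_us(&before, &after) - 10000;
                        if (over > worst)
                                worst = over;
                }
                printf("worst wakeup overshoot: %ld usec\n", worst);
                return 0;
        }

Run it once on an idle box and again with the web and ftp load going; the
change in the worst-case overshoot is roughly the latency we're arguing about.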

     Wouldn't the time spent in copy_*_user depend upon variables like CPU and
bus speed, the size of the chunk being read/written, etc.? In other words, if I
read/write a 300MB file in 20MB chunks, each call is going to spend more time
in the copy than if I do so in 1KB blocks. If I've got a 20 MHz SBus (and I do
on some boxes), it's going to be a lot slower than on a machine with a 133 MHz
bus. If I'm doing something with a 40 MHz CPU, it's going to take a lot longer
than for someone using a 1 GHz CPU.
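
     To put rough numbers on that: if the copy bandwidth is on the order of
100 MB/s (purely a ballpark figure I'm assuming for this class of hardware,
not something I've measured), a single call copying a 20MB chunk sits in
copy_*_user for roughly 20 MB / 100 MB/s = ~200 ms, while a 1KB chunk takes
about 1 KB / 100 MB/s = ~10 usec.  The worst-case time per call scales
directly with the chunk size and inversely with the hardware's copy speed.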

     So measurements are going to be relative to the hardware used, and I am
not suggesting for a moment that someone should expect 1-5 ms latency on a
25 MHz 386 box. But for a given hardware environment, critical applications
aside, lower latency is going to make the system feel more responsive and more
usable under a given load, which translates into being capable of doing more
useful work, because you CAN run it at a higher load and still maintain good
response.

     So any reasonable suggestions on how to measure this and determine where
problems exist?
