Re: Auto-Adaptive scheduler - Final chapter ( the numbers ) ...

From: Marco Colombo (marco@esi.it)
Date: Tue Feb 01 2000 - 08:53:54 EST


On Mon, 31 Jan 2000, Jamie Lokier wrote:

> Larry McVoy wrote:
> > : _If_ we see that low RQ applications don't switch much, then we can
> > : assert that low RQ applications aren't adversely affected by the patch.
> >
> > Huh? What about I/O servers? FTP, Web, etc? Are they not low RQ and
> > frequently switching applications? Yes, they are.
>
> If they are frequently switching, there must be several processes to
> switch between and there must be more than one runnable at a time. So I
> assume you mean the classic "one process per connection" FTP server and
> the classic "spawns CGI scripts" web server.
>
> I'm not sure where the run queue values tend to be in that case. If RQ
> < 1, then the switching is to/from the idle task, in which case idle
> time can absorb switching cost (at some latency cost). If 1 < RQ < 2,
> is there really that much switching going on? Etc.

I'm not sure about this. The RQ tends to grow when the CPUs are overloaded.
If the CPUs are overloaded, there's little you can do but add more CPUs or
change the rules of the game (that is, either optimize the application so
it does the same work in fewer CPU cycles, or completely re-design it to
use the CPU in a different manner).
Consider the example of a multi-threaded web server serving static pages
on a fast network:
- start with 0 load. All threads are sleeping. RQ is 0, switch rate is 0.
[ just to give numbers, these come from a vmstat 5 on a mostly idle server:
 r b w swpd free buff cache si so bi bo in cs us sy id
 0 0 0 7032 1612 6728 17736 0 0 0 1 110 15 0 1 99
]
- under low load, if the idle time is >0, we don't really care, since
  (by definition) we have CPU cycles to waste (latency is not considered).
  BTW, that's the precise job of the idle() task in a Unix-like OS, as
  you can find in books (in RL, the idle task does not busy-loop, I think).
- as the load (number of requests) increases, you reach the point at which
  you have idle = 0. In this example we reach it sooner or later since
  the job is CPU bound (by assumption it does no I/O).
  At this precise point, if the network is *fast*, and pages are small,
  you serve many of them. We're assuming that the CPU is the bottleneck
  here, so the more powerful the CPU, the more pages you serve.
  The more pages you serve, the more switches you have. The RQ is still close
  to 1 (in a perfect world it would be exactly 1).
- if the load increases above this point, you hit a HW limit, which is
  the CPU. I'll consider this later.
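
(Aside: if you want to watch these numbers yourself, here's a quick hack
of mine. It's only a sketch, and it assumes the fourth field of
/proc/loadavg has the usual "running/total" form; check your kernel
before trusting it.)

/* Sample the instantaneous run queue every 5 seconds (rough sketch).
 * Assumes /proc/loadavg looks like "0.20 0.18 0.12 1/80 11206", i.e.
 * the fourth field is running/total.  The reader itself counts as one
 * running process, so expect at least 1. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	for (;;) {
		FILE *f = fopen("/proc/loadavg", "r");
		double l1, l5, l15;
		int running, total;

		if (!f)
			return 1;
		if (fscanf(f, "%lf %lf %lf %d/%d",
			   &l1, &l5, &l15, &running, &total) == 5)
			printf("load %.2f  RQ %d  (of %d tasks)\n",
			       l1, running, total);
		fclose(f);
		sleep(5);	/* same interval as the vmstat above */
	}
}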

At the ideal point, I think the server implementation makes little
difference. At, say, 10000 pages/sec, the switch rate would be around 10000.
Using a single-threaded server would bring it down close to 0, but what
for? Handling multiple connections does have an overhead. It's this
overhead vs. the scheduler overhead (with RQ close to 1). Which one is
better depends a lot on the OS. Both application designs are good here.
On an OS that supports fast and efficient switching, but has bad support
for many connections (poll overhead), the multi-threaded one would perform
better. On an OS with a good poll() and a bad scheduler, the
single-threaded one wins.
[and I'm not sure I'm considering all types of server design, BTW]
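
Just so we're picturing the same thing, here is more or less the
single-threaded design I mean. It's a sketch of mine, not anyone's real
server: no error handling, no real HTTP, and MAXFDS is picked out of
thin air.

/* Minimal single-threaded poll() loop (illustration only: no error
 * handling, no real HTTP, connections are never removed from the set).
 * Note that every call to poll() passes the *whole* array, and the
 * server then scans it linearly for the few fds that are active. */
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAXFDS 10000		/* arbitrary number for the example */

void serve(int listen_fd)
{
	static struct pollfd fds[MAXFDS];
	int nfds = 1, i;
	char buf[4096];

	fds[0].fd = listen_fd;
	fds[0].events = POLLIN;

	for (;;) {
		poll(fds, nfds, -1);	/* sleep here when there's no work */

		for (i = 0; i < nfds; i++) {	/* the linear scan */
			if (!(fds[i].revents & POLLIN))
				continue;
			if (i == 0 && nfds < MAXFDS) {
				/* new connection: add it to the set */
				fds[nfds].fd = accept(listen_fd, 0, 0);
				fds[nfds].events = POLLIN;
				nfds++;
			} else if (i > 0) {
				/* read the request and send back the small
				 * static page; details omitted */
				read(fds[i].fd, buf, sizeof(buf));
			}
		}
	}
}

The multi-threaded design would instead block one thread per connection
and leave the choosing to the scheduler.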

I think the kernel designers' goal should be to support both choices as
well as possible. That is, have a fast scheduler *and* a good poll().
But we are now in kernel land, and we all know that perfect implementations
do not exist, and that there are trade-offs to be made. A scheduler may be
made as fast as possible, but only by assuming that the RQ is quite small.
And a poll() implementation may make other assumptions of the same kind
(I'm not familiar with its current implementation, sorry).

Now, let's go back to our ideal server condition, and increase the load.
The multi-threaded design "reacts" by increasing the RQ. That's because
you're hitting a HW limit (I assumed you already have an optimized
implementation, and you can't go further). When the RQ is 2, I'd call
this an overload situation. [Here, RL diverges a lot from our ideal case,
I'm afraid. But we can reason only on models anyway... B-)]
The proof is that, by buying another CPU or a faster CPU, you get a linear
performance gain (still assuming that the CPU is the bottleneck in the
new configuration too, of course).

Now let me make an impossible (in RL) assumption, for the sake of a simpler
comparison. Let's assume that the poll() overhead and the scheduler
overhead are the same. I mean, let's assume that the single-threaded
server also reaches its ideal condition (idle time close to 0) at the same
workload (10000 pages/sec).
This server has a queue too: its connection queue. Under light load,
the server sleeps in poll(), so idle time absorbs the poll() overhead.
At the ideal point, the server never sleeps, and the average number
of active connections is 1.
Now, as the load increases, how will the server react? The number of
active connections will increase. As I've said above, I'm not familiar
with poll(). But the problem is the same.
The scheduler has more than one active thread to choose from among the
more than 10000 present.
poll() has more than one active fd to choose from among the more than
10000 present.
[Implicit assumption here: every client requests a page every
second, on average. I don't claim this is a sensible assumption.
It's just a model, after all...]

Which server implementation is better here again falls into kernel land.
How does the scheduler react when the RQ starts to grow?
How does the poll() implementation react when the number of active
connections grows? Do they both perform a fast but linear scan?
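
I don't know the internals well enough to answer that. The
back-of-the-envelope I have in mind is just this (my own toy model, with
the made-up figures from above, not measurements):

/* Toy cost model, nothing more: steps/sec = events/sec * candidates
 * scanned per event, assuming both scans are linear and roughly one
 * poll() wakeup per page served.  For the scheduler the candidates are
 * the runnable threads (the RQ); for poll() they are all the fds handed
 * in per call.  Figures are the made-up ones from the text. */
#include <stdio.h>

int main(void)
{
	const double events = 10000.0;	/* pages/sec: the CPU-bound ceiling */
	const double mult[] = { 1.0, 2.0, 10.0, 100.0 };	/* load multiplier */
	int i;

	for (i = 0; i < 4; i++)
		printf("%4.0fx load: RQ ~%6.0f -> scheduler ~%.0f steps/s;"
		       "  fds ~%7.0f -> poll() ~%.0f steps/s\n",
		       mult[i],
		       mult[i], events * mult[i],
		       10000.0 * mult[i], events * 10000.0 * mult[i]);
	return 0;
}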

Now, let's make a jump. Just consider a 10x load. The number of threads
goes to 100000, and the RQ to 10 or more. The number of fds goes to 100000,
and the active connections to 10. Well, in the first place, you should
realize that you're really *overloading*: 10x more than your ideal
condition. That's your main problem. You need 10x the CPU power. There's
no software solution for this kind of problem.

Ok, now take my hand and let's jump again. We go to 100x load. I won't do
the calculations for you, but note that in the first case the RQ is now 100.
Don't you feel ridiculous? I do. That poor CPU is trying to do the job
of 100. I've just replaced your (0% idle) 650MHz Athlon with a 486SX/25,
and we're *benchmarking* it. And we're seriously thinking about optimizing
the scheduler a bit for such a condition.

That's enough of it. This exercise was just an illustration. It was not
meant to demonstrate anything. It's just an artistic picture of
my feelings on the matter.

I see the RQ as a flag. Every time it goes over 1, it means:
"I need more CPU!". If this happens occasionally, and for short periods,
it's ok. If your system's RQ is continuously > 1, you know that you may
consider buying another CPU. If it is > 2, you know that you *definitely*
should buy one. If you have the money, of course. [On SMP, you weight
the RQ by the number N of CPUs, of course: so 1 becomes N and 2 becomes
N + 1.]
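
Spelled out as code, the rule of thumb is just this (mine, nothing
official, with made-up sample readings):

/* The "RQ as a flag" rule of thumb, weighted by the number of CPUs:
 * RQ persistently above N means "consider another CPU", above N + 1
 * means "definitely buy one".  Sample numbers are made up. */
#include <stdio.h>

static const char *verdict(double rq, int ncpus)
{
	if (rq <= ncpus)
		return "fine";
	if (rq <= ncpus + 1)
		return "you may consider buying another CPU";
	return "you *definitely* should buy another CPU";
}

int main(void)
{
	const double samples[] = { 0.4, 1.8, 2.5, 3.7 };	/* sustained RQ readings */
	const int ncpus = 2;					/* example SMP box */
	int i;

	for (i = 0; i < 4; i++)
		printf("RQ %.1f on %d CPUs: %s\n",
		       samples[i], ncpus, verdict(samples[i], ncpus));
	return 0;
}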

I do think that the *final* scheduler patch, the Mother of All Scheduler
Patches, would be the following:
"Every time the RQ exceeds N + 1, produce the money to buy another CPU."
I'm sorry, I don't know the Linux kernel internals well enough to implement
it. Someone else will have to. Larry? Alan? Linus? B-) B-) B-)

.TM.

P.S.

RL is much different, because the CPU is not going to be the only bottleneck.
There's no "flag" for I/O overload, nor for excessive MM pressure,
nor for cache thrashing. The latter is subtle. You can easily measure
CPU overload by inspecting the RQ. vmstat (or other tools) may give
you hints to identify an I/O or MM problem. But a cache miss does
make the CPU look busy while it is not. On modern systems, idle time
is NOT at all a measure of idle CPU cycles. So you may find your CPU
100% busy in user+system time, while it's 90% idle, always waiting
for its caches to fill. One way to measure this might be to monitor the
CPU temperature: a 100% busy CPU that runs cool may indicate a cache
problem (but is that true? CPU gurus, please say a word here). Another way
is (as someone already suggested) to turn off the cache and compare the
performance... both are empirical.

-- 
      ____/  ____/   /
     /      /       /			Marco Colombo
    ___/  ___  /   /		      Technical Manager
   /          /   /			 ESI s.r.l.
 _____/ _____/  _/		       Colombo@ESI.it

