Re: Interesting analysis of linux kernel threading by IBM

From: Peter Rival (frival@zk3.dec.com)
Date: Thu Jan 20 2000 - 17:06:10 EST


Mark Hahn wrote:

> > > > bigger thing for Alpha than for Intel). On our newer systems we fully expect
> > > > hundreds, if not thousands, of tasks. The more commercially accepted Linux
> > >
> > > the issue is *runnable* tasks. do your machines routinely report
> > > loadaverages of 1000? if so, I'm impressed!
> > >
> >
> > Yes, I did get that. :) And yes, they do routinely report such load averages. As
> > a matter of fact, I was just stress-testing a system and when I looked the load
> > average was 2002. Granted, the test I was running (AIM) is artificial (all tests
> > of its ilk are), but it was designed as a representative measure of system
> > performance. Representative in that it represents what systems actually do.
>
> AIM is certainly NOT representative of what most systems do.
> many of us have administered and used so many overloaded multiuser
> machines, and they never look like that. it's perfectly reasonable
> for Linux to say that if you are among the .01% of machines that have
> runqueues that long, try this patch to sched.c...
>

Agreed that AIM is not directly representative of what "most" systems do. Maybe a
few. But that's one of the reasons we can play with the workfiles - to better
simulate something that might happen in real life. Enough of that, though - no
benchmark simulates real life with complete accuracy. Nuff said.

>
> > Do we here have systems with that high a load average from actual use? No - but
> > we're not running any massive databases or web servers either. Do I have hard
> > proof that our customers do that? I doubt anyone would tell me if they did anyway
> > ;)
>
> without proof that this is an issue, it's not an issue.
> seriously, I have a hard time thinking of why a massive DB or webserver
> would ever have large numbers of runnable processes (not blocked on IO.)
>

Bad examples. *shrug*

>
> > Point is, it's possible, and it's becoming more and more probable every day. (I
> > had a co-worker testing a system and he gave up at a load average of somewhere
> > around 16000 because he didn't want to wait any longer.) Remember - this is the
>
> high loadaverages are a sign that something's wrong. servers should be
> IO-bound.
>

A high load average can also be a sign that something is right. Our common example
here is Ebay or somesuch (think of something like the US House of Representatives web
site when the Lewinsky report hit). The thing to worry about is when a massive hit of
unexpected traffic arrives and being IO bound is not the only problem. In other
words, don't make things _worse_ than they already are just because of the scheduler.

>
> > age of server consolidation - more stuff on fewer systems. Lots of little piddly
> > things on tiny boxen scattered here, there, and everywhere was the NT way, and it has
> > proven not to work (on a large scale at the very least).
>
> this is the age of clustering, actually. please, let's not talk ourselves
> into reliving the horrors of mainframes.
>

First, mainframes are already here (both again and still). Second, yes, this is the age of
clustering as well. Sometimes now we even get to do both inside the same box.

>
> > > the issue here is whether someone can come up with a maintainable
> > > scheduler that has the requisite performance. since the runqueue is
> > > normally short, the scheduler's performance function must have a
> > > very small constant term. if it's true that there are applications
> > > that result in long runqueues, then the performance curve needs to
> > > be as flat and horizontal as possible, again, without degrading the
> > > constant term.
> > >
> >
> > Agreed. That's one of the reasons that Tru64 has a per-cpu runqueue backed by a
>
> per-cpu scalability is very much a good thing. I expect 2.5.[1-4] will
> contain massive per-cpu patches. but this doesn't imply that the scheduler
> should be optimized for many runnable tasks!
>

I didn't mean to imply that we should optimize it for many runnable tasks. I just want
something that doesn't penalize that case as badly as we do now. Phil Ezolt has
already shown several cases where we currently fall over at the high end, due in
large part to our O(n) searches of the runqueue.
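
For anyone who hasn't stared at sched.c lately, here's a rough user-space sketch of
the kind of O(n) scan I mean (this is not the kernel's code, and the goodness()
weighting below is invented purely for illustration): every scheduling decision walks
the entire runqueue, so the cost of picking the next task grows linearly with the
number of runnable tasks.

/*
 * Rough user-space sketch of an O(n) runqueue scan, in the spirit of the
 * schedule()/goodness() loop being discussed.  This is NOT the kernel's
 * code; the weighting below is invented purely for illustration.
 */
#include <stdio.h>
#include <stdlib.h>

struct task {
    int counter;        /* remaining timeslice */
    int priority;       /* static priority */
    struct task *next;  /* single, global runqueue */
};

/* Invented stand-in for goodness(): higher means "run me next". */
static int goodness(const struct task *t)
{
    return t->counter + t->priority;
}

/*
 * Every scheduling decision walks the whole runqueue -- O(n) in the
 * number of runnable tasks, which is exactly what hurts when n is in
 * the thousands.
 */
static struct task *pick_next(struct task *runqueue)
{
    struct task *t, *best = NULL;
    int w, best_weight = -1;

    for (t = runqueue; t != NULL; t = t->next) {
        w = goodness(t);
        if (w > best_weight) {
            best_weight = w;
            best = t;
        }
    }
    return best;
}

int main(void)
{
    struct task *runqueue = NULL, *t;
    int i;

    /* Build a toy runqueue of 2000 runnable tasks -- roughly the
     * load average mentioned earlier in this thread. */
    for (i = 0; i < 2000; i++) {
        t = malloc(sizeof(*t));
        t->counter = rand() % 20;
        t->priority = rand() % 40;
        t->next = runqueue;
        runqueue = t;
    }

    t = pick_next(runqueue);
    printf("picked a task with weight %d after scanning 2000 tasks\n",
           goodness(t));
    return 0;
}

A per-CPU runqueue, like the Tru64 one mentioned above, attacks the same problem from
the other side: each CPU only ever scans its own, much shorter, list.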

>
> > > AFAIK, loopback volanomark does not resemble _any_ real application.
> > >
> >
> > No, probably not technically. It just means that they didn't have to configure a
> > big system with a whole bunch of clients and enough bandwidth to put the same type
> > of load on the system. With the advent of multi-hundred MB/s plus Internet
> > connections and massive Intranet requirements, such bandwidth isn't impossible to
> > imagine.
>
> the issue is that volano consists of many threads, each blocked only by
> talking to other threads. if any real app were designed like this,
> you'd scream "STUPID!" in the face of its designers. real apps sometimes
> use many threads, but those threads wind up blocked on IO most of the time.
> similarly, big multiuser systems are normally blocked on brain IO.
>

Good enough. I thought you were talking about the fact that it was a loopback test
rather than true client-server.

>
> rather than prematurely optimizing the scheduler, which seems to run
> just fine for real loads, it might be interesting to look at scaling
> _wakeup_, that is when many threads are going into or coming out of
> an IO sleep. *that* is something that real, big machines might need.
>

That's fine to go with for now. But I refuse to blindly walk away from the scheduler
after seeing some of the profiling data from it. Maybe some of this has changed in
2.3...maybe not. But the fact that what we currently have is non-optimal for larger
systems should not be ignored just because that's not what most of us run right now.
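
On the wakeup side, for what it's worth, here's a user-space analogy (POSIX threads,
nothing to do with the kernel's actual wait-queue code) of wake-one versus wake-all:
pthread_cond_broadcast() makes every blocked thread runnable just so most of them can
go straight back to sleep, while pthread_cond_signal() wakes exactly one. The
thundering-herd cost of the former is the sort of wakeup scaling problem being
described.

/*
 * User-space analogy for wake-one vs. wake-all, using POSIX threads.
 * Just a sketch of the thundering-herd idea -- it has nothing to do
 * with the kernel's actual wait-queue or wake_up() code.
 */
#include <pthread.h>
#include <stdio.h>

#define NWAITERS 8

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int work_available = 0;

static void *waiter(void *arg)
{
    long id = (long)arg;

    pthread_mutex_lock(&lock);
    while (work_available == 0)         /* all the waiters block here */
        pthread_cond_wait(&cond, &lock);
    work_available--;                   /* exactly one unit per waiter */
    pthread_mutex_unlock(&lock);

    printf("waiter %ld got a unit of work\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NWAITERS];
    long i;

    for (i = 0; i < NWAITERS; i++)
        pthread_create(&threads[i], NULL, waiter, (void *)i);

    /*
     * Hand out one unit of work at a time.  pthread_cond_signal()
     * wakes at most one blocked waiter per unit; using
     * pthread_cond_broadcast() here instead would make every waiter
     * runnable at once just so most of them could go straight back
     * to sleep -- the thundering herd.
     */
    for (i = 0; i < NWAITERS; i++) {
        pthread_mutex_lock(&lock);
        work_available++;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }

    for (i = 0; i < NWAITERS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}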

 - Pete



