Re: Interesting analysis of linux kernel threading by IBM

From: David Lang (dlang@diginsite.com)
Date: Fri Jan 21 2000 - 22:50:07 EST


Another issue with SMP is simply hardware scalability. If you have not
read "In Search of Clusters", try to take a look at it; it goes into a LOT
of detail on exactly what is going on under the covers to make your SMP
box work, and why for very large applications it becomes questionable.

David Lang

 On Thu, 20 Jan 2000, Larry McVoy wrote:

> Date: Thu, 20 Jan 2000 20:01:34 -0800
> From: Larry McVoy <lm@bitmover.com>
> To: Peter Rival <frival@zk3.dec.com>
> Cc: linux-kernel@vger.rutgers.edu
> Subject: Re: Interesting analysis of linux kernel threading by IBM
>
> : First of all, you're right.
>
> Ahhh, I love it when people suck up to me :-)
> Actually, smart move, it set my mind to actually reading the rest of your
> message. Everybody? Please start out all mail to me with "You're right".
>
> What? No? You don't think so? Ahh, well, I can hope :-)
>
> : Larry, you mentioned that you had thoughts on what Linux has to do to work on big
> : servers. I'm imagining, given your background and the content of this email, that
> : they are also ideas that won't harm the performance on small systems (e.g.
> : desktops). I must have missed them somehow - could you recap?
>
> OK, but they are pretty radical. Linus and I have talked them over and he
> has always been of the opinion "sounds good, sounds like it might be right,
> where's the code?". And I'm sidetracked onto BitKeeper.
>
> Whatever, can the people who are really interested in high performance
> take a look at
>
> http://www.bitmover.com/llnl/smp.{ps,pdf}
> and then
> http://www.bitmover.com/llnl/labs.{ps,pdf}
>
> I'll briefly summarize here. No justification for these statements is
> given here; there is some in the papers.
>
> Premise 1: SMP scaling is a bad idea beyond a very small number of processors.
> The reasoning for this is that when you start out threading a kernel,
> it's just a few locks. That quickly evolves into more locks, and
> for a short time, there is a 1:1 mapping between each sort of object
> in the system (file, file system, device, process, etc) and a lock.
> So there can be a lot of locks, but there is only one reader/writer
> lock per object instance. This is a pretty nice place to be - it's
> understandable, explainable, and maintainable.
>
> Then people want more performance. So they thread some more and now
> the locks aren't 1:1 to the objects. What a lock covers starts to
> become fuzzy. Things break down quickly after this, because it
> becomes unclear whether you are covered or not and it's too much work
> to figure it out, so each time a thing is added to the kernel, it
> comes with a lock. Before long, your 10 or 20 locks are 3000 or
> more, like what Solaris has. This is really bad: it hurts performance
> in far-reaching ways and it is impossible to undo.
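>
> To make that "one reader/writer lock per object instance" stage a bit
> more concrete, here's a minimal userspace C sketch -- pthreads rather
> than kernel locks, and the struct and function names are made up for
> illustration, not actual kernel code:
>
>     #include <pthread.h>
>
>     /* Each object instance carries exactly one rwlock that covers
>      * every field in it -- easy to reason about and to maintain. */
>     struct file_object {
>             pthread_rwlock_t lock;
>             long size;
>     };
>
>     static void file_object_init(struct file_object *f)
>     {
>             pthread_rwlock_init(&f->lock, NULL);
>             f->size = 0;
>     }
>
>     static long file_object_get_size(struct file_object *f)
>     {
>             long sz;
>
>             pthread_rwlock_rdlock(&f->lock);  /* readers may run concurrently */
>             sz = f->size;
>             pthread_rwlock_unlock(&f->lock);
>             return sz;
>     }
>
>     static void file_object_set_size(struct file_object *f, long sz)
>     {
>             pthread_rwlock_wrlock(&f->lock);  /* writers are exclusive */
>             f->size = sz;
>             pthread_rwlock_unlock(&f->lock);
>     }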
>
> Premise 2: most/all locking follows a canonical form of "take a global
> data structure, split it up into N, where N is a function of the
> number of CPUs, and give each CPU or group of CPUs its own data
> structure." Classic example: global/local run queues.
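>
> A rough userspace sketch of that split (pthreads, a made-up task
> struct, not the real scheduler): one global queue becomes one queue
> and one lock per CPU, so each CPU mostly contends only on its own lock.
>
>     #include <pthread.h>
>     #include <stddef.h>
>
>     #define NR_CPUS 8
>
>     struct task {
>             struct task *next;
>     };
>
>     /* One queue (and one lock) per CPU instead of one global queue. */
>     struct run_queue {
>             pthread_mutex_t lock;
>             struct task *head;
>     };
>
>     static struct run_queue rq[NR_CPUS];
>
>     static void init_run_queues(void)
>     {
>             int i;
>
>             for (i = 0; i < NR_CPUS; i++) {
>                     pthread_mutex_init(&rq[i].lock, NULL);
>                     rq[i].head = NULL;
>             }
>     }
>
>     static void enqueue_task(int cpu, struct task *t)
>     {
>             struct run_queue *q = &rq[cpu % NR_CPUS];
>
>             pthread_mutex_lock(&q->lock);   /* only this CPU's lock */
>             t->next = q->head;
>             q->head = t;
>             pthread_mutex_unlock(&q->lock);
>     }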
>
> Premise 3: it is far easier to take a bunch of operating system images
> and make them share the parts they need to share (i.e., the page
> cache), than to take a single image and pry it apart so that it
> runs well on N processors.
>
> All of this leads us to an interesting twist on the clustering idea,
> one that I give credit for mostly to DEC (they have some of this
> implemented already). Suppose you were to take a single big machine
> and run another instance of the OS for every N processors, where N is chosen
> so that it is well under the knee of the locking curve, i.e., around 4,
> maybe 8, but certainly no more than 8.
>
> Multiple OS's on a single box? Wacky, huh? But if you think about it,
> you've just instantly taken *EVERY* data structure in the kernel and
> multithreaded it. Cool, no? And it cost you nothing but some boot
> code.
>
> That's kinda cute but not very useful because what you really want is to
> be able to have all processors working together on the same data with only
> one copy of the data. In other words, I don't care if I have one OS or
> 1000, I want all processors to be able to mmap /space/damn_big_file and
> poke at it. And I don't want any stinkin' DSM - I want real,
> hardware-based coherency. Well, bucky, I'm here to tell ya, praise the
> lord, you can have it :-) You need to make an SMPFS which lets other
> OS's put reference counts on your inodes. The operation is extremely
> similar to what you have to do when you invalidate a page - you shoot
> down the other processors' TLB entries. So we need something like
> that in reverse.
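>
> Just to make the hand-waving a bit more tangible, here's a toy C
> sketch of what such a cross-instance refcount message might look
> like. None of this is real code; every name, field, and the message
> layout is made up:
>
>     #include <stdint.h>
>
>     /* Made-up message one OS instance sends to the instance that owns
>      * an inode: "take a reference for me" / "drop my reference".
>      * Analogous to a TLB-shootdown IPI, but pinning rather than
>      * invalidating. */
>     enum smpfs_op {
>             SMPFS_GET_REF,
>             SMPFS_PUT_REF,
>     };
>
>     struct smpfs_msg {
>             uint32_t sender_os;     /* which OS instance is asking */
>             uint32_t op;            /* SMPFS_GET_REF or SMPFS_PUT_REF */
>             uint64_t inode_nr;      /* inode on the owning instance */
>     };
>
>     struct smpfs_inode {
>             uint64_t inode_nr;
>             uint32_t local_refs;    /* refs held by the owning instance */
>             uint32_t remote_refs;   /* refs held by other instances */
>     };
>
>     /* Owning instance's handler: the inode may only be reclaimed once
>      * local_refs + remote_refs drops to zero. */
>     static void smpfs_handle_msg(struct smpfs_inode *ino,
>                                  const struct smpfs_msg *m)
>     {
>             if (m->op == SMPFS_GET_REF)
>                     ino->remote_refs++;
>             else if (m->op == SMPFS_PUT_REF && ino->remote_refs > 0)
>                     ino->remote_refs--;
>     }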
>
> If you think I'm waving my hands wildly, I am. But this is definitely
> doable, and as hard as it seems, it is easily an order of magnitude easier
> than threading the kernel to get to even 32 processors. I've lived
> through that twice; it's about a 7-year process (in hindsight;
> beforehand, everyone said it would be maybe 18 months).
>
> Read the papers. Think. Think again. Let's talk. I can set up a
> perf@bitmover alias if this becomes too off-topic.
>
> --lm
>
> P.S. I call these SMP clusters, to distinguish them from HA or HPC clusters.
> SMP clusters are for the enterprise - these are the clusters that will get
> Linux on big iron running Oracle and kicking serious butt. Fast.
>
>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/


