Let's get this right (WAS RE: MOSIX and kernel mods)

Albert D. Cahalan (acahalan@cs.uml.edu)
Sat, 6 Mar 1999 06:41:46 -0500 (EST)


Sean Hunter writes:

> We need to all put the guns down and think a bit.

You too. Don't claim to be impartial; you are far from it.

> Moving to a distributed approach to computing is a fundamental change,
> and if we make the wrong design and architecture decisions now, we
> will be stuck with them for a while. If commercial linux apps come to
> rely on a distributed API we provide, we had better be sure that it
> doesn't suck

With "suck" being "mess up how the kernel works", yes.

> To summarise, two lines of argument have emerged so far in this
> debate:
>
> 1)"DSM sucks because it encourages bad code. It shouldn't go in"

"...and everybody should be forced to write kick-ass code that uses
every last cycle, without consideration for programmer time.
This is how Real Men (like myself of course) write code, and, as
everyone knows, I am always right."

> For what it's worth, I am firmly in the former camp. Linux as an OS is
> all about excellence.

Excellence means supporting nearly everything.

> It doesn't have to cater to commercial
> considerations about time(or cost)-to-market because of the way in
> which development is done.

So you think Linux is not suitable for a commercial environment?
What about researchers who don't have your copious free time?
I'm sure cluster computing is popular with home users.

> It doesn't have to provide a second-best
> solution because that's all "the development team" has been able to
> come up with.

This is nonsense. You would almost be right if we were talking about
kernel internals, but our NFS and ext2 are both second-best solutions.
I'm sure Linus has a long list of things that could use improvement,
but we use the existing code anyway because we need it.

First-best kernels support second-best userspace software.

> It doesn't have to put something into the kernel just
> because it makes life easier for some people in the short term, and we
> all know that we can't afford to accept sloppiness just because people
> say "Well, it's a compile option, you can just leave it out."

What sloppiness? Do you claim that the kernel code is sloppy, or do
you claim that the kernel should only support perfect userspace code?

> We need a solution that we can live with for the next ten years, not
> one that works now, and then we have to ditch at the next major
> release with much gnashing of teeth.

FUD, FUD, FUD. (Why?) Distributed memory won't need to be ditched.

> I'd like to ask (and then discuss) some questions:
>
> 1)Is distributed IPC a useful abstraction?
> 2)Is distributed shared memory a necessary part of that abstraction?
> 3)Does this of necessity _have_ to be part of the kernel?

Yes, yes, yes.

> Question 1. This is about "why do we want to do this?" Are we trying
> to do it because:
>
> a)it sounds hip in a press release?
> b)it'll get me my phd?
> c)I'll have a bit of my own code in the kernel, girls will think I'm
> attractive etc etc
> d)There are real benefits to real applications.

Mostly (d), though the others don't hurt.

> Why might we want to go distributed? The obvious answers are:
> a)speed - My job is inherently parallel and I just can't get enough
> raw processing on a single node.
> b)use of resources - I've got all these computers sitting idle when
> they could be predicting the stock market for me.
>
> Now, if we want to go distributed to make things run faster, then we'd
> better be sure that our task is suited to it, that our program is
> well-written and our API is fast, otherwise we have a performance loss
> when we wanted a performance gain.

NO WAY!!!

Let's say you have 3 choices:

1. 95% use of a single machine
2. 80% use of a 16-way cluster, but 2 weeks development time
3. 60% use of a 16-way cluster, but 2 hours development time

The job takes a month to run on one machine, but you need results
within a week. Only option 3 gets the job done, and you can even
run the job twice in case your cluster loses power and trashes
all the filesystems. (Option 3 is the distributed shared memory approach.)
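
A quick back-of-the-envelope check of those numbers (a sketch with my
own assumptions: the "month" is 30 days of compute and the deadline is
7 days; the post itself only gives the percentages and rough times):

# Hypothetical sanity check of the three options above; the 30-day job
# size and the 7-day deadline are assumptions, not from the original post.
job_days = 30.0
deadline_days = 7.0

options = {
    "1: single machine, 95% use":  (1,  0.95, 0.0),
    "2: 16-way cluster, 80% use":  (16, 0.80, 14.0),      # 2 weeks of development
    "3: 16-way cluster, 60% use":  (16, 0.60, 2.0 / 24),  # 2 hours of development
}

for name, (nodes, efficiency, dev_days) in options.items():
    run_days = job_days / (nodes * efficiency)
    total = dev_days + run_days
    verdict = "meets" if total <= deadline_days else "misses"
    print(f"{name}: {total:.1f} days total ({verdict} the deadline)")

Under those assumptions option 1 takes about 32 days and option 2 about
16, while option 3 finishes in roughly 3 days; even running the job
twice still fits inside the week.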

> If you just want to use up spare cycles on boxes on the network, and
> don't care whether or not it's fast, ask yourself "Does this really

You are making the bad assumption that "fast" must always mean "fastest
possible, without concern for anything else". You are wrong.
"Fast" is a relative term, with an implied reference to what is
considered normal.

> Question 2: This is about "Do our apps need to know about the

Since you didn't answer this, I'll do it for you: yes. People often
need to reuse code and skills. Human time is limited, while the
computer can spend all weekend running imperfectly optimized code.

> Question 3: This is two parts: "Will this tie our hands in the
> future?", and it's corollary: "Will this benefit us so much that we
> will still be prepared to support it when we aren't writing our phd?"

This being Linux, people tend to support things.

> I don't mean to sound cynical, but everyone who has worked in a large

You sound like a fudmeister actually.

> Finally, if you don't read anything else I have written read this.
> Ask yourself "Is this the best we can do?". Why should we settle for
> anything less than the best solution? We have some of the best

The best solution: support userspace as completely as possible.

It is simply cruel to force your idea of perfection on people who
have completely different needs than you do.
