Re: Questions about EROS

shapj@us.ibm.com
Thu, 24 Jun 1999 10:45:11 -0400


I've included Iain's entire mail below, as he has asked some very good questions
in a very clear way, but neglected to copy the cap-talk list. Apologies for
length to the Linux list. I'm going to answer his file system question
separately, because there are a lot of issues buried in the question that need
to get teased out.

** On massive temporary computation state:

Iain is correct that there is a serious issue here. If your computation modifies
all of memory, and does so in a truly random way, and it does so fast enough,
and you have managed to get the physical memory on the machine tuned so that it
just exactly holds the problem without doing I/O, checkpointing is going to be a
problem. I have several reactions to this, none of which really solve the
problem from Iain's point of view.

Before I suggest the workarounds, I want to note that Iain's problem identifies
a very peculiar border case in which the problem "just fits" in the memory.
Were the problem slightly smaller or slightly larger the issues would be quite
different. I submit that if you tune your machine carefully enough you can
arrive at a similar issue on almost any operating system, and the fact that
checkpointing gets you at one very specific point in the optimization space
isn't all that compelling a problem.

Also, I want to note that this class of applications simply isn't what EROS was
designed to do well.

Given which, and agreeing that in his scenario there is a problem, here are the
workarounds that are possible in EROS:

1. First, the checkpoint interval is something you can change. For the
application that Iain mentions this might well be an appropriate thing to do.
You still might not want to turn it off entirely. Consider running a larger
version of his same computation. If the computation runs for several days, you
might well be willing to spend 4 to 5 minutes over the course of those days
checkpointing the state as insurance against the possibility that the machine
will crash.

2. You can by a little more memory. Memory just isn't that expensive.

3. For enough money, Norm Hardy and I might actually get around to implementing
the "lightcones" proposal for variable rate checkpointing. Purely as a research
matter I'ld be interested in doing so, but that won't help you this week. This
proposal was generated in response to an inquiry from someone at NASA, who
wanted to run wind tunnel simulations (these have similar access properties to
the ones that Iain describes).

4. You might consider restructuring the access behavior of the application, as
it's killing the D-cache as well. Depending on your app, this may or may not be
feasible.

Ultimately, this goes back to some of the earlier comments about tradeoffs. The
designs of KeyKOS and EROS are biased for fault tolerance, decomposability, and
global consistency, because for the target applications we needed to address
these were mission critical requirements. Iain has identified an application
for which none of these criteria are important, but the last byte of memory is
precious. Given the falling cost of memory and the fact that no system can
successfully optimize for everything, I think that we made a reasonable choice
even if it isn't right for the problem Iain wants to solve.

And, as I say, we have a design that we think will solve the problem. If
pressed, I'ld have to admit that implementing that design is not presently at
the top of the queue.

** Software bugs and robustness versus execution states

Once again, an insightful question (that's two for two, Iain; keep it up, 'cause
the crowd is going wild in the stands :-).

Speaking for myself, I have never taken very seriously the notion that
persistence eliminates the need for serialization. Setting aside the robustness
issue, there is still a need for data interchange and network interaction, and
that invariably requires serialization. There does appear to be a "cut line"
that can be drawn in decomposed systems, and many services do not appear to need
to do their own serialization.

That said, let's answer your question.

First, it's useful to ask "What is the source of the robustness?" in your MAGIC
example. I suggest that it's not the distinction between files and memory.
Rather, it's the fact that the file-based form sits on the other side of a
protection boundary and uses a simpler representation. The protection boundary
eliminates random pointer problems. The simpler representation increases the
likelihood that the data is stored consistently.

In KeyKOS, the "file" object is a carefully implemented process. It does indeed
have to watch its pointer references closely, but no more or less so than the
Linux file system does. Ultimately, the usage pattern in EROS is expected to be
simpler -- there is neither need nor benefit to having a separate instance of
the editor for every document. Actually, this would impede data interchange
rather badly.

So where is the benefit of persistence?

1. Fast recovery.

2. Rapid response for key applications. Compare the cost of switching to the
address book in the Palm Pilot to the cost of bringing up the Palm Desktop
application every time you wish to use it.

3. Fault tolerance. Applications divided into pieces fail in more controllable
and recoverable ways.

4. Multiparty collaboration. You and I can set up a family of programs that
collectively implement controlled interaction between us. This is frequently a
complicated business, and consistency across processes becomes important.

5. Raw performance. Except in the corner case that you have already identified,
transparent persistence is simply less invasive and faster.

Perhaps Norm Hardy or Charlie Landau will have things to add about this (if so,
guys, please respond to BOTH LISTS!!)

Jonathan S. Shapiro, Ph. D.
IBM T.J. Watson Research Center
Email: shapj@us.ibm.com
Phone: +1 914 784 7085 (Tieline: 863)
Fax: +1 914 784 7595

Iain McClatchie <iainmcc@ix.netcom.com> on 06/23/99 08:10:19 PM

To: Jonathan S Shapiro/Watson/IBM@IBMUS
cc: linux-kernel@vger.rutgers.edu
Subject: Questions about EROS

Jonathan,

The discussion about EROS has been quite entertaining. I have three
questions.

1. Massive temporary computation caches
2. Software bugs and robust versus execution states
3. Allocation policy

-----
Massive temporary computation caches

I run an application which uses Binary Decision DAGs (BDDs) in the
optimization of netlists. On one small testcase, this application
grabs 100 MB of memory, which it overwrites almost 1000 times over the
course of 10 minutes. When I imagine running this application with
the BDDs in persistent storage, checkpointed once every five minutes,
I imagine one of two things will happen:

Either the application will grab less than half of the available
physical RAM, in which case half the RAM will store the checkpoint
being written and the other half will store the active process state,
or the application will come to a dead stop every five minutes when it
runs out of memory attempting to copy 96MB of its COW pages. It will
stay stopped like this for 10 to 15 seconds every 5 minutes, which is
about a 5% slowdown, or half the difference between a $550 500MHz PIII
and a $330 450MHz PIII. Either way it appears that persistence will
cost approximately $100 for this admittedly small testcase.

It seems that the checkpointing of massive temporary computation
caches is fairly expensive. Do you folks on EROS have some ideas
about how to mitigate this cost? For instance, how would an
application like Navigator cache the decompressed versions of the JPEG
pictures it presents frequently?

-----

Software bugs and robust versus execution states

I use another wonderful application, MAGIC, to edit the layout for my
circuits. MAGIC crashes every month or two due to software problems,
and yet I have little difficulty maintaining, uncorrupted, the state
of hundreds of interconnected layout cells. It appears to me that
this ease of use stems from having two representations of the layout
cells: one on disk, in ASCII, which is terse, simple, and robust, and
another in memory which is larger, complex, fast, but more prone to
accidental corruption.

How do persistent object systems contain software errors? At first
glance, it seems to me that you would need, at the very least, some
form of this dual data representation, along with its attendant
operator-directed translations between the two, and a way of putting
at least one representation into a seperately controlled, global
namespace. Having done this, you've recapitulated the programming
effort of having a file system and interfaces to that file system from
every program. Wasn't avoiding this effort one of the proposed
benefits of a persistent object system in the first place?

-----

Allocation policy

A few months ago we had a wonderful debate on linux-kernel about the
goodness of reiserfs, which is a new filesystem being developed by
Hans Reiser. I took away from that debate a notion that any decently
designed filesystem can find and read data already on the disk in a
minimum of seeks. The number of seeks, and the overall performance of
the filesystem is determined by their allocation policy: how they lay
the data down on the disk.

The filesystem in a persistent object system has the same problem
presented to it as the filesystem in a UNIX system, except (a) it is
also expected to store many short-lived objects, and (b) the entire
contents of memory are regularly swapped to disk. It seems that both
(a) and, to a lesser extent, (b) would form a kind of noise would tend
to swamp the signal that your filesystem allocator is looking for. If
you think of the allocator as being something like the malloc()
algorithm in a garbage collector, the programs in a UNIX system mostly
seperate the long-lived objects from the short-lived objects before
they get to the allocator.

Does EROS have some sort of system for implicitly or explicitly
recognising long-lived versus short-lived objects?

-Iain
iainmcc@ix.netcom.com

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/