Re: Fw: Some very thought-provoking ideas about OS architecture.

Steve Underwood (steveu@netpage.com.hk)
Mon, 21 Jun 1999 09:56:22 +0000


>>> [Alan Cox:] Another peril is that external interfaces don't always like
>>> replay of events.
>>
>>[Eric Raymond:] A much more serious objection, I agree.

>[Jonathan Shapiro:]
> There is a more fundamental issue: replaying events to external interfaces
> violates security, because the connections need to be reauthenticated. EROS
> drivers are part of the kernel, which is outside of the checkpoint contract, so
> the issue you are getting at doesn't really apply where devices are concerned.
> Low-level code (e.g. the net stack) must be written with awareness of the
> checkpoint mechanism. The operating system explicitly rescinds all external
> connections on restart in order to ensure that reauthentication occurs
> correctly.

This seems tremendously vague. You raise further problems beyond the one
Alan highlighted, but offer no explanation of a cure. Replaying events
is impossible - the rest of the world just won't play ball. Picking up
where things left off is also impossible - you don't know where that
was; you only know about the last checkpoint. Clearly network
connections must be rescinded, but what do you do about that? My
understanding is that when the machine comes back up it will step back
precisely to the last checkpoint, and run from there. I guess all its
TCP connections will suddenly show an error and die, and the software
must reconnect to the rest of the universe. Right? What then happens to
other stateful external interactions? Do you have some mechanism that
will cause all external stateful activities to die, so they don't
continue in an undefined way? I'm thinking of UDP interactions,
interfaces with custom equipment, printing, and so on. Custom interfaces
would be very difficult to deal with in a totally generalised way.
Persistence is a curse, not a benefit, when applied to these things.
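
To make the TCP case concrete, here is a minimal sketch of one way
"rescind on restart" could look from the application side. It is not
EROS code: Kernel, Connection, StaleConnection and send_with_reconnect
are hypothetical names, and the restart epoch is assumed to be kept
outside the checkpointed state.

```python
class StaleConnection(Exception):
    """Raised when a handle predates the most recent restart."""


class Kernel:
    """Hypothetical stand-in for the part of the system outside the checkpoint."""

    def __init__(self):
        # Assumption: this counter is not part of the checkpointed state,
        # so rolling back to a checkpoint never rewinds it.
        self.epoch = 0

    def restart_from_checkpoint(self):
        # Bumping the epoch rescinds every connection made before the restart.
        self.epoch += 1


class Connection:
    """Checkpointed handle to an external peer."""

    def __init__(self, kernel, peer):
        self.kernel = kernel
        self.peer = peer
        self.epoch = kernel.epoch  # epoch in which this handle was authenticated

    def send(self, data):
        if self.epoch != self.kernel.epoch:
            raise StaleConnection(self.peer)
        return len(data)


def send_with_reconnect(kernel, conn, data):
    """Application-level recovery: reconnect once, then retry the send."""
    try:
        return conn, conn.send(data)
    except StaleConnection:
        fresh = Connection(kernel, conn.peer)  # reauthentication would happen here
        return fresh, fresh.send(data)
```

This only makes the failure visible to the software holding the handle.
It says nothing about the state of the peer, which is exactly the
problem with custom equipment or a half-finished print job.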

>[Jonathan Shapiro:]
> The real win, and the answer to Alan's question, is that it's not about large
> sequential writes. It's about bulk block transfers. As you modify small
> objects, changes are written to a write-ahead log, which is where the disk head
> spends most of its time. Empirically, you are rarely more than three tracks
> from where you want to be. This significantly reduces seek time. Also, the log
> is logically append-only, which means that few writes (even if they are to
> different objects) require a seek at all. In this sense, there is indeed a
> large sequential block write occurring.
>
> Eventually, the log fills up, a checkpoint transaction is declared, and a bulk
> transfer of the data to its home locations occurs. What Eric describes as a
> sequential block write isn't really what happens here. What really happens is
> that you reach into the log, pull out an overflowing handful of blocks to be
> moved to their home location, and then transfer them in a single arm pass over
> the disk. In practice, many of the mutates are clustered, so you end up doing
> seeks only between clusters. Essentially it's a sorted bulk transfer. If the
> machine fails during transfer you are okay because a complete snapshot exists in
> the checkpoint log.
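
The scheme described in the quoted text can be sketched roughly as
follows. This is a toy model, not EROS code: the class LoggedDisk, its
methods and the log_capacity parameter are hypothetical, and a "block"
is just an integer standing in for a home location on disk.

```python
class LoggedDisk:
    """Toy model: writes go to an append-only log, then migrate home in sorted order."""

    def __init__(self, log_capacity):
        self.log_capacity = log_capacity
        self.log = []     # (home_block, data) pairs in arrival order
        self.home = {}    # home_block -> data at its home location
        self.passes = []  # block order used by each migration pass

    def write(self, block, data):
        # Append to the log; no seek to the block's home location is needed.
        self.log.append((block, data))
        if len(self.log) >= self.log_capacity:
            self.checkpoint()

    def checkpoint(self):
        latest = {}
        for block, data in self.log:
            latest[block] = data  # later log entries supersede earlier ones
        order = sorted(latest)    # ascending block number models a single arm pass
        for block in order:
            self.home[block] = latest[block]
        self.passes.append(order)
        # The log is discarded only after the whole transfer has completed,
        # so a failure during the transfer can be redone from the log.
        self.log.clear()

    def read(self, block):
        # The newest copy may still be in the log.
        for logged_block, data in reversed(self.log):
            if logged_block == block:
                return data
        return self.home.get(block)
```

With a capacity of 4, writing blocks 90, 3, 90 and 41 in that order
leaves nothing at the home locations until the fourth write, which then
migrates blocks 3, 41 and 90 in ascending order.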

This description is very write-oriented. What about reads? In a
real-world situation the machine doesn't just sit there processing,
changing its data, and needing writes to make it persistent. Most
applications manage a large data set, and are forever hopping around
that data. If the RAM is much smaller than the disk (as in anything but
a TPC benchmark test) there will be lots of disk reading. In most
systems this is through explicit reads. I assume in EROS there is a
tremendous number of page faults and VM swaps (am I wrong?). Every such
read drags the head away from the log. Doesn't this rather pull the
large disk block operation concept apart?

Another issue: does the read/write clustering offer that much benefit
with modern disk drives? They keep bringing down the seek times, as the
head assemblies get smaller and lighter, but they can't damp the
vibration much faster. The seek times are approaching the settling
times, so clustering the reads and writes around a few tracks doesn't
have the performance benefit it used to have (though it might help with
wear and tear). Writing a large stream of contiguous sectors certainly
has benefits, but does writing non-contiguous sectors in a small area of
the disk still offer much over writing widely spaced ones?
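
The arithmetic behind that question can be sketched with a toy model.
It assumes a seek costs a fixed settling time plus travel time
proportional to the distance in tracks. Real arms accelerate and
decelerate, so the linear term is a simplification, and the function
names and every number used with them are illustrative assumptions, not
measurements of any drive.

```python
def seek_ms(tracks, settle_ms, ms_per_track):
    """Toy seek cost: fixed settling time plus travel proportional to distance."""
    if tracks == 0:
        return 0.0  # head is already on the right track
    return settle_ms + ms_per_track * tracks


def clustering_speedup(near, far, settle_ms, ms_per_track):
    """How many times cheaper a short seek is than a long one under the model."""
    return seek_ms(far, settle_ms, ms_per_track) / seek_ms(near, settle_ms, ms_per_track)


# Illustrative values only: 3 tracks versus 1000 tracks, 2 ms settling time.
slow_arm = clustering_speedup(3, 1000, settle_ms=2.0, ms_per_track=0.02)
fast_arm = clustering_speedup(3, 1000, settle_ms=2.0, ms_per_track=0.002)
```

With these made-up figures the short seek is roughly 10.7 times cheaper
on the slow arm but only about 2 times cheaper on the fast one. As
travel time shrinks toward the settling time, staying within a few
tracks buys less and less.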

Steve
