Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30

From: Daniel Phillips
Date: Wed Jul 07 2004 - 20:12:57 EST


On Wednesday 07 July 2004 14:16, Lars Marowsky-Bree wrote:
> On 2004-07-06T17:34:51, Daniel Phillips <phillips@xxxxxxxxxx> said:
> > > And the "industry" was very reluctant
> > > too. Which meant that everybody spent ages talking and not much
> > > happening.
> >
> > We're showing up with loads of Sistina code this time. It's up to
> > everybody else to ante up, and yes, I see there's more code out
> > there. It's going to be quite a summer reading project.
>
> Yeah, I wish you the best. There's always been quite a bit of code to
> show, but that alone didn't convince people ;-) I've certainly grown
> a bit more experienced / cynical during that time. (Which, according
> to Oscar Wilde, is the same anyway ;)

OK, what I've learned from the discussion so far is that we need to avoid
getting stuck too much on the HA aspects and focus more on the
cluster/performance side for now. There are just too many entrenched
positions on failover. Even though every component of the cluster is
designed to fail over, that's just a small part of what we have to deal
with:

- Cluster volume management
- Cluster configuration management
- Cluster membership/quorum
- Node fencing
- Parallel cluster filesystems with local semantics
- Distributed locking
- Cluster mirror block device
- Cluster snapshot block device
- Cluster administration interface, including volume management
- Cluster resource balancing
- bits I forgot to mention

Out of that, we need to pick the three or four items we're prepared to
address immediately, that we can obviously share between at least two
known cluster filesystems, and get them onto lkml for peer review.
Trying to push the whole thing as one lump has never worked for
anybody, and won't work in this case either. For example, the DLM is
fairly non-controversial, and important in terms of performance and
reliability. Let's start with that.

Furthermore, nobody seems interested in arguing about the cluster block
devices either, so let's just discuss how they work and get them out of
the way.

Then let's tackle the low-level infrastructure, such as CCS (the
Cluster Configuration System), which does one simple job: it
distributes configuration files racelessly.
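The cluster-wide part (getting every node to flip to the same
generation of the file at the same time) is CCS's real job; at the
node end the last step is nothing fancier than the usual atomic-replace
trick, roughly like this (paths illustrative, this is not CCS code):

/* Minimal sketch of installing a new configuration generation on one
 * node without racing against readers: write it to a temporary file,
 * flush it, then rename() it over the live copy.  rename() is atomic,
 * so a reader sees either the old file or the new one, never a torn
 * mixture.  Paths are illustrative; this is not CCS code.
 */
#include <stdio.h>
#include <unistd.h>

static int install_config(const char *new_contents)
{
        const char *tmp  = "/etc/cluster/cluster.conf.tmp";
        const char *live = "/etc/cluster/cluster.conf";
        FILE *f = fopen(tmp, "w");

        if (!f)
                return -1;
        fputs(new_contents, f);
        fflush(f);
        fsync(fileno(f));               /* get it onto disk first */
        fclose(f);

        return rename(tmp, live);       /* atomic switch to the new generation */
}

int main(void)
{
        return install_config("<cluster name=\"demo\"/>\n") ? 1 : 0;
}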

I heard plenty of fascinating discussion of quorum strategies last
night, and have a number of papers to read as a result. But that's a
diversion: it can and must be pluggable. We just need to agree on how
the plugs work, a considerably less ambitious task.

In general, the principle is: the less important it is, the more
argument there will be about it. Defer that, make it pluggable, call
it policy, push it to user space, and move on. We need to agree on the
basics so that we can manage network volumes with cluster filesystems
on top of them.
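To show the kind of "plug" I mean for something like quorum, here is a
back-of-the-envelope sketch: just an ops table with the policy behind
it, so the policy can live wherever it likes and be swapped without
touching anything else. All names are invented for illustration; this
is nobody's actual interface.

/* Pluggable quorum policy, sketch only. */
#include <stdio.h>

struct quorum_policy {
        const char *name;
        /* do this many votes, out of this many expected, make quorum? */
        int (*have_quorum)(unsigned votes, unsigned expected);
};

static int majority(unsigned votes, unsigned expected)
{
        return 2 * votes > expected;            /* strict majority */
}

static int two_node(unsigned votes, unsigned expected)
{
        (void)expected;
        return votes >= 1;                      /* degenerate case: fencing decides the rest */
}

static struct quorum_policy policies[] = {
        { "majority", majority },
        { "two_node", two_node },
};

int main(void)
{
        /* whoever owns membership just asks the configured policy */
        struct quorum_policy *p = &policies[0];

        printf("%s: 3 of 5 votes -> %s\n", p->name,
               p->have_quorum(3, 5) ? "quorate" : "not quorate");
        return 0;
}

The point being: arguing over which policy is the right one never has
to hold up agreement on the interface itself.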

> > I can believe it. What I have just done with my cluster snapshot
> > target over the last couple of weeks is, removed _every_ dependency
> > on cluster infrastructure and moved the one remaining essential
> > interface to user space.
>
> Is there a KS presentation on this? I didn't get invited to KS and
> will just be allowed in for OLS, but I'll be around town already...

There will be a BOF at OLS, "Cluster Infrastructure". Since I didn't
get a KS invite either and what remains is more properly lkml stuff
anyway, I will go canoeing with Matt O'Keefe during KS as planned. We
already did the necessary VFS fixups over the last year (save the
non-critical flock patch, which is now in play) so there is nothing
much left to beg Linus for. There are additional VFS hooks that would
be nice to have for optimization, but they can wait, people will
appreciate them more that way ;)

The non-vfs cluster infrastructure just uses the normal module API,
except for a couple of places in the DM cluster block devices where
I've allowed myself some creative license, easily undone. Again, this
is lkml material, not KS stuff.
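By "normal module API" I mean nothing more exotic than the usual
init/exit pair; a cluster component is just a module like any other.
Skeleton only, names illustrative:

#include <linux/module.h>
#include <linux/init.h>

static int __init csnap_glue_init(void)
{
        /* register with device-mapper, open the control pipe, etc. */
        return 0;
}

static void __exit csnap_glue_exit(void)
{
        /* tear it back down */
}

module_init(csnap_glue_init);
module_exit(csnap_glue_exit);
MODULE_LICENSE("GPL");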

> > It looks like fencing is more of an issue, because having several
> > node fencing systems running at the same time in ignorance of each
> > other is deeply wrong. We can't just wave our hands at this by
> > making it pluggable, we need to settle on one that works and use
> > it. I'll humbly suggest that Sistina is furthest along in this
> > regard.
>
> Your fencing system is fine with me; based on the assumption that you
> always have to fence a failed node, you are doing the right thing.
> However, the issues are more subtle when this is no longer true, and
> in a 1:1 how do you arbitrate who is allowed to fence?

Good question. Since two-node clusters are my primary interest at the
moment, I need some answers. I think the current plan is: they try to
fence each other, winner take all. Each node will introspect to decide
if it's in good enough shape to do the job itself, then go try to fence
the other one. Alternatively, they can be configured so that one has
more votes than the other, if somebody wants that broken arrangement.

This is my dim recollection; I'll have more to say when I've actually
hooked my stuff up to it. There are others with plenty of experience
in this, see below.
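Roughly, the arbitration sketches out like this. This is not code from
the tree, just my reading of the plan, with stand-ins for the real
health check, membership query and fence agent:

#include <stdio.h>

static int local_node_healthy(void)  { return 1; }  /* stand-in: am I fit to act? */
static int peer_is_responding(void)  { return 0; }  /* stand-in: membership's view */
static int fence_peer(void)          { return 0; }  /* stand-in: real agent cuts power */

/* Returns 1 if we won and may recover resources, 0 if we stand down. */
static int arbitrate_two_node(void)
{
        if (peer_is_responding())
                return 0;       /* nothing to arbitrate, the cluster is whole */

        if (!local_node_healthy())
                return 0;       /* don't fence from a node that is itself sick */

        /*
         * Both nodes can reach this point at once.  Whoever completes
         * the fence operation first wins; the loser is powered off
         * before it can do any damage.  Unequal votes can bias the race
         * if somebody insists, but see above about that arrangement.
         */
        return fence_peer() == 0;
}

int main(void)
{
        printf(arbitrate_two_node() ? "won: recover resources\n"
                                    : "stand down\n");
        return 0;
}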

> > Cluster resource management is the least advanced of the components
> > that our Red Hat Sistina group has to offer, mainly because it is
> > seen as a matter of policy, and so the pressing need at this stage
> > is to provide suitable hooks.
> >
> > "STOMITH" :) Yes, exactly. Global load balancing is another big
> > item, i.e., which node gets assigned the job of running a
> > particular service, which means you need to know how much of each
> > of several different kinds of resources a particular service
> > requires, and what the current resource usage profile is for each
> > node on the cluster. Rik van Riel is taking a run at this.
>
> Right, cluster resource management is one of the things where I'm
> quite happy with the approach the new heartbeat resource manager is
> heading down (or up, I hope ;).

Combining heartbeat and resource management sounds like a good idea.
Currently, we have them separate and since I have not tried it myself
yet, I'll reserve comment. Dave Teigland would be more than happy to
wax poetic, though.

> > It's a huge, scary problem. We _must_ be able to plug in different
> > solutions, all the way from completely manual to completely
> > automagic, and we have to be able to handle more than one at once.
>
> You can plug multiple ones as long as they are managing independent
> resources, obviously. However, if the CRM is the one which ultimately
> decides whether a node needs to be fenced or not - based on its
> knowledge of which resources it owns or could own - this gets a lot
> more scary still...

We do not see the CRM as being involved in fencing at present, though I
can see why perhaps it ought to be. The resource manager that Lon
Hohberger is cooking up is scriptable and rule-driven. I'm sure we
could spend 100% of the available time on that alone. My strategy is:
I send my manually configurable cluster bits to Lon, he hooks them in
so everything is automagic, and then I look at how much the end result
sucks/doesn't suck.

There's some philosophy at work here: I feel that any cluster device
that requires elaborate infrastructure and configuration to run is
broken. If you can set the cluster devices up manually and they depend
only on existing kernel interfaces, they're more likely to get unit
testing. At the same time, these devices have to fit well into a
complex infrastructure, so the manual interface can be driven
equally well by a script or a C program, and there is one tiny but
crucial additional hook to allow for automatic reconnection to the
cluster if something bad happens, or if the resource manager just feels
the need to reorganize things.

So while I'm rambling here, I'll mention that the resource manager (or
anybody else) can just summarily cut the block target's pipe and the
block target will politely go ask for a new one. No IOs will be
failed, nothing will break, no suspend needed, just one big breaker
switch to throw. This of course depends on the target using a pipe
(socket) to communicate with the cluster, but even if I do switch to
UDP, I'll still keep at least one pipe around, just because it makes
the target so easy to control.

It didn't start this way. The first prototype had a couple thousand
lines of glue code to work with various possible infrastructures. Now
that's all gone and there are just two pipes left, one to local user
space for cluster management and the other to somewhere out on the
cluster for synchronization. It's now down to 30% of the original size
and runs faster as a bonus. All cluster interfaces are "read/write",
except for one ioctl to reconnect a broken pipe.
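Reconnecting a cut pipe from user space then amounts to nothing more
than opening a fresh socket and handing it to the target in one ioctl,
something like the following. The device name, ioctl number, address
and port are all made up for illustration; the actual names differ.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define CSNAP_RECONNECT _IOW('x', 1, int)       /* made-up ioctl number */

int main(void)
{
        struct sockaddr_in addr = {
                .sin_family = AF_INET,
                .sin_port   = htons(9999),              /* illustrative port */
        };
        int sock, dev;

        /* open a fresh pipe (socket) to the cluster synchronization server */
        sock = socket(AF_INET, SOCK_STREAM, 0);
        if (sock < 0) {
                perror("socket");
                return 1;
        }
        inet_pton(AF_INET, "10.0.0.1", &addr.sin_addr); /* illustrative address */
        if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                perror("connect");
                return 1;
        }

        /* hand the new socket to the block target; it takes over from here,
         * no suspend, no failed IOs */
        dev = open("/dev/mapper/csnap0", O_RDWR);       /* illustrative device */
        if (dev < 0) {
                perror("open target");
                return 1;
        }
        if (ioctl(dev, CSNAP_RECONNECT, sock) < 0) {
                perror("reconnect ioctl");
                return 1;
        }
        close(dev);
        return 0;
}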

> > Incidentally, there is already a nice cross-section of the cluster
> > community on the way to sunny Minneapolis for the July meeting.
> > We've reached about 50% capacity, and we have quorum, I think :-)
>
> Uhm, do I have to be frightened of being fenced? ;)

Only if you drink too much of that kluster Koolaid.

Regards,

Daniel