Re: POHMELFS high performance network filesystem. Transactions, failover, performance.

From: Jeff Garzik
Date: Wed May 14 2008 - 15:04:00 EST


Evgeniy Polyakov wrote:
Hi Sage.

On Wed, May 14, 2008 at 06:35:19AM -0700, Sage Weil (sage@xxxxxxxxxxxx) wrote:
What is your opinion of the Paxos algorithm?
It is slow. But it does solve failure cases.
For writes, Paxos is actually more or less optimal (in the non-failure cases, at least). Reads are trickier, but there are ways to keep that fast as well. FWIW, Ceph extends basic Paxos with a leasing mechanism to keep reads fast, consistent, and distributed. It's only used for cluster state, though, not file data.

Well, it depends... If we are talking about single-node performance,
then any protocol which requires waiting for authorization (or any
approach which waits for an acknowledgement right after data is sent) is slow.

Quite true, but IMO single-node performance is largely an academic exercise today. What production system is run without backups or replication?


If we are talking about aggregate parallel performance, then its basic
protocol with 2 messages is (probably) optimal, but I'm still not
convinced that 2 messages is a good choice; I want one :)

I think part of Paxos' attraction is that it is provably correct for the chosen goal, which historically has not been true of the hand-rolled consensus algorithms so often found these days.
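For readers following the thread, here is a minimal single-process sketch of Classical (single-decree) Paxos, showing the two phases whose message count is being debated. It assumes in-memory acceptors, no message loss, and a unique non-negative integer ballot per proposal; the names `Acceptor` and `propose` are hypothetical and are not taken from POHMELFS or Ceph.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class Acceptor:
    # Sketch only: real acceptors must persist this state before replying.
    promised: int = -1         # highest ballot this acceptor has promised
    accepted_ballot: int = -1  # ballot of the last accepted value (-1 = none)
    accepted_value: Any = None

    def prepare(self, ballot: int) -> Optional[Tuple[int, Any]]:
        """Phase 1b: promise to ignore lower ballots; report any prior accept."""
        if ballot > self.promised:
            self.promised = ballot
            return (self.accepted_ballot, self.accepted_value)
        return None  # reject: already promised an equal or higher ballot

    def accept(self, ballot: int, value: Any) -> bool:
        """Phase 2b: accept unless a higher ballot has been promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted_ballot = ballot
            self.accepted_value = value
            return True
        return False


def propose(acceptors: List[Acceptor], ballot: int, value: Any) -> Optional[Any]:
    """Run one round; return the chosen value, or None if no majority formed."""
    majority = len(acceptors) // 2 + 1

    # Phase 1a/1b: collect promises from a majority.
    promises = [p for p in (a.prepare(ballot) for a in acceptors) if p is not None]
    if len(promises) < majority:
        return None

    # Safety rule: if any acceptor already accepted a value, re-propose the
    # value with the highest accepted ballot instead of our own.
    prev_ballot, prev_value = max(promises, key=lambda p: p[0])
    if prev_ballot >= 0:
        value = prev_value

    # Phase 2a/2b: ask acceptors to accept; chosen once a majority agrees.
    acks = sum(1 for a in acceptors if a.accept(ballot, value))
    return value if acks >= majority else None
```

Usage: with three acceptors, `propose(accs, 1, "A")` chooses `"A"`; a later `propose(accs, 2, "B")` still returns `"A"`, because the safety rule forces it to adopt the already-chosen value, and a stale `propose(accs, 1, "C")` is rejected. In the common Multi-Paxos optimization a stable leader skips phase 1 for subsequent values, which is where the "2 messages" (accept, then accepted) per write in the discussion above comes from.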

There are a bunch of variants (Fast Paxos, Byzantine Paxos, Fast Byzantine Paxos, etc., etc.) based on Classical Paxos which make improvements in the performance/latency areas. There is even a Paxos Commit, which appears to be more efficient than the standard two-phase commit (2PC) transaction protocol used by several existing clustered databases.

Jeff
