Re: POHMELFS high performance network filesystem. Transactions, failover, performance.

From: Sage Weil
Date: Wed May 14 2008 - 12:10:21 EST


On Wed, 14 May 2008, Jamie Lokier wrote:
> > Similarly, if only 1 out of 3 replicas is surviving, most people want to
> > be able to read their data, while Paxos demands a majority to ensure it is
> > correct.
>
> (Generalising to any "quorum" (majority vote) protocol).
>
> That's true if you require that all results are guaranteed consistent
> or blocked, in the event of any kind of network failure.
>
> But if you prefer incoherent results in the event of a network split
> (and those are often mergeable later), and only want to protect against
> media/node failures to the best extent possible at any given time,
> then quorum protocols can gracefully degrade so you still have access
> without a majority of working nodes.

Right. In my case, I require guaranteed consistent results for critical
cluster state, and use (slightly modified) Paxos for that. For file data,
I leverage that cluster state to still maintain perfect consistency in
most failure scenarios, while also degrading gracefully to read/write
access to a single replica.
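To make that split concrete, here is a minimal Python sketch. It assumes the critical cluster state lives on a separate set of monitor nodes running a majority-quorum protocol. It also assumes file data is served only once the surviving replica set can be recorded in that state. All function names are hypothetical and do not come from any real codebase.

```python
def has_majority(alive: int, total: int) -> bool:
    """Paxos-style quorum: strictly more than half of all members must be up."""
    return alive > total // 2


def can_update_cluster_state(monitors_up: int, monitors_total: int) -> bool:
    # Critical cluster state (hypothetically kept on monitor nodes) is only
    # changed when a majority of monitors agree, so it is never inconsistent.
    return has_majority(monitors_up, monitors_total)


def can_serve_file_data(monitors_up: int, monitors_total: int,
                        replicas_up: int) -> bool:
    # File data degrades gracefully: once the cluster state can record which
    # replicas survived, even a single surviving replica may serve reads/writes.
    return can_update_cluster_state(monitors_up, monitors_total) and replicas_up >= 1


# 1 of 3 data replicas left, all monitors healthy: file data is still served.
print(can_serve_file_data(3, 3, 1))   # True
# 1 of 3 monitors left: no majority, so cluster state changes block.
print(can_update_cluster_state(1, 3))  # False
```

The design choice is that only the small, critical state pays the majority cost. Bulk data needs just one live copy, provided the consistent state says which copy that is.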

When problem situations arise (e.g., replicating to A+B, A fails,
read/write to just B for a while, B fails, A recovers), an administrator
can step in and explicitly indicate that we want to relax consistency in
order to continue (e.g., if B is found to be unsalvageable and a stale A
is the best we can do).
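The A+B scenario can be sketched as follows, assuming the consistent cluster state records which replicas held every write in the most recent interval. The class and field names are hypothetical. Recovery and resync of a returning replica while a complete one is still up is deliberately not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class PlacementHistory:
    # Hypothetical: replicas known to hold every write of the latest interval,
    # as recorded in the (Paxos-replicated) cluster state.
    last_complete: set[str] = field(default_factory=set)
    # Set only by an explicit administrator decision, never automatically.
    admin_accepts_stale: bool = False

    def record_interval(self, writers: set[str]) -> None:
        """Record that, from now on, writes go to exactly these replicas."""
        self.last_complete = set(writers)

    def may_serve(self, up: set[str]) -> bool:
        # Safe if at least one replica with the complete history is up.
        if self.last_complete & up:
            return True
        # Only stale replicas remain: require explicit admin consent.
        return self.admin_accepts_stale and bool(up)


h = PlacementHistory()
h.record_interval({"A", "B"})  # replicating to A+B
h.record_interval({"B"})       # A fails; writes continue on B alone
print(h.may_serve({"A"}))      # B fails, A recovers: False (A is stale)
h.admin_accepts_stale = True   # admin decides a stale A is the best we can do
print(h.may_serve({"A"}))      # True
```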

> In that model, neighbour sensing is used to find the largest coherency
> domains fitting a set of parameters (such as "replicate datum X to N
> nodes with maximum comms latency T"). If the parameters are able to
> be met, quorum gives you the desired robustness in the event of
> node/network failures. Whenever the coherency parameters
> cannot be met, the robustness reduces to the best it can do
> temporarily, and recovers when possible later. As a bonus, you have
> some timing guarantees if those are more important.

Anything that silently relaxes consistency like that scares me. Does
anybody really do that in practice?
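For contrast, here is a rough sketch of how the quoted "replicate datum X to N nodes with maximum comms latency T" policy might pick its coherency domain. The function name and the per-node latency map in milliseconds are hypothetical. Running quorum within the chosen domain is not shown. Note that the `degraded` flag is only returned and nothing blocks on it, which is the kind of silent relaxation questioned above.

```python
def choose_coherency_domain(latencies_ms: dict[str, float], n: int,
                            max_latency_ms: float) -> tuple[list[str], bool]:
    """Pick up to n nodes whose measured latency is within the bound.

    Returns (chosen_nodes, degraded), where degraded is True when fewer
    than n nodes qualify and the system is running below its target.
    """
    # Consider only nodes that meet the latency bound, lowest latency first.
    eligible = sorted((lat, node) for node, lat in latencies_ms.items()
                      if lat <= max_latency_ms)
    chosen = [node for _, node in eligible[:n]]
    return chosen, len(chosen) < n


# Node "c" is too far away, so only 2 of the desired 3 replicas are used.
print(choose_coherency_domain({"a": 2.0, "b": 5.0, "c": 40.0}, 3, 10.0))
```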

sage