Re: [PATCH 00/12] DRBD: a block device for HA clusters

From: Philipp Reisner
Date: Tue Apr 07 2009 - 11:56:38 EST


On Tuesday 07 April 2009 14:23:14 Nikanth K wrote:
> Hi Philipp,
>
> On Mon, Mar 30, 2009 at 10:17 PM, Philipp Reisner
>
> <philipp.reisner@xxxxxxxxxx> wrote:
> > Hi,
> >
> > This is a repost of DRBD, to keep you updated about the ongoing
> > cleanups.
> >
> > Description
> >
> > DRBD is a shared-nothing, synchronously replicated block device. It
> > is designed to serve as a building block for high availability
> > clusters and in this context, is a "drop-in" replacement for shared
> > storage. Simplistically, you could see it as a network RAID 1.
> >
> > Each minor device has a role, which can be 'primary' or 'secondary'.
> > On the node with the primary device the application is supposed to
> > run and to access the device (/dev/drbdX). Every write is sent to
> > the local 'lower level block device' and, across the network, to the
> > node with the device in 'secondary' state. The secondary device
> > simply writes the data to its lower level block device.
> >
> > DRBD can also be used in dual-primary mode (device writable on both
> > nodes), which means it can exhibit shared-disk semantics in a
> > shared-nothing cluster. Needless to say, on top of dual-primary
> > DRBD a cluster file system is necessary to maintain cache
> > coherency.
> >
> > This is one of the areas where DRBD differs notably from RAID1 (say
> > md) stacked on top of NBD or iSCSI. DRBD solves the issue of
> > concurrent writes to the same on-disk location. That is an error of
> > the layer above us -- it usually indicates a broken lock manager in
> > a cluster file system -- but DRBD has to ensure that both sides
> > agree on which write came last, and therefore overwrites the other
> > write.
>
> So this difference to RAID1+NBD is required only if the DLM of the
> clustered fs is buggy?
>

No, DRBD is much more than RAID1+NBD. I had the impression that by writing
"RAID1+NBD" I could quickly communicate the big picture of what DRBD is.

> > More background on this can be found in this paper:
> >    http://www.drbd.org/fileadmin/drbd/publications/drbd8.pdf
> >
> > Beyond that, DRBD addresses various issues of cluster partitioning,
> > which the MD/NBD stack, to the best of our knowledge, does not
> > solve. The above-mentioned paper goes into some detail about that as
> > well.
>
> It would be nice, if you can list those limitations of NBD/RAID here.
>

Ok. I will give you two simple examples:

1)
Think of a two-node HA cluster. Node A is active ('primary' in DRBD speak),
has the filesystem mounted and the application running. Node B is
in standby mode ('secondary' in DRBD speak).

We lose network connectivity; the primary node continues to run, but the
secondary no longer gets updates.

Then we have a complete power failure; both nodes are down. Later the
data center is powered up again, but at first only the power circuit
of node B comes back up.

Should node B offer the service right now?
(DRBD has configurable policies for that.)

Later on they manage to get node A up and running again; now let's assume
node B was chosen to be the new primary node. What needs to be done?

Modifications on B since it became primary need to be resynced to A.
Modifications on A since it lost contact with B need to be rolled back.

DRBD does that.

How do you fit that into a RAID1+NBD model? NBD is just a block transport;
it does not offer the ability to exchange dirty bitmaps or data generation
identifiers, nor does the RAID1 code have a concept of them.
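To illustrate the idea of data generation identifiers, here is a toy
sketch in Python. It is loosely modelled on the scheme described in the
drbd8 paper linked above; the names (GenIds, sync_decision) and the
integer "UUIDs" are invented for this example and are not DRBD's actual
on-disk format:

```python
from dataclasses import dataclass, field

@dataclass
class GenIds:
    """A node's view of its data generations."""
    current: int                                  # generation now being written
    history: list = field(default_factory=list)   # older generations, newest first

def sync_decision(mine, peer):
    """Decide who resyncs from whom when two nodes reconnect."""
    if mine.current == peer.current:
        return "in sync"
    if mine.current in peer.history:
        # Peer built newer generations on top of mine: I missed updates.
        return "I am outdated: sync from peer"
    if peer.current in mine.history:
        return "peer is outdated: peer syncs from me"
    # Neither side's current generation is an ancestor of the other's.
    return "split brain: both sides diverged, policy decision needed"

# Scenario 1 from the text: B went primary (generation 3) while A was
# stuck at generation 2, so on reconnect A must sync from B.
a = GenIds(current=2, history=[1])
b = GenIds(current=3, history=[2, 1])
```

A plain mirror has no such history: after the partition both replicas just
hold bytes, and nothing records which side kept writing.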

2)
When using DRBD over low-bandwidth links, one has to run a resync; DRBD
offers the option to do a "checksum based resync". Similar to rsync, it
first exchanges only a checksum per block and transmits the whole data
block only if the checksums differ.

That again is something that does not fit into the concepts of NBD or RAID1.
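The mechanism can be sketched in a few lines of Python. This is only an
illustration of the principle; the block size, the hash choice, and the
function names are assumptions for this sketch, not DRBD's actual wire
protocol:

```python
import hashlib

BLOCK_SIZE = 4096  # assumed resync granularity; DRBD's real unit differs

def blocks(data):
    """Yield (offset, block) pairs over a byte buffer."""
    for off in range(0, len(data), BLOCK_SIZE):
        yield off, data[off:off + BLOCK_SIZE]

def checksum_resync(source, target):
    """Return the (offset, block) pairs that must be shipped to the peer.

    Only a digest is 'sent over the wire' for each block; the full block
    crosses the link only when the digests differ."""
    to_send = []
    for (off, src_blk), (_, tgt_blk) in zip(blocks(source), blocks(target)):
        if hashlib.sha1(src_blk).digest() != hashlib.sha1(tgt_blk).digest():
            to_send.append((off, src_blk))
    return to_send

# Two 16 KiB "devices" that differ in a single block: only that block
# needs to be transferred, the other three cost just a digest each.
src = bytearray(b"\x00" * (4 * BLOCK_SIZE))
tgt = bytearray(src)
src[2 * BLOCK_SIZE] = 0xFF   # divergence in block 2
diff = checksum_resync(bytes(src), bytes(tgt))
```

With 4 KiB blocks and 20-byte digests, a mostly-identical device costs a
tiny fraction of a full retransmit, which is exactly what you want on a
small-bandwidth link.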

I will write down more examples if you think you need more justification
for yet another implementation of RAID in the kernel. DRBD does more, but
DRBD is not suitable for RAID1 on a local box.

PS: Lars Marowsky-Bree requested a GIT tree of the DRBD-for-mainline kernel
patch. I will set that up by Friday, and maintain the code there for
the merging process.

Best,
Philipp
--
: Dipl-Ing Philipp Reisner
: LINBIT | Your Way to High Availability
: Tel: +43-1-8178292-50, Fax: +43-1-8178292-82
: http://www.linbit.com

DRBD(R) and LINBIT(R) are registered trademarks of LINBIT, Austria.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/