
Split Brain? Never Again! A New Solution for an Old Problem: DRBD Quorum

While attending the OpenStack Summit in Atlanta, I sat in a talk about the difficulties of implementing High Availability (HA) clusters. At one point, the speaker presented a picture of a split brain and discussed the challenges of resolving one, as well as of implementing STONITH in certain environments. As many of you know, “split brain” is a condition that can occur when each node in a cluster thinks that it is the only active node. The system as a whole loses track of its “state”; nodes can go rogue, and data sets can diverge without it being clear which one is authoritative. Data loss or data corruption can result, but there are ways to make sure this doesn’t happen, so I was interested in probing further.

Fencing is not always the solution

The split brain problem can be solved by DRBD Quorum.

To make it more interesting, it turned out that the speaker’s company uses DRBD and Pacemaker for HA, a setup that is very familiar to us. After the talk, I approached the speaker and recommended that they consider “fencing” as a way to avoid split brain. Fencing regulates access to a shared resource and can be a good safeguard. However, the fencing mechanism needs its own communication path; best practice is not to rely on the very network path it is meant to protect. Unfortunately, in his environment, redundant networking was not possible. We needed another method.

Split brain is solved via DRBD Quorum

After talking to the speaker, it was clear to me that a new option for avoiding split brain or diverging data sets was needed, since existing solutions are not always feasible in certain infrastructures. This got me thinking about the various ways of avoiding split brain and how fencing could be implemented using the built-in communication found in DRBD 9. It turns out that DRBD 9’s capability of mirroring data across more than two nodes makes this viable.

That idea sparked the work on the newest feature in DRBD: Quorum.

Shortly thereafter, the LINBIT team developed and integrated a working solution into DRBD. The code was pushed to the LINBIT repository and was ready for testing.

Interest was almost immediate!

Later on, I happened to meet a few folks from IBM UK. They were working on IBM MQ Advanced, the well-known messaging middleware that helps integrate applications and data across multiple platforms. They intended to use DRBD for their replication needs and quickly became interested in the idea of using a Quorum mechanism to mitigate split-brain situations.

DRBD Quorum takes a new perspective

The DRBD Quorum feature takes a new approach to avoiding data divergence. A cluster partition may only modify the replicated data set if the number of nodes that can communicate with each other is greater than half of the overall number of nodes within the defined cluster. By only allowing writes in a partition that contains more than half of the cluster’s nodes, we avoid creating diverging data sets.
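To make the rule concrete: in DRBD 9 the quorum policy is enabled per resource, in its options section. The snippet below is only a minimal sketch; the resource name r0 is a placeholder, and the usual “on <host>” and connection sections for the three nodes are omitted. See the user’s guide linked below for the authoritative syntax.

    resource r0 {
      options {
        # A partition may only write when it can reach a majority of the
        # configured nodes (2 out of 3 in a three-node cluster).
        quorum majority;
      }
      # "on <host>" and connection sections for the three nodes go here
    }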

The initial implementation of this feature would cause any node that lost Quorum (and was running the application/data set) to be rebooted. Removing access to the data set is required to ensure the node stops modifying data. After extensive testing, the IBM team suggested a new idea: instead of rebooting the node, terminate the application. This action then triggers the already available recovery process, forcing services to migrate to a node with Quorum!
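In configuration terms, this behavior can be selected with the on-no-quorum option; the sketch below assumes the io-error policy (the alternative, suspend-io, freezes I/O instead of failing it).

    options {
      quorum majority;
      # On quorum loss, complete I/O requests with errors rather than blocking.
      # The application sees the errors and terminates, and the cluster manager
      # can restart it on a node in a partition that still has quorum.
      on-no-quorum io-error;
    }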

Attractive alternative to fencing

As usual, the devil is in the details. Getting the implementation right, with the appropriate resync decisions, was not as straightforward as one might think. In addition to our own internal testing, many IBM engineers tested it as well. We are happy to report that the current implementation does exactly what was expected!

Bottom line:

If you need to mirror your data set three times, the new DRBD Quorum feature is an attractive alternative to hardware fencing.

If you want to learn more about the Quorum implementation in DRBD, please see the DRBD 9 user’s guide:
https://docs.linbit.com/docs/users-guide-9.0/#s-feature-quorum
https://docs.linbit.com/docs/users-guide-9.0/#s-configuring-quorum

Image: Lloyd Fugde – stock.adobe.com

Philipp Reisner
Philipp Reisner is founder and CEO of LINBIT in Vienna, Austria. His professional career has been dominated by developing DRBD, a storage replication solution for Linux. Today he leads a company of about 30 employees, with locations in Vienna, Austria, and Portland, Oregon.

2 replies
  1. Don says:

    Thank you for the interesting post; I am very curious about the following.

    Unless I missed something, the drawback of this solution is that all data is stored 3 times.

    I wonder why DRBD doesn’t provide a solution like Corosync does with qdevice/qnetd (or a variation of that). One could then even run the (as far as DRBD is concerned) diskless third/arbitrator daemon/node on a Raspberry Pi, for example. In that case, one could think of a system in which a split brain (and therefore data loss/corruption) could be absolutely 100% guaranteed not to occur under any circumstances.

    I think that DRBD could fill a gap in the SDS market, but the whole split brain issue is a huge risk in my opinion and storing data 3 times heavily changes the economics.

    Kindest regards, Don

    • Phil says:

      The difference here is that Corosync is stateless and DRBD is stateful, in the sense that Corosync just needs to answer the question of whether the partition has quorum, while a partition/group of DRBD nodes only has quorum when the participating nodes agree on the data generation.

      Well, I guess that is a bit hard to understand. Let me give you an example that shows why it does not work with a diskless arbitrator node. Let’s start our thought experiment with a 3-node DRBD cluster: two nodes have a backing disk, one is diskless. Say one of the diskful nodes gets isolated. Now the other diskful node and the diskless node form a partition, and let’s assume we consider those two to have quorum. This allows them to alter the data set; let’s call that data-generation-2. Then comes a complete power outage, and all 3 nodes go down. Power returns. Now the diskful node that is still on data-generation-1 and the diskless node form a partition. Our weak algorithm now grants quorum to these two. You see the problem? This does not happen if all 3 nodes have a copy of the complete data set.

      One could come up with a clever scheme where the diskless node gets some tiny meta-data in which it can persist which data generation it enabled to have quorum. That would avoid the problem just described in the example above. So, yes, there is room for improvement. No, it is not on the short/mid-term road map.

