Cheap Votes: DRBD Diskless Quorum

One of the most important considerations when implementing clustered systems is ensuring that a cluster remains cohesive and stable given unexpected conditions. DRBD already has fencing mechanisms and even a system of quorum, which is now capable of using a diskless arbitrator to break ties without requiring additional storage beyond that of two nodes.

Quorum and Fencing With a Healthy Dose of Reality

DRBD’s quorum implementation allows resources to vote on availability, taking into account connection state and disk state. While a DRBD cluster without quorum will allow promotion and writes on any node with “UpToDate” data, DRBD with quorum enabled adds the requirement that this node must also be in contact with either a majority of healthy nodes in the cluster, or a minimum amount of nodes as defined statically. This requires at least three nodes, and works best with odd numbers of nodes. A DRBD cluster with quorum enabled cannot become split-brain.

Fencing on the other hand, employs a mechanism to ensure node state by isolating or powering off a node in some way so that unhealthy nodes can be guaranteed to not provide services (by virtue of being assuredly offline). While the use case for fencing and quorum overlap by a large degree, fencing can automatically eject or recover misbehaving nodes, while quorum simply ensures that they cannot modify data.

It is possible to utilize scripts that are triggered in response to changes in quorum as a simple but effective fencing system via a “suicide” method — configuring a node to automatically reset or power itself off upon loss of quorum (accomplished via the “on-quorum-loss” handler in DRBD’s configuration). However, fully-fledged fencing methods via Pacemaker have much more logic behind them, can work even when the node to be fenced is entirely unresponsive, and make Pacemaker clusters “aware” of fencing actions.

The most important element to consider is that while both methods prevent split-brain conditions, quorum does not wholly and entirely replace out-of-band fencing. However, it comes extremely close, and in fact, close enough to eschew Pacemaker-based fencing in many configurations in favor of only quorum where fencing via privileged APIs (as is common in clouds) or dedicated fencing hardware (such as network PDUs or IPMI cards) is less than possible or desired.

Arbitrators!

Before now, in order for a DRBD resource to have three votes across three nodes for quorum, it needed three replicas of data. This was cost prohibitive in some scenarios, so additional logic was added to allow a diskless “arbitrator” node that does not participate in replication. Thusly, the diskless DRBD arbitrator was born.

The concept is fairly simple; rather than require a minimum of three replicas in a DRBD resource to enable quorum functionality, one can now use two replicas (or “data” nodes) with a third DRBD node in a permanently and intentionally diskless state as an “arbitrator” for breaking ties.

The same concepts of traditional DRBD quorum apply, with one significant exception: In a replica 2+A cluster, one node can be lost or disconnected without losing quorum — just like a replica 3 cluster. However, that arbitrator node cannot (on its own) participate in restoring quorum after it is lost.

The reason for this exception is simple: The arbitrator node has no disk. Without a disk, there is no way to independently determine whether data is valid, inconsistent, or related to the cluster at all because there is no data on that node to compare replicas with. While an arbitrator node cannot restore quorum to a single other inquorate data node, two data nodes may establish or re-establish quorum with each other. This is highly effective, and conquers the vast majority of quorum decisions at roughly 66% the cost of a replica 3 cluster.

Arbitrator Nodes in Action

I will not abide this level of grandstanding without a demonstration of this ability (and hopefully some revealing use cases), so below are some brief test results from a replica 2+A geo cluster. Behold:

[email protected]:~# drbdadm status export-able

export-able role:Primary

disk:UpToDate

geo-nfs-b role:Secondary

peer-disk:UpToDate

geo-nfs-c role:Secondary

peer-disk:Diskless

As you can see, everything is happy. All of these nodes are connected and up to date. Nodes “geo-nfs-a” and “geo-nfs-b” are data nodes with disks. The node “geo-nfs-c” is a diskless DRBD arbitrator as well as a Booth arbitrator, and quorum has been enabled in this geo cluster (though that’s not reflected in this output). Geo clusters can be tricky to manage the datapath of, since they often operate outside of the scope of rapid decision-making mechanisms and even more often don’t have a method of fencing “sites” adequately. Using DRBD quorum in this case allows split-brains to be entirely prevented globally, rather than depending on several disconnected cluster controllers to manage things. This is much more stable, but requiring three sites with at least one full data replica each is very bandwidth-intensive as well as expensive. This is a perfect fit for an arbitrator node.

If we take one of the two data nodes offline, the cluster will still run. We’re still in contact with the arbitrator, and as long as we don’t lose that contact, quorum will be held:

[email protected]:~# drbdadm status export-able

export-able role:Primary

disk:UpToDate

geo-nfs-b connection:Connecting

geo-nfs-c role:Secondary

peer-disk:Diskless

So let’s make it unhappy. If we take the majority of nodes offline this cluster will freeze, suspending I/O and protecting data from split-brain:

[email protected]:~# drbdadm status export-able

export-able role:Primary suspended:quorum

disk:UpToDate quorum:no blocked:upper

geo-nfs-b connection:Connecting

geo-nfs-c connection:Connecting

Reconnecting only the arbitrator node will not result in a quorate cluster, as that arbitrator has no way of knowing whether that data node is actually valid:

[email protected]:~# drbdadm status export-able

export-able role:Primary suspended:quorum

disk:UpToDate quorum:no blocked:upper

geo-nfs-b connection:Connecting

geo-nfs-c role:Secondary

peer-disk:Diskless

Connecting the peer data node will result in I/O resuming even if the arbitrator is still not functioning:

[email protected]:~# drbdadm status export-able

export-able role:Primary

disk:UpToDate

geo-nfs-b-0 role:Secondary

peer-disk:UpToDate

geo-nfs-c connection:Connecting

Conclusion

I was able to use a Booth arbitrator node as a DRBD arbitrator node as well, both managing the cluster application state as well as securing the datapath against corruption with almost zero bandwidth usage beyond that of a 2N system. This is clearly a potent use-case and could not be more simple.

This new quorum mechanism could be applied identically to local high availability clusters, allowing reliable quorate systems to be established using a very low power third node. This can help to cheaply circumvent environmental problems that prevent adequate fencing, such as generic platform-agnostic deployment models, security-restricted environments, and even total lack of out-of-band fencing mechanisms (such as some public clouds or specialized hardware).

For posterity, the following DRBD configuration was used to accomplish this. Keep in mind, this was a geo cluster, so it’s using asynchronous replication (protocol A). Protocol C would be used for synchronous local replication:

# /etc/drbd.conf

global {

    usage-count yes;

}

common {

    options {

           auto-promote     yes;

           quorum           majority;

    }

}

resource export-able {

    volume 0 {

           device           minor 0;

           disk             /dev/drbdpool/export-able;

           meta-disk        internal;

    }

    on geo-nfs-a {

           node-id 0;

           address          ipv4 10.1.0.100:7000;

    }

    on geo-nfs-b {

           node-id 1;

           address          ipv4 10.2.0.100:7000;

    }

    on geo-nfs-c {

           node-id 2;

           volume 0 {

                   device       minor 0;

                   disk         none;

           }

           address          ipv4 10.3.0.100:7000;

    }

    connection-mesh {

           hosts geo-nfs-a geo-nfs-b geo-nfs-c;

           net {

                   protocol A;

           }

    }

}

David Hay on Linkedin
David Hay
Cluster Daemon at LINBIT
A long-time Linux system engineer, David Hay finds FOSS solutions to global problems as a Cluster Engineer at LINBIT. David started out with open source software back in the Linux 2.4 days, since then having planned and implemented countless clustered systems, leveraging HA and cloud technologies to great effect. When not liberating the enterprise world with free and open software, he spends his time tinkering with electronics and metalworking.
sync rate controller

DRBD and the sync-rate controller, part 3

This is an update to our previous two blog posts here and here. The goal with this post is to even further simplify the steps needed to tune the sync-rate controller. If you review the older posts, you’ll see that I omit a few options with this post, and even just simply pick some arbitrary starting values that work best in most deployments we’ve encountered.

I would also like to point out again that this is all about initial device synchronization and recovery resynchronization. This has no effect on the replication speeds which occur under normal replication when everything is in a healthy state.


Purpose of the sync rate controller

The dynamic sync-rate controller for DRBD was introduced way back in version 8.3.9. It was introduced as a way to slow down DRBD resynchronization speeds. The idea here is that if you have a write intensive application running atop the DRBD device, it may already be close to filling up your I/O bandwidth. We introduced the dynamic rate limiter to then make sure that recovery resync does not compete for bandwidth with the ongoing write replication. To ensure that the resync does not compete with application IO, the defaults lean towards the conservative side.

If the defaults seem slow to you or your use case, you can speed things up with a little bit of tuning in the DRBD configuration.

Tuning the sync rate controller

It is nearly impossible for DRBD to know just how much activity your storage and network backend can handle. It is fairly easy for DRBD to know how much activity it generates itself, which is why we tune how much network activity we allow DRBD to generate.

  • Set c-max-rate to 100% (or slightly more) than what your hardware can handle.
    • For example: if you know your network is capable of 10Gb/s, but your disk throughput is only 800MiB/s, then set this value to 800M.
  • Increase max-buffers to 40k.
    • 40k is usually a good starting point, but we’ve seen good results with anywhere between 20k to 80k.
  • Set c-fill-target to 1M.
    • Just trust us on this, and simply set it to ‘1M’.

This should be enough to get the resync rate going well beyond the defaults. Many people often tune the “c-*” sync rate controller setting, but never increase the max-buffers value. This may be partly our fault as we never mentioned it in the previous blog post, which is one reason I am revisiting this topic today.

Tuning the sync rate controller even further

Obviously, there is even further tuning we can do. Some of these, if tuned improperly, may have negative impacts on the application performance of programs writing to the DRBD device, so use caution. I would suggest starting with smaller values and working your way up if performing this tuning on production systems.

  • Set the resync-rate to ⅓ of the c-max-rate.
    • With the dynamic resync-rate controller, this value is only used as a starting point. Changing this will only have a slight effect, but will help things speed up faster.
  • Increase the c-min-rate to ⅓ of the c-max-rate.
    • It is usually advised to leave this value alone as the idea behind the dynamic sync rate controller is to “step aside” and allow application IO to take priority. If you really want to ensure things always move along at a minimum speed, then feel free to tune this a bit. As I mentioned earlier, you may want to start with a lower value and work up if doing this on a production system.
  • Set sndbuf-size and rcvbuf-size to 10M.
    • This is generally auto-tuned by the kernel, but cranking this up may help to move along the recovery resync speeds. There is also a possibility that this will lead to buffer-bloat, so tune these with caution. Again, on a production system, start with a value just a little over 4M and increase it slowly while observing the systems.

It is our hope that the information above will prove useful to some of our users and help possibly clear up some confusion regarding the resync tunables we have discussed in the past. As always, please feel free to drop us a comment below if you have any questions or anything you’d like to share.

Devin Vance on Linkedin
Devin Vance
First introduced to Linux back in 1996, and using Linux almost exclusively by 2005, Devin has years of Linux administration and systems engineering under his belt. He has been deploying and improving clusters with LINBIT since 2011. When not at the keyboard, you can usually find Devin wrenching on an American motorcycle or down at one of the local bowling alleys.

June 2019 Newsletter

What’s the Difference Between Off-Site Data Backup and Disaster Recovery (DR)?

Many businesses make the mistake of confusing “Disaster Recovery” with “off-site data backup.” Confusing the two may lead to paying a high price in the form of data loss and downtime when a disaster hits. Off-site backups are a necessary component of business continuity in the face of disaster for nearly any organization, but DR goes a step beyond – addressing service availability, not only data durability.

Off-Site Data Backup

Data backup describes the capturing of a point-in-time set of data to a local device, or to a device at another location. There are some online services that provide general data backups. Tape drives, USB drives, and network attached storage are able to save your data as well. One can typically view these backups whenever they wish, optionally restoring that data to wherever they want. Usually, the backups are taken repeatedly at regular intervals to keep up with data changes.

However, protecting your business from downtime is also very important. For example, if a critical server fails and you are running only a backup, services cannot be restored quickly. To get back up and running, the server would need replacement, data and software re-installed, and finally additional configuration applied. This could take anywhere from a few hours to a few days.

It takes a long time to restore data from off-site backups — even if employing high-throughput cloud services like AWS S3, it can still take an inordinate amount of time to copy the entire dataset over most WAN connections. Can you afford to lose days’ worth of business presence?

Disaster Recovery

Backing up your data is always imperative. Therefore, the question is whether or not you need a higher form of business continuity via Disaster Recovery. Knowing if you should also utilize Disaster Recovery in addition to backups depends on the availability requirements of your business. If you need your services and business functionality up and running as quickly as possible after a significant technical failure, or need to tolerate the complete failure of a location, then you need Disaster Recovery. Disaster Recovery prevents your company from significant losses of not only data, but also downtime.

It’s difficult to describe Disaster Recovery without first defining business continuity; a collection of policies, technologies, and practices which can be applied to survive significant destruction or unavailability of resources — often addressing a geographically significant scale. Disaster Recovery focuses on the technical aspects of business continuity.

Disaster Recovery mitigates the effects of significant loss from events such as natural disaster, a power grid failure, or human error.

In 2003, a high-voltage power line in Ohio brushed against some overgrown trees and shut down. This sort of problem usually triggers an alarm, but the alarm system had failed. As system operators were trying to diagnose the problem, three other lines also brushed into trees and switched off. The nearby lines shouldering the extra burden were overtaxed and caused a cascade of failures throughout southeast Canada and eight northeastern states.

This event – concluding to be the result of human error and equipment failures – caused 50 million people to lose power for up to two days and is recorded as the largest blackout in North American history. It cost roughly $6 billion dollars in economic losses.

Although this was a rare event, and by no means the only event that could cause a loss in data, current statistics indicate a blackout of this level will occur every 25 years. Companies using DR were able to cut over services to an already-established system and use the up-to-date data that was being actively replicated at the failed site.

You can learn more about Disaster Recovery (DR) in this ten minute whiteboard explanation here:

In short, Disaster Recovery enables your team of employees to get right back to work managing your customers’ data with very little downtime. DR keeps redundant sites as up-to-date as possible with each other, yet also available in a worst-case scenario.  

Overview

At some level, you need both data backup and Disaster Recovery. It’s important to understand what amount of downtime your business can tolerate. A few minutes? A few days? Are you prepared to suffer from data losses? Significant data losses happen under many different circumstances. It’s up to you to define the amount of loss you can accept in a worse-case scenario. For the applications you find most critical, it might make the most sense to hold live replicas off-site, and even have a mechanism to automatically fail-over your primary site to a secondary site to ensure continuity in the face of disaster.  

Learn More

Would you like to hear a more technical explanation of Disaster Recovery and High Availability? Check out LINBIT’s Disaster Recovery page or our videos and other detailed pages. We’re always happy to provide a sanity check on your DR strategy. Contact us to learn how we can help.

Disaster Recovery (DR) Explained Video

High Availability (HA) Explained Video

LINBIT’s Disaster Recovery

LINBIT HA

barcelona dawn familia sagrada

Container-Mania at Kubecon Barcelona!

Come and meet us at Kubecon in Barcelona – from May 20-23 we are in the Catalan capital!

As Cloud Native Computing Foundation’s (CNCF) flagship conference, KubeCon + CloudNativeCon Europe will bring together more than 10,000 technologists from thriving open source communities to further their education around cloud native computing. Maintainers and end users of CNCF’s hosted projects – including Kubernetes, Prometheus, Envoy, CoreDNS, OpenTracing, Fluentd, gRPC, containerd, rkt, CNI, Jaeger, Notary, TUF, Vitess, NATS, Linkerd, Helm, Rook, Harbor & etcd – along with other cloud native technologies will gather in Barcelona from May 20-23 – sharing insights around and encouraging participation in this fast-growing ecosystem.

Visit us at Booth S55!

LINBIT is announcing some exciting news at Kubecon: NVMe-oF with LINSTOR! LINSTOR can now be used as a standalone product, independent from DRBD. NVMe-oF supports Infiniband with RDMA and allows ultrafast performance, easily handling workloads for Big Data Analytics or Artificial Intelligence. Come say hello at Kubecon! Visit us at Booth S55. 

We are looking forward to you!

May 2019 Newsletter

resync-chart

How LINBIT Improved the DRBD Sync-Rate Logic

About nine years ago, LINBIT introduced the dynamic sync-rate controller in DRBD 8.3.9. The goal behind this was to tune the amount of resync traffic so as to not interfere or compete with application IO to the DRBD device. We did this by examining the current state of the resync, application IO, and network, ten times a second and then deciding on how many resync requests to generate.

We knew from experimentation that we could achieve higher resync throughput if we were to poll the situation during a resync more than ten times a second. However, we didn’t want to shorten this interval by default as this would uselessly consume CPU cycles for the DRBD installations on slower hardware. With DRBD 9.0.17, we now have some additional logic. The resync-rate controller will poll the situation both at 100ms AND when all resync requests are complete, even before the 100ms timer expires.

We estimate that these improvements will only be beneficial to storage and networks that are capable of going faster than 800MiB/s. For our benchmarks, I used two identical systems running CentOS 7.6.1810. Both are equipped with 3x Samsung 960 Pro M.2 NVMe drives. The drives are configured up in RAID0 via the Linux software RAID. My initial baseline test found that in this configuration, the disk can achieve a throughput around 3.1GiB/s (4k sequential writes). For the replication network, we have dual port 40Gb/s Mellanox ConnectX-5 devices. They’re configured as a bonded interface using mode 2 (Balance XOR). This crossover network was benchmarked at 37.7Gb/s (4.3GiB/s) using iperf3. These are actually the same systems used in my NVMe-oF vs. iSER test here.

Then, I ran through multiple resyncs as I found out how best to tune the DRBD configuration for this environment. It was no surprise that DRBD 9.0.17 looked to be the clear winner. However, this was fully-tuned. I wanted to find out what would happen if we scaled back the tuning slightly. I then decided to test while also using the default values for sndbuf-size and rcvbuf-size. The results were fairly similar, but surprisingly 9.0.16 did a little better. For my “fully tuned” test, my configuration was the default except for the following tuning:

net {
max-buffers 80k;
sndbuf-size 10M;
rcvbuf-size 10M;
}
disk {
resync-rate 3123M; # bytes/second, default
c-plan-ahead 20; # 1/10 seconds, default
c-delay-target 10; # 1/10 seconds, default
c-fill-target 1M; # bytes, default
c-max-rate 3123M; # bytes/second, default
c-min-rate 250k; # bytes/second, default
}

Three tests were ran. All tests were done with a 500GiB LVM volume using the /dev/md0 device as the physical volume. A blkdiscard was run against the LVM between each test. The first test was with no application IO to the /dev/drbd0 device and without the sndbuf-size and rcvbuf-size. The second was with no application IO, and fully-tuned as per the configuration above. The third test was again fully-tuned with the configuration above. However, immediately after the resource was promoted to primary, I began a loop that used ‘dd’ to write 1M chunks of zeros to the disk sequentially.

From the graph above, we can see that – when idle – DRBD 9.0.17 can resync at the speeds the physical disk will allow, where as 9.0.16 seems to top out around 2000MiB/s. DRBD 8.4.11 can’t seem to push past 1500MiB/s. However, when we introduce IO to the DRBD virtual disk, both DRBD 9 versions scale back the resync speeds to roughly the same speeds. Surprisingly, DRBD 8.4 doesn’t throttle down as much and hovers around 700MiB/s. This is most likely due to an increase in IO lockout granularity between versions 8 and 9. However, faster here is not necessarily desired. It is usually favorable to have the resync speeds throttled down in order to “step aside” and allow application IO to take priority.

Have questions? Submit them below. We’re happy to help!

Devin Vance on Linkedin
Devin Vance
First introduced to Linux back in 1996, and using Linux almost exclusively by 2005, Devin has years of Linux administration and systems engineering under his belt. He has been deploying and improving clusters with LINBIT since 2011. When not at the keyboard, you can usually find Devin wrenching on an American motorcycle or down at one of the local bowling alleys.
NFS Ansible Playbook

Request Your Highly Available NFS Ansible Playbook!

LINBIT wants to make your testing of our software easy! We’ve started creating Ansible playbooks that automate the deployment of our most commonly clustered software stacks.

The HA NFS playbook will automate the deployment of a HA NFS Cluster, using DRBD9 for replicated storage and Pacemaker as the cluster resource manager; quicker than you can brew a pot of coffee.

Email us for your free trial!

Be sure to write “Ansible” in the subject line.

Build an HA NFS Cluster using Ansible with packages from LINBIT.

System requirements:

  • An account at https://my.linbit.com(contact [email protected]).
  • Deployment environment must have Ansible2.7.0+andpython-netaddr.
  • All target systems must have passwordless SSH access.
  • All hostnames used in inventory file are resolvable (better to use IP addresses).
  • Target systems are CentOS/RHEL 7.

More information is available: LINBIT Ansible NFS Cluster on GitHub.

Linbeat - band, linux

A new star is born: LIN:BEAT – The Band

LINBIT is opening up a new division to specifically address our community’s desire to turn the music up to 11! As you all know, LINBIT is famous for its DRBD, Software-Defined Storage (SDS) and Disaster Recovery (DR) solutions. In a paradigm-shifting turn of events by management, LINBIT has decided to expand into the music industry. Since there is so much business potential in playing live concerts, LINBIT has transformed five of their employees into LIN:BEAT – The Band. These concerts are of course made highly-available by utilizing LINBIT’s own DRBD software.

Linbeat - band, linux

The Band will be touring all Cloud and Linux events around the globe in 2019. Band members use self-written code to produce their unique sound design, reminiscent of drum and bass and heavily influenced by folk punk. The urge to portray all the advantages of DRBD and LINSTOR is so strong they had to send a message to the world: Their songs tell the world about the ups and downs of administrators who handle big storage clusters. LIN:BEAT offers a variety of styles in their musical oeuvre: While “LINSTOR” is a heavy-driven rock song and “Snapshots” speaks to funk-loving people, even “Disaster Recovery,” a love ballad, made it into their repertoire. Lead singer, Phil Reisner, sings sotto voce about his lost love — a RAID rack called “SDS”. Reisner told the reporters, “Administrators are such underrated people. This is sadly unfortunate. We strive to give all administrators a voice. Even if it’s a musical one!”

Crowds will be jumping up and down in excitement when LIN:BEAT comes to town! Be there and code fair!

An Excerpt of the song “My first love is DRBD” written by Phil Reisner:

My first love is DRBD,

there has never been a fee

instead it serves proudly as open source

your replication has changed the course

It’s the crucial key for Linux’s destiny

cloudfest rust

Meet us at Cloudfest in Rust, Germany!

Come and meet LINBIT at CloudFest 2019 in Europa-Park, Rust, Germany. (23rd – 29rd of March 2019)

cloudfest 2019 booth linbit

CloudFest has made a name over the last few years as one of the the best cloud-focused industry events in which to network and have a good time. This year more than 7,000 people are attending the event. Attendees will hear from leaders in the business and get the latest industry buzz. 

The speaker line-up includes names like Dr. Ye Huang, Head of Solution Architects at Alibaba, Will Pemble, CEO at Goal Boss, Bhavin Turakhia, CEO at Flock, or Brian Behlendorf, Inventor of the Apache Web server. 

VISIT us at Booth H24!

LINBIT is announcing some exciting news at Cloudfest: NVMe-oF with LINSTOR! Meaning LINSTOR can now be used as a standalone product, independent from DRBD. NVMe-oF supports Infiniband with RDMA and allows ultrafast performance, easily handling workloads for Big Data Analytics or Artificial Intelligence. Come say hello at Cloudfest! Visit us at Booth H24. 

We are looking forward to you!

Booth visitors will be rewarded with a surprise that even your family will love! 🙂