
The advantage of separate control and data planes

Many storage systems have a monolithic design that combines the control plane and the data plane into a single application and a single protocol, but LINBIT’s more modular solution comes with a number of advantages.

What is a control plane or a data plane?

The most important task that any storage system must perform is providing access to the storage volumes that are used for various workloads, for example, databases, file servers or virtualization environments. This is what we refer to as the data plane – all the components that are necessary to actually get data from the storage to the user and from the user to the storage.

Another task is the management of the configuration of storage volumes, which is what we refer to as the control plane. With the rise of more dynamic systems like containerization, virtualization and cloud environments, and the associated software-defined storage systems, where storage volumes are frequently reconfigured, this task is becoming increasingly important.


Why it is important: Availability

If you need to shut down part of your infrastructure, for instance because you are updating hardware, it is important that the most fundamental services remain available. Storage is one of those fundamental services, since most of the other systems rely on it.

A storage system with a modular design that provides independent control and data planes brings your infrastructure one step closer to high availability.

Independent control and data plane

Many storage systems can only provide access to storage volumes if all of their subsystems are online. The design may even be completely monolithic, so that the management functions and the storage access functions are contained within a single application that uses a single network protocol.

In LINBIT’s DRBD-based storage systems, only the most fundamental control plane functions are tightly coupled with the data plane and the operation of storage volumes. High-level control functions, like managing storage volumes and their configuration, managing cluster nodes, or providing automatic selection of cluster nodes for the creation of storage volumes, are provided by the LINSTOR storage management software. These two components, DRBD and LINSTOR, are fundamentally independent of each other.

DRBD storage volumes, even those that are managed by LINSTOR, remain accessible even if the LINSTOR software is unavailable. This means that the LINSTOR software can be shut down, restarted or upgraded while users retain access to existing storage volumes. While it is less useful, the same is even true the other way around: a LINSTOR controller that does not rely on storage provided by DRBD will continue to service storage management requests even if the storage system itself is unavailable. The changed configuration is simply applied whenever the actual storage system is online again.
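
You can observe this independence directly on a test node. The following is a minimal sketch, assuming the controller runs as the linstor-controller systemd service and that a LINSTOR-managed DRBD resource is already in use on that node:

# Stop the control plane; the data plane keeps serving I/O
systemctl stop linstor-controller
drbdadm status        # DRBD resources keep their roles and keep replicating
# Any filesystem mounted on top of a DRBD device remains usable in the meantime
systemctl start linstor-controller   # management functions become available again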

Robert Altnoeder on Linkedin
Robert Altnoeder
Robert joined the LINBIT development team in 2013. He had worked with
DRBD at a startup company in the SaaS field before joining LINBIT. His
current primary field of work is the architecture and implementation of
LINSTOR, the cluster management component of LINBIT's SDS software.

How to set up LINSTOR on Proxmox VE

In this technical blog post, we show you how to integrate DRBD volumes into Proxmox VE via a storage plugin developed by LINBIT. The advantages of using DRBD include a configurable number of data replicas (e.g., 3 copies in a 5-node cluster) and access to the data on every node, which enables very fast VM live migrations (usually only a few seconds, depending on memory pressure).

Setup

The rest of this post assumes that you have already set up Proxmox VE (the LINBIT example uses 4 nodes) and have created a PVE cluster consisting of all nodes. While this post is not meant to replace the DRBD User's Guide, we try to show a complete setup.

The setup consists of two important components:

  1. LINSTOR, which manages DRBD resource allocation.
  2. The linstor-proxmox plugin, which implements the Proxmox VE storage plugin API and executes LINSTOR commands.

In order for the plugin to work, you must first create a LINSTOR cluster.

LINSTOR Cluster

We assume here that you have already set up the LINBIT Proxmox repository as described in the User's Guide. Once the repository is configured, execute the following commands on all cluster nodes. First, we install the low-level infrastructure (i.e., the DRBD 9 kernel module and drbd-utils):

apt install pve-headers
apt install drbd-dkms drbd-utils
rmmod drbd; modprobe drbd
grep -q drbd /etc/modules || echo "drbd" >> /etc/modules

The next step is to install LINSTOR:

apt install linstor-controller linstor-satellite linstor-client
systemctl start linstor-satellite
systemctl enable linstor-satellite

Now, decide which of your hosts should be the current controller node and start the linstor-controller service on that particular node only:

systemctl start linstor-controller

Volume creation

Obviously, DRBD needs storage to create volumes. In this post we assume a setup where all nodes contain an LVM thin pool called drbdpool (a sketch for creating such a pool follows the node setup below). In our sample setup, we created it on the pve volume group, but your storage topology might differ. On the node that runs the controller service, execute the following commands to add your nodes:

linstor node create alpha 10.0.0.1 --node-type Combined
linstor node create bravo 10.0.0.2 --node-type Combined
linstor node create charlie 10.0.0.3 --node-type Combined
linstor node create delta 10.0.0.4 --node-type Combined

“Combined” means that this node is allowed to execute a LINSTOR controller and/or a satellite, but a node does not have to execute both. So it is safe to specify “Combined”; it does not influence the performance or the number of services started.
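
If you still need to create the thin pool that this example assumes, it could be prepared on each node roughly as follows. This is only a sketch: the pve volume group comes from the example setup above, and the 100G size is an arbitrary assumption.

# Create an LVM thin pool named drbdpool in the pve volume group (run on every node)
lvcreate -L 100G -T pve/drbdpool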

The next step is to configure a storage pool definition. As described in the User’s guide, most LINSTOR objects consist of a “definition” and then concrete instances of such a definition:

linstor storage-pool-definition create drbdpool

Now is a good time to mention that the LINSTOR client provides handy shortcuts for its sub-commands. The previous command could have been written as linstor spd c drbdpool. The next step is to register every node's storage pool:

for n in alpha bravo charlie delta; do \
linstor storage-pool create $n drbdpool lvmthin pve/drbdpool; \
done
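
At this point it is worth verifying that all nodes are online and that every storage pool has been registered. These are standard LINSTOR client commands; the exact output format depends on the client version:

linstor node list
linstor storage-pool list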

DRBD resource creation

After that we are ready to create our first real DRBD resource:

linstor resource-definition create first
linstor volume-definition create first 10M --storage-pool drbdpool
linstor resource create alpha first
linstor resource create bravo first

Now, check with drbdadm status that “alpha” and “bravo” contain a replicated DRBD resource called “first”. After that, this dummy resource can be deleted on all nodes by deleting its resource definition:

linstor resource-definition delete -q first

 

LINSTOR Proxmox VE Plugin Setup

As DRBD and LINSTOR are already set up, the only things missing are installing the plugin itself and configuring it.

apt install linstor-proxmox

The plugin is configured via the file /etc/pve/storage.cfg:

drbd: drbdstorage
content images, rootdir
redundancy 2
controller 10.0.0.1

It is not necessary to copy that file to the other nodes, as /etc/pve is already a replicated file system. After the configuration is done, you should restart the following service:

systemctl restart pvedaemon

After this setup is done, you are able to create virtual machines backed by DRBD from the GUI. To do so, select “drbdstorage” as storage in the “Hard Disk” section of the VM. LINSTOR selects the nodes that have the most free storage to create the replicated backing devices.

Distribution

The interested reader can check which nodes were selected via linstor resource list. While interesting, it is important to know that the storage can be accessed by all nodes in the cluster via a DRBD feature called “diskless clients”. So let's assume “alpha” and “bravo” had the most free space and were selected, and the VM was created on node “bravo”. Using the low-level tool drbdadm status, we now see that the resource is created on two nodes (i.e., “alpha” and “bravo”) and that the DRBD resource is in the “Primary” role on “bravo”.

Now we want to migrate the VM from “bravo” to node “charlie”. This is again done via a few clicks in the GUI, but the interesting steps happen behind the scenes: the storage plugin realizes that it has access to the data on “alpha” and “bravo” (our two replicas) but also needs access on “charlie” to execute the VM. The plugin therefore creates a diskless assignment on “charlie”. When you execute drbdadm status on “charlie”, you see that three nodes are now involved in the overall picture:

• Alpha with storage in Secondary role
• Bravo with storage in Secondary role
• Charlie as a diskless client in Primary role

Diskless clients are created (and deleted) on demand without further user interaction, besides moving around VMs in the GUI. This means that if you now move the VM back to “bravo”, the diskless assignment on “charlie” gets deleted as it is no longer needed.

If you had moved the VM from “charlie” to “delta” instead, the diskless assignment for “charlie” would have been deleted, and a new one for “delta” would have been created.
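
For the curious: the plugin's diskless assignments are ordinary LINSTOR resources without local backing storage, so what happens behind the scenes roughly corresponds to the sketch below. It is illustrative only; vm-100-disk-1 stands in for whatever resource name the plugin generated for your VM disk, and the exact flag has varied between client versions (--diskless in early releases, --drbd-diskless later):

# Give a node network access to the data without a local replica
linstor resource create charlie vm-100-disk-1 --diskless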

Probably even more interesting for you: all of this, including the VM migration itself, happens within seconds, without moving the actual replicated storage contents.

Next Steps

So far, we have created a replicated and highly available setup for our VMs, but the LINSTOR controller and especially its database are not yet highly available. In a future blog post, we will describe how to make the controller itself highly available using only software already included in Proxmox VE (i.e., without introducing complex technologies like Pacemaker). This will be achieved with a dedicated controller VM that will be provided by LINBIT as an appliance.

Roland Kammerer
Software Engineer at Linbit
Roland Kammerer studied technical computer science at the Vienna University of Technology and graduated with distinction. Currently, he is a PhD candidate with a research focus on time-triggered real-time systems and works for LINBIT in the DRBD development team.

The Technology Inside LINSTOR (Part II)

In our first look into LINSTOR, you learned about the single communication protocol, transaction safety and modularity features. In this second part, we dive deeper into how LINSTOR is built.

Fault Tolerance

Keeping the software responsive is one of the more difficult problems that we have to deal with in LINSTOR’s design and implementation. The Controller/Satellite split is one fundamental part of LINSTOR’s design toward fault tolerance, but there are many other design and implementation details that improve the software’s robustness, and many of them are virtually invisible to the user.

On the Controller side, communication and persistence are the two main areas that can lead to the software becoming unresponsive. The following problems could lead to an unusable network communication service on the Controller side:

  • Stopping or reconfiguring a network interface
  • Address conflicts
  • In-use TCP/IP ports

All network I/O in LINSTOR is non-blocking, so that unresponsive network peers do not lead to a lockup of LINSTOR’s network communication service. While the network communication service has been designed to recover from many kinds of problems, it additionally allows the use of multiple independent network connectors, so that the system remains accessible even in the case where a network connector requires reconfiguration to recover. The network connectors can also stop and start independently, allowing reinitialization of failed connectors.

The Controller can obviously not continue normal operation while the database service is inoperative, which could of course happen if an external database is used, for example, due to a downtime of the database server or due to a network problem. Once the database service becomes available again, the Controller will recover automatically, without requiring any operator intervention.

Satellites in LINSTOR

The Satellite side of LINSTOR does not run a database, and a single unresponsive Satellite is less critical for the system as a whole than an unresponsive Controller. Nonetheless, if a Satellite encounters a failure during the configuration of one storage resource, that should not prevent it, even temporarily, from being able to service requests for the configuration of other resources.

The biggest challenge regarding fault tolerance on the Satellite side is the fact that the Satellite interacts with lots of external programs and processes that are neither part of LINSTOR nor under the direct control of the Satellite process. These external components include system utilities required for the configuration of backend storage, such as LVM or ZFS commands, processes observing events generated by the DRBD kernel module whenever the state of a resource changes, block device files that appear or disappear when storage devices are reconfigured, and similar kinds of objects.

To achieve fault tolerance on the Satellite side, the software has been designed to deal with many possible kinds of malfunctions of the external environment that LINSTOR interacts with. This includes the time-boxing and the enforcement of size limits on the amount of data that is read back when executing external processes, as well as recovery procedures that attempt to abort external processes that have become unresponsive. There is even a fallback that reports a malfunctioning operating system kernel if the operating system is unable to end an unresponsive process. The LINSTOR code also contains a mechanism that can run critical operations, such as the attempt to open a device file (which may block forever due to faulty operating system drivers), asynchronously, so that even if the operation blocks, LINSTOR would normally at least be able to detect and report the problem.

Usability

With feature richness, customizability and flexibility also comes complexity. What can be done to keep the system as easy to understand and use as possible is to make it intuitive, self-explanatory and unambiguous.

Clarity in the naming scheme of objects turned out to be an important factor for a user’s ability to use the software intuitively. In our previous product, drbdmanage, users would typically look for commands to either create a “resource” or a “volume.” However, the corresponding commands, “new-resource” and “new-volume”, only define a resource and its volumes, but do not actually create storage resources on any of the cluster nodes. Another command, “assign”, was required to assign the resource to cluster nodes, thereby creating the actual storage resource, and users sometimes had a hard time finding this command.

For this reason, the naming of objects was changed in LINSTOR. A user looking for a command to create a resource will find the command that actually creates a storage resource, and one of the required parameters for this command is the so-called resource definition. It is quite obvious that the next step would be to look for a command that creates a resource definition. This kind of naming convention is supposed to make it easier for users to figure out how to intuitively use the application.

LINSTOR is also explicit with replies to user commands, as well as with return codes for API calls. The software typically replies with a message that describes whether or not the command was successful, what the software did, and to which objects the message refers. Error messages that include a description of the problem cause or hints for possible correction measures also follow a uniform structure.

Similar ideas also apply to return codes, which include not only the error code (e.g., Object exists), but also information on which objects the error refers to (e.g., the type of object and the identifier specified by the user).

Reporting System

To make diagnosing errors easier, LINSTOR also generates a unique identifier for every error that is logged. The traditional logging and error reporting on Unix/Linux systems basically consists of single text lines logged to one large logfile, sometimes even a single logfile for many different applications. An application could log multiple lines for each error, but support for logging multiple lines atomically (instead of interleaved with log lines for other errors, possibly from other applications) is virtually nonexistent.

For this reason, LINSTOR logs a single-line short description of the error, including the error identifier, to the system log, but also logs the details of the error to a report file that can be found using the error identifier. The detailed log report also contains information such as the component where the error occurred, the exact version of the software that was used, debug information, nested errors, and many other details that may help with problem mitigation.
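
In recent versions, these error reports can also be retrieved directly through the LINSTOR client; a quick sketch (subcommand availability depends on your client version, and the error identifier is a placeholder):

linstor error-reports list
linstor error-reports show <error-id>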

Implementation Quality

While the various design characteristics are important factors for creating a powerful and robust software system, even the best design cannot produce a reliable application if it is not implemented with high quality.

The first step, even before we wrote the code, was to choose a programming language that would be suitable for the task. While our previous product, drbdmanage, and the current LINSTOR client are implemented in Python, the LINSTOR server-side components (the Controller and Satellite) are implemented in Java. A server application that manages highly available storage systems should obviously be designed and implemented much more carefully than the typical single-user desktop application. Java is a very strict programming language that provides strong static typing and checked exceptions and allows only a few implicit type conversions – features that also enable IDEs to perform static checking of the code while it is being written.

Obviously, while it can make writing high quality code easier, the choice of programming language alone does not automatically lead to better code. To keep LINSTOR’s code clean, readable, self-explaining and maintainable, we apply many of the best practices that have proven successful in the creation of mission-critical software systems. This includes more important things like choosing descriptive variable names or maintaining a clear and logical control flow, but even extends to less technical details like consistent formatting of the source code. The coding standard that we apply to produce high-quality code is based on standards from the aviation industry and is among the strictest coding standards that exist today.

Easy Validity Checks

There is also a strong focus on correctness and strict checking in the way LINSTOR is implemented. As an example, the name of objects like nodes, resources or storage pools is not simply a String, but an object that can only be constructed with a name that is valid for that kind of object. It is impossible to create a resource name object that contains invalid characters, or to accidentally use a resource name object as the identifier for the creation of a storage pool. As a result, developers cannot forget to perform a validity check on a node name or on a volume number, and they also cannot apply the wrong check by accident.

All those considerations, design characteristics and implementation methods are important factors that helped us create dependable and user-friendly software that we hope will prove useful and valuable to users like you.

 

If you have any questions or suggestions concerning LINSTOR, please leave a comment or write an email to [email protected].

 

Robert Altnoeder on Linkedin
Robert Altnoeder
Robert joined the LINBIT development team in 2013. He had worked with
DRBD at a startup company in the SaaS field before joining LINBIT. His
current primary field of work is the architecture and implementation of
LINSTOR, the cluster management component of LINBIT's SDS software.

The technology inside LINSTOR (Part I)

Spotlight on LINSTOR’s design and technology: What we do and how we do it to create a powerful, flexible and robust storage cluster management software

LINSTOR is an application that is typically integrated with highly automated systems, such as software defined storage systems or virtualization environments. Users often interact with the management interface of some other application that uses LINSTOR to manage the storage required for that application’s use case, which also means that the users may not have direct access to the storage systems or to the LINSTOR user interface.

A single storage cluster can be the backend of multiple independent application systems, so the biggest challenge for a software like LINSTOR is to remain responsive even if some actions or components of the cluster fail. At the same time, the software should be flexible enough to cover all use cases, to enable future extension or modification, and despite all the complexity that is the result of these requirements, it should at the same time be easy to understand and easy to use for the administrators who are tasked with installing and maintaining the storage system.

It is quite clear to anyone who has worked on a bigger software project as a developer that many of those requirements work against each other. Customizability, flexibility and an abundance of features cause complexity, but complexity is the natural enemy of usability, reliability and maintainability. When we started the development of LINSTOR, our challenge was to design and implement the software so that it would achieve our goals with regard to feature richness and flexibility while at the same time remaining reliable and easy to use.

Modularity

One of the most important aspects of LINSTOR’s design is its modularity. We divided the system into two components, the Controller and the Satellite, so that the Controller component could remain as independent as possible from the Satellite component – and vice versa.

Even inside those two components, many parts of the software are exchangeable – the communication layer, the serialization protocol, the database layer, all of its API calls, even all of the debug commands that we use for internal development, as well as many other implementation details are exchangeable parts of the software. This provides not only a maximum of flexibility for future extensions, it also acts as a sort of safety net. For example, if support for the database or the serialization protocol that we use currently were dropped by their maintainers, we could simply exchange those parts without having to modify every single source code file of the project, because implementation details are hidden behind generic interfaces that connect various parts of our software.

Another positive side effect is that many of those components, being modular, are naturally able to run multiple differently configured instances. For example, it is possible to configure multiple network connectors in LINSTOR, each bound to different network interfaces or ports.


A single communication protocol

As a cluster software, LINSTOR must of course have some mechanism to communicate with all of the nodes that are part of the cluster. Integration with other applications also requires some means of communication between those applications and the LINSTOR processes, and the same applies to any kind of user interface for LINSTOR.

There are lots of different technologies available, but many of them are only suitable for certain kinds of communication. Some clusters use distributed key/value stores like etcd for managing their configuration, but use D-Bus for command line utilities and a REST interface for connecting other applications.

Instead of using many different technologies, LINSTOR uses a single versatile network protocol for communication with all peers. The protocol used for communication between the Controller and the Satellites is the same as the one used for communication between the Controller and the command line interface or any other application. Since this protocol is implemented on top of standard TCP/IP connections, all aspects of LINSTOR's communication are network-transparent. An optional SSL layer can provide secure encrypted communication. Using a single mechanism for communication also means less complexity, as the same code can be used for implementing different communication channels.

Transaction-safety

Even though LINSTOR keeps its configuration objects in memory, there is an obvious need for some kind of persistence. Ideally, what is kept in memory should match what is persisted, which means that any change should be a transaction, both in memory and on persistent storage.

Most Unix/Linux applications have traditionally favored line-based text files for the configuration of the software and for persisting its state, whereas LINSTOR keeps its configuration in a database. Apart from the fact that a fully ACID-compliant database is an ideal foundation for building a transaction-safe application, using a database also has other advantages. For example, if an upgrade of the software requires changes to the persistent data structures, the upgrade of the data can be performed as a single transaction, so that the result is either the old version or the new version of the data, but not some broken state in between. Database constraints also provide an additional safeguard that helps ensure the consistency of the data. Assuming there were a bug in our software that caused it to fail to detect duplicate volume numbers being assigned to storage volumes, the database would abort the transaction for creating the volume due to constraint violations, thereby preventing inconsistencies in the corresponding data structures.

To avoid requiring users to set up and maintain a database server, LINSTOR uses its own integrated database by default – it is simply started as an integral part of the Controller component. Optionally, the Controller can also access a centralized database by means of a JDBC driver.

Read more in the second blog post! 

Find LINSTOR in our Github repository

Robert Altnoeder on Linkedin
Robert Altnoeder
Robert joined the LINBIT development team in 2013. He had worked with
DRBD at a startup company in the SaaS field before joining LINBIT. His
current primary field of work is the architecture and implementation of
LINSTOR, the cluster management component of LINBIT's SDS software.

Linux Data Deduplication and Compression: One more reason to use block level data replication.

Having recently returned from my 6th Red Hat Summit (RHS), I’m writing this blog to answer a common question: “why replicate at the block level?” Using block-level replication, we can easily add high availability or disaster recovery features to any application that doesn’t natively support them.

The most frequently asked question we heard at RHS was, “how do you compare to [insert application replication or filesystem here]?” In most cases, the answer was, “LINBIT's replication software, DRBD, replicates data at the block level.” It would be an extreme task to run performance comparisons against all of the other replication technologies on the market, so generally we provide background information, including:

 

  • DRBD can usually replicate with 1-3 percent overhead to the cluster’s backing disks, as measured by FIO
  • In dual-primary mode, overhead increases to 15-20 percent
  • DRBD is compatible with any application or Linux filesystem, and is effective at replicating multiple applications simultaneously.
  • DRBD has a read-balancing feature. If you are running a read-intensive application, DRBD will pass reads through to secondary nodes once the primary is running at maximum capacity, enabling you to leverage all of your replicated systems (a configuration sketch follows this list). One test showed 1.7x the read performance compared to the advertised speed of the drive.
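
As an illustration of the read-balancing feature mentioned in the last bullet point: it is enabled in the disk section of a DRBD resource configuration. The fragment below is only a sketch; the when-congested-remote policy is one of several documented in drbd.conf(5), and whether it fits depends on your workload:

disk {
    read-balancing when-congested-remote;   # offload reads to a peer while the local backing disk is congested
}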

Deduplication and Compression

Generally, it comes down to efficiency. EMC, NetApp, and the other big storage players use block level replication in their appliances because this way the replication doesn’t need to go “all the way up the stack.” It enables flexibility, stability, and performance. And now, Red Hat has given us one more reason to replicate at the block level: Deduplication and Compression.

In the most recent Red Hat Enterprise Linux 7.5 release, Red Hat announced the integration of Red Hat VDO, or Virtual Data Optimizer. VDO provides deduplication and compression in Linux environments. Though it can be paired with other replication technologies, it can only be fully leveraged when the replication sits underneath the VDO device. Why? You want to deduplicate and compress your data before replicating it for efficiency gains.
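
To make the layering concrete, here is a minimal sketch of putting VDO on top of an existing DRBD device, so that data is deduplicated and compressed before DRBD replicates it. The device name /dev/drbd0, the VDO volume name vdo0 and the mount point are assumptions; the vdo utility ships with RHEL 7.5:

# Run on the node where the DRBD resource is currently Primary
vdo create --name=vdo0 --device=/dev/drbd0
mkfs.xfs -K /dev/mapper/vdo0     # -K skips discards at mkfs time, as recommended for VDO volumes
mount -o discard /dev/mapper/vdo0 /mnt/data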

Effective transfer times

According to Louis Imershein, Red Hat’s Principal Product Manager for data reduction technologies, “Solutions like LINBIT’s DRBD are able to capture data below the VDO layer.  This means that datasets that benefit from deduplication and compression get replicated in their dehydrated form. With less data to move, Red Hat Enterprise Linux customers with LINBIT DRBD can benefit from faster effective transfer times and reduced bandwidth requirements.”

So, as you're thinking about underlying storage for your applications, ensure you are using a solution that allows you to maximize the benefit of the Linux utilities built into, and around, your operating system. Thanks to Red Hat, block level replication is now more important than ever.

 

Greg Eckert on Linkedin
Greg Eckert
In his role as the Director of Business Development for LINBIT America and Australia, Greg is responsible for building international relations, both in terms of technology and business collaboration. Since 2013, Greg has connected potential technology partners, collaborated with businesses in new territories, and explored opportunities for new joint ventures.

Cluster-wide management of replicated storage with LINSTOR

The new generation of LINBIT’s storage management system focuses on ease-of-use, resilience and scalability.

Today's IT installations often consist of many individual servers, each running some part of the software infrastructure that together forms the kind of service that the installation is supposed to provide. Software processes rely on data, and high availability or disaster recovery solutions typically include replication of that data to one or more physically independent systems.

LINSTOR is the new generation of the software component that implements the automatic management of replicated storage resources in the LINBIT SDS system. Besides adding new features that users have previously requested, such as the ability to make use of multiple-tier storage, LINBIT has also improved the existing features.


Linstor features

Ease-of-use

Our experience has shown that administrators of complex IT environments typically struggle with two things: figuring out how to make the system do what they want, and determining the cause of the problem when a system fails to do what the administrators expect. Creating a new product has given us the opportunity to consider these issues during the design phase and to focus on making the new software easier to use and troubleshoot. Two examples of related enhancements are the more consistent and logical naming of LINSTOR objects and commands, and the greatly enhanced logging and problem reporting.

Resilience

Another area of improvement that we focused on is the resilience of the system as a whole, which depends not only on the LINSTOR software, but also on the entire external environment. For this reason, we designed LINSTOR to manage unexpected changes and to recover from many different types of failures of external components.

Scalability

LINSTOR greatly increases scalability by its ability to perform changes on multiple resources and on multiple nodes concurrently, while still remaining responsive to new requests.

Multi tier storage

Many users have requested the support of multiple-tier storage, and we are pleased to announce that, by adding the concept of storage pools, it has been implemented in LINSTOR. We made this a flexible feature, so that multiple storage pools can be configured, even using different storage backend drivers per storage pool and/or per node if necessary.

The new software is also capable of dealing with multiple network interface cards, each of which can be used as the replication link for DRBD resources or as the communication link for LINSTOR. This feature enables splitting the control network (providing LINSTOR communication) from the data network (providing DRBD replication link communication). IPv6 is supported for both LINSTOR communication and DRBD replication links.
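
As a brief illustration of how a replication network could be split off, the sketch below registers an additional network interface on a node and sets the PrefNic property so that DRBD traffic for a storage pool prefers it. The node name, interface name, address and pool name are placeholders; see the LINSTOR User's Guide for the authoritative syntax:

linstor node interface create alpha data_nic 192.168.100.1
linstor storage-pool set-property alpha drbdpool PrefNic data_nic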

Production release roadmap

The roadmap for the production release includes support for:

  • taking snapshots of replicated resources
  • thinly provisioned LVM storage
  • ZFS storage
  • encrypted and authenticated network communication within LINSTOR
  • taking advantage of LINSTOR's multi-user capability

Robert Altnoeder on Linkedin
Robert Altnoeder
Robert joined the LINBIT development team in 2013. He had worked with
DRBD at a startup company in the SaaS field before joining LINBIT. His
current primary field of work is the architecture and implementation of
LINSTOR, the cluster management component of LINBIT's SDS software.

Split Brain? Never Again! A New Solution for an Old Problem: DRBD Quorum

While attending OpenStack Summit in Atlanta, I sat in a talk about the difficulties of implementing High Availability (HA) clusters. At one point, the speaker presented a picture of a split brain and discussed the challenges of resolving one and of implementing STONITH in certain environments. As many of you know, “split brain” is a condition that can happen when each node in a cluster thinks that it is the only active node. The system as a whole loses grip on its “state”; nodes can go rogue, and data sets can diverge without making it clear which one is primary. Data loss or data corruption can result, but there are ways to make sure this doesn't happen, so I was interested in probing further.

Fencing is not always the solution


The Split brain problem can be solved by DRBD Quorum.

To make it more interesting, it turned out that the speaker's company uses DRBD and Pacemaker for HA, a setup that is very familiar to us. After the talk, I approached the speaker and recommended that they consider “fencing” as a way to avoid split brain. Fencing regulates access to a shared resource and can be a good safeguard. However, best practice is that the fencing mechanism should not depend on the same communication path as the one it is trying to protect, so it needs a separate communication path. Unfortunately, in his environment, redundant networking was not possible. We needed another method.

Split brain is solved via DRBD Quorum

After talking to the speaker, it was clear to me that a new option for avoiding split brain or diverging data sets was needed, since existing solutions may not always be feasible in certain infrastructures. This got me thinking about the various options for avoiding split brain and how fencing could be implemented by using the built-in communication found in DRBD 9. It turns out that DRBD 9's capability of mirroring across more than two nodes is a viable solution.

That idea sparked the work on the newest feature in DRBD: Quorum.

Shortly thereafter, the LINBIT team developed and integrated a working solution into DRBD. The code was pushed to the LINBIT repository and ready for testing.

Interest was almost immediate!

Later on, I happened to meet a few folks from IBM UK. They were working on IBM MQ Advanced Software, the well-known messaging middleware software that helps integrate applications and data across multiple platforms. They intended to use DRBD for their replication needs and quickly became interested in the idea of using a Quorum mechanism to mitigate split-brain situations.

DRBD Quorum takes a new perspective

The DRBD Quorum feature takes a new approach to avoiding data divergence. A cluster partition may only modify the replicated data set if the number of nodes that can communicate is greater than half of the overall number of nodes within the defined cluster. By only allowing writes on a node that can reach more than half of the nodes in the cluster, we avoid creating a diverging data set.

The initial implementation of this feature would cause any node that lost quorum (and was running the application/data set) to be rebooted. Removing access to the data set is required to ensure the node stops modifying data. After extensive testing, the IBM team suggested a new idea: instead of rebooting the node, terminate the application. This action then triggers the already available recovery process, forcing services to migrate to a node with quorum!

Attractive alternative to fencing

As usual, the devil is in the details. Getting the implementation right, with the appropriate resync decisions, was not as straightforward as one might think. In addition to our own internal testing, many IBM engineers tested it as well. We are happy to report that the current implementation does exactly what was expected!

Bottom line:

If you need to mirror your data set three times, the new DRBD Quorum feature is an attractive alternative to hardware fencing.
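
For readers who want to try it out, quorum is configured in the options section of a DRBD 9 resource. The snippet below is only a sketch; the resource name is a placeholder, and the full set of options is described in the user's guide linked below:

resource r0 {
    options {
        quorum majority;         # a partition may only write while it sees a majority of the nodes
        on-no-quorum io-error;   # fail I/O instead of freezing, so the application terminates and can be recovered elsewhere
    }
    # ... connection and volume configuration as usual ...
}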

In case you want to learn more about the Quorum implementation in DRBD, please see the DRBD 9 user's guide:
https://docs.linbit.com/docs/users-guide-9.0/#s-feature-quorum
https://docs.linbit.com/docs/users-guide-9.0/#s-configuring-quorum

Image  (Lloyd Fugde – stock.adobe.com)

Philipp Reisner on Linkedin
Philipp Reisner
Philipp Reisner is founder and CEO of LINBIT in Vienna/Austria. His professional career has been dominated by developing DRBD, a storage replication for Linux. Today he leads a company of about 30 employees with locations in Vienna/Austria and Portland/Oregon.

 

 

LINBIT’s DRBD ships with integration to VCS

The LINBIT DRBD software has been updated with an integration for Veritas InfoScale Availability (VIA). VIA, formerly known as Veritas Cluster Server (VCS), is a proprietary cluster manager for building highly available clusters on Linux; examples of clustered applications are network file shares, databases, or e-commerce websites. VCS solves the same problem as the open source Pacemaker project.

Yet, in contrast to Pacemaker, VCS has a long history on the Unix Platform. VCS came to Linux as Linux began to surpass legacy Unix platforms. In addition to its longevity, VCS has a strong and clean user experience. For example, VCS is ahead of the Pacemaker software when it comes to clarity of log files. Notably, the Veritas Cluster Server has slightly fewer features than Pacemaker. (With great power comes complexity!)


The gear runs even smoother. DRBD has an integration for VCS.

VCS integration for DRBD

Since January 2018, DRBD has been shipping with an integration for VCS. Users are now able to use VCS instead of Pacemaker and even control DRBD via VCS. The integration consists of two agents, DRBDConfigure and DRBDPrimary, which enable drbd-8.4 and drbd-9.0 for VCS.

Full documentation can be found here on our website:

https://docs.linbit.com/docs/users-guide-9.0/#s-feature-VCS

and

https://github.com/LINBIT/drbd-utils/tree/master/scripts/VCS

Besides VCS, LINBIT DRBD supports a variety of Linux cluster software, so you can keep your system up and running:


Pacemaker 1.0.11 and up
Heartbeat 3.0.5 and up
Corosync 2.x and up

 

Reach out to [email protected] for more information.

We are driven by the passion of keeping the digital world running. That's why hundreds of customers trust our expertise, services and products. Our open source product DRBD has been installed several million times. LINBIT established DRBD® as the industry standard for High Availability (HA) and data redundancy for mission-critical systems. DRBD enables disaster recovery and HA for any application on Linux, including iSCSI, NFS, MySQL, Postgres, Oracle, virtualization and more.

Philipp Reisner on Linkedin
Philipp Reisner
Philipp Reisner is founder and CEO of LINBIT in Vienna/Austria. His professional career has been dominated by developing DRBD, a storage replication for Linux. Today he leads a company of about 30 employees with locations in Vienna/Austria and Portland/Oregon.

 

Why Does Higher Education Require Always-On Capabilities?

People understand the importance of hospital systems being Highly Available. This is easy to explain, since people's LIVES depend on medical equipment and information being accessible at all times. Likewise, people understand the importance of banks needing High Availability (HA): they expect access to their MONEY on demand and want it protected. You don't have to be a techie to quickly understand why hospitals and banks need to be constantly available. However, the need for HA at educational institutions is a bit more difficult to identify at first, because they are not often thought of as places where 'mission-critical' systems are a real requirement. I believe the story is told less because it has an underwhelming shock factor: people's lives are not at stake, nor is their money hanging in the balance. At LINBIT, we have many educational customers, including prestigious universities, and we wanted to get their perspective on why HA and why LINBIT.

Dreaded Day of Downtime

Some say that no one dreads a day of downtime like a storage admin.

I disagree. Sure, the storage admins might be responsible for recovering a whole organization if an outage occurs; and sure, they might be the ones who lose their jobs from an unexpected debacle, but I would speculate that others have more to lose.

First, the company's reputation takes a big, possibly irreparable hit with both clients and employees. Damage control usually lasts far longer than the original outage. Take the United Airlines case from earlier in 2017, when a computer malfunction led to the grounding of all domestic flights. Airports across the country were forced to tweet out messages about the technical issues after receiving an overwhelming number of complaints. It can take months or years to repair the trust with your customers after an outage such as this one, and depending upon the criticality of the services, a company could go bankrupt. Despite all this, even the company isn't the biggest loser; it is the end user, and that is what the rest of this post will focus on.

Let's say you're a senior in college. It's spring term, and graduation is just one week away. Your school has an online system to submit assignments, which are due at midnight the day before finals week. Like most students at the school, you log into the online assignment submission module, just like you have always done. Except this time, you get a spinning wheel. Nothing will load. It must be your internet connection. You call a friend to have her submit your papers, but she can't log in either. The culprit: the system is down.

Now, it's 10:00 PM and you need to submit your math assignment before midnight. At 11:00 PM you start to panic. You can't log in, and neither can your classmates. Everyone is scrambling. You send a hastily written email to your professor explaining the issue. She is unforgiving, because you shouldn't have procrastinated in the first place. At 1:00 AM, you refresh the system and everything is working (slowly), but the deadlines have passed. The system won't let you submit anything. Your heart sinks as you realize that without that project, you will fail your math class and not be able to graduate.

This system outage caused heartache, stress and uncertainty for the students and teachers, along with a whole lot of pain for the administrators. The kicker is that the downtime happened when traffic was anticipated to be the highest! Of course the servers are going to be overloaded during the last week of spring term. Yet, notoriously, the university will send an email stating that it experienced higher than expected loads and that, ultimately, it wasn't prepared for them.

During this time, traffic was 15 times its normal usage, and the hypervisor hosting the NFS server and the file sharing system was flooded with requests. It blew a fan and eventually overheated. Sure, the data was still safe inside the SAN on the backend. However, none of that mattered, because the students couldn't access the data until the admin rebuilt the hypervisor. By the time the server was back up and running, the damage was done.

High Availability isn't a simple concept, but it is critical for your organization, your credibility, and even more importantly, for your end users or customers. In today's world, the bar for “uptime” is monstrously high; downtime is simply unacceptable.

If you're a student, an admin or a simple system user, I have a question for you (and don't just think about yourself; think about your boss, colleagues, and clients):

What would your day look like if your services went unresponsive right… NOW?!
Learn more about the costs and drivers of data loss, and how to avoid it, by reading the paper from OrionX Research.

 

Greg Eckert on Linkedin
Greg Eckert
In his role as the Director of Business Development for LINBIT America and Australia, Greg is responsible for building international relations, both in terms of technology and business collaboration. Since 2013, Greg has connected potential technology partners, collaborated with businesses in new territories, and explored opportunities for new joint ventures.