
How to make DRBD compatible with the Linux kernel

The Linux kernel’s interface towards device drivers is not set in stone. It evolves along with Linux itself, sometimes driven by hardware improvements, sometimes by general evolution of the code base. For in-tree drivers this is not a big issue, since they are updated whenever the interfaces are modified; this is why such updates are called “tree-wide changes”.

DRBD development happens out-of-tree before code gets sent upstream. We at LINBIT need to track these tree-wide changes and follow them in the DRBD device driver. But not all of our users run the same kernel.

Some stick strictly to the kernel that shipped with their distribution, running a kernel that is years behind Linus’ version of Linux.

That creates a problem: DRBD should be compatible with both years-old, carefully maintained “vendor kernels” and the latest and greatest Linus kernel.

In order to do that we have a kernel compatibility layer in DRBD. It consists of two main parts:

  1. Detecting the capabilities of the kernel we want to build against. The background is that a “vendor kernel” is not just a random old Linus kernel: it starts out as some release of the vanilla kernel, and the vendor then cherry-picks the upstream changes and fixes it deems relevant.
  2. The compatibility layer itself. Up to DRBD-9.0.19 this was a huge file containing many #ifdefs. It became a maintenance nightmare: hard to extend, hard to understand and debug, and hard to remove old compat code from. Everything was ugly.

Coccinelle is French for Ladybug

The Coccinelle project from INRIA provides a tool that applies semantic patches to source code, or expands semantic patches into conventional patches. A few of us DRBD developers practically fell in love with that tool. It allows us to express how code that is compatible with the upstream kernel needs to be changed in order to be compatible with some older version of the kernel.

This allows our upstream DRBD code to be clean Linux code, containing no compatibility hacks.

It also allows us to automatically transform DRBD to be compatible with arbitrary old kernels or vendor kernels. The result of the transformation is clean C code without confusing macros and #ifs. It is wonderful.
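To give a flavor of the approach, here is a minimal semantic patch in Coccinelle’s SmPL language. This is an illustrative sketch, not one of the rules DRBD actually ships: it rewrites every call of the modern one-parameter submit_bio() into the two-parameter form that kernels older than 4.8 expect (compare the submit_bio_has_2_params COMPAT test in the build output further below).

@@
// match any expression passed as the bio argument
expression bio;
@@
- submit_bio(bio)
+ submit_bio(bio->bi_rw, bio)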

The new kernel compatibility mechanism:

  1. Detect kernel capabilities (as before)
  2. Create a compat patch using spatch (from Coccinelle)
  3. Apply the compat patch and compile DRBD

Where there is light there must be shadow

The spatch tool is not available on all Linux distributions. For somewhat older kernels we even require a very recent version of spatch, which is even less available. The researchers at INRIA write the tool in OCaml, a programming language that is right for them and for the challenge, but not familiar to many in the open source community.

This complex build dependency makes building drbd-9.0.20 and later harder for community members than it used to be.

The shortcut through the maze

For a number of well-known vendor kernels (RHEL/CentOS, Ubuntu LTS, SLES, Oracle Linux, Amazon Linux) we include the complete compat patches in the distributed source tar.gz. During the DRBD build process, all the COMPAT tests are executed and a hash value is calculated from their results.

If it finds a pre-generated compat.patch for that hash value, the build process can continue without a call to spatch! Complex build dependency avoided!
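Conceptually, the shortcut boils down to something like this (a simplified shell sketch; the real logic lives in the DRBD build scripts, and exactly which files feed the hash is an assumption here):

#!/bin/bash
# Sketch of the pre-generated compat.patch lookup; not the actual build code.
# Fingerprint the COMPAT test results, here assumed to be collected in compat.h.
hash=$(md5sum drbd/compat.h | cut -d' ' -f1)
cached="drbd/drbd-kernel-compat/cocci_cache/$hash/compat.patch"
if [ -f "$cached" ]; then
    patch -d drbd -p0 < "$cached"   # shipped patch found: no spatch needed
else
    make -C drbd compat             # cache miss: generate the patch via spatch
fi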

The hard route through the maze

When you are building from a git checkout, or you are building for a kernel for which we did not include a pre-generated compat.patch, you need spatch.

If necessary you can run step 2 (using spatch) on a different machine than steps 1 (testing kernel capabilities) and 3 (compiling the DRBD kernel module).

Use ‘make’ to start the compilation process. If it fails just after this output:

 [...]
 COMPAT  sock_create_kern_has_five_parameters
 COMPAT  sock_ops_returns_addr_len
 COMPAT  submit_bio_has_2_params
 CHK     /home/phil/src/drbd09/drbd/compat.4.15.18.h
 UPD     /home/phil/src/drbd09/drbd/compat.4.15.18.h
 CHK     /home/phil/src/drbd09/drbd/compat.h
 UPD     /home/phil/src/drbd09/drbd/compat.h
 GENPATCHNAMES   4.15.0-48-generic
 SPATCH   27e10079afbff16b2b82fae9f7dbe676

Please take note of the hash value after “SPATCH”. It is like a fingerprint of all the results of the countless “COMPAT” tests that were executed just before.

Then you need to copy the results of the COMPAT tests to a machine/VM/container that has the same DRBD source tree and a recent spatch.

 

rsync -rv drbd/drbd-kernel-compat/cocci_cache/27e10079afbff16b2b82fae9f7dbe676 \
    user@build-host:src/drbd-9.0.20/drbd/drbd-kernel-compat/cocci_cache/

 

Then you run the spatch part of the build process there:

 

ssh user@build-host "make -C src/drbd-9.0.20/drbd compat"

 

After that you copy the resulting compat.patch back:

 

rsync -rv user@build-host:src/drbd-9.0.20/drbd/drbd-kernel-compat/cocci_cache/ \
    drbd/drbd-kernel-compat/cocci_cache/

 

Call ‘make’ again to restart the build process. If you did it right, it will find the generated compat.patch and finish the compilation.

Get a Ladybug

If you’d like to get a spatch that is recent enough for building the DRBD driver, use the docker container we published on Docker Hub: https://hub.docker.com/r/linbit/coccinelle.

 

docker pull linbit/coccinelle

 

Then put the following shell script under the name ‘spatch’ into your $PATH.

 

#!/bin/bash
docker run -it --rm -v "$PWD:$PWD" -w "$PWD" linbit/coccinelle:latest spatch "$@"
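If you place the script at, say, ~/bin/spatch (with ~/bin in your $PATH, as an example location), remember to make it executable with chmod +x ~/bin/spatch. The DRBD build will then transparently run spatch inside the container.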

DRBD compatible with the Linux kernel

All of this is great for making the code more readable, easier to understand, and less likely to contain bugs. And having the DRBD code free of backward-compatibility clutter is an important milestone on the path to getting DRBD-9 into Linus’ vanilla kernel and replacing drbd-8.4 with drbd-9 there.

 

Philipp Reisner on Linkedin
Philipp Reisner
Philipp Reisner is founder and CEO of LINBIT in Vienna/Austria. His professional career has been dominated by developing DRBD, a storage replication solution for Linux. Today he leads a company of about 30 employees with locations in Vienna/Austria and Portland/Oregon.

Performance Gains with DRBD 10

A key factor in evaluating storage systems is their performance. LINBIT has been working to further improve the performance of DRBD. The recent DRBD 10 alpha release demonstrates significant gains.

The performance gains particularly help with highly concurrent workloads. This is an area that has been steadily rising in importance and looks set to continue to rise: improvements in single-core speed appear to be stagnating, while the number of available cores keeps growing. Hence software systems need to utilize concurrency effectively to make the most of the computing resources.

We tested DRBD 10 with 4K random writes and various concurrency levels. In this test, the data is being replicated synchronously (“protocol C”) between two nodes. These numbers are for a single volume, not an aggregate over many volumes. I/O was generated by 8 processes. The tests show improvements in raw random write performance of up to 68%.

[Chart: drbd10 performance gains]

These improvements were achieved by using a finer-grained locking scheme. This allows, for instance, one core to be sending a request while a second core is submitting the next request. The result is better utilization of the available cores and overall higher throughput.

Technical details

The above tests were carried out on a pair of 16-core servers equipped with NVMe storage and a direct Ethernet connection. The software versions used were DRBD 10.0.0a1 and its most recent ancestor from the DRBD 9 branch (8e93a5d93b62). I/O was generated using the fio tool with the following parameters:

fio --name=test --rw=randwrite --direct=1 --numjobs=8 --ioengine=libaio --iodepth=$IODEPTH --bs=4k --time_based=1 --runtime=60 --size=48G --filename=/dev/drbd500

Ongoing development on DRBD 10

LINBIT is working on a number of exciting major features for DRBD 10.

  • Request forwarding. DRBD will send data to geographically distant sites only once; it will then be replicated within that site.
  • PMEM journaling. DRBD can already access its metadata in a PMEM optimized fashion. That will be extended to using a PMEM device as a write-back cache, resulting in improved performance in latency-sensitive scenarios.
  • Erasure coding. DRBD will be able to erasure code and distribute its data. This provides the same functionality as RAID5/6, but with an arbitrary number of parity nodes. The result is lower disk usage with similar redundancy characteristics.

Stable releases of DRBD 10 are planned for 2020 – until then stay tuned for upcoming updates!

 

Joel Colledge on Linkedin
Joel Colledge
Joel is a software developer at LINBIT with a background in mathematics. A polyglot programmer, Joel enjoys working with many different languages and technologies. At LINBIT, he has been involved in the development of LINSTOR and DRBD. Originally from England, Joel is now based in Vienna, Austria.


New Features of LINSTOR Release – July 2019

The newest LINSTOR release (July 2019) came with a bunch of new features, and one is really worth highlighting:

The developers of LINSTOR, the storage management tool for all things Linux, announced that the latest release comes with LDAP authentication. Software-defined storage consumers were demanding privilege authentication, so we made this a priority in July.

With support for basic LDAP authentication, you can configure an LDAP server and a search_filter to allow only members of a certain group access to LINSTOR. To accomplish this, here’s a sample configuration entry:

[ldap]
  enabled = true
  uri = "ldaps://ldap.example.com"
  dn = "uid={user},ou=users,o=ha,dc=example"
  search_base = "dc=example"
  search_filter = "(&(uid={user})(memberof=cn=linstor,ou=services,o=ha,dc=example))"

The `{user}` template variable will be replaced with the login user; a login of `alice`, for example, results in the bind DN `uid=alice,ou=users,o=ha,dc=example`.

Please note that LINSTOR must be configured with HTTPS in order to configure LDAP authentication. 

Now you can securely manage privileges of your storage clusters, so the antics of those pesky interns don’t keep you awake at night.

 

Greg Eckert on Linkedin
Greg Eckert
In his role as the Director of Business Development for LINBIT America and Australia, Greg is responsible for building international relations, both in terms of technology and business collaboration. Since 2013, Greg has connected potential technology partners, collaborated with businesses in new territories, and explored opportunities for new joint ventures.

LINBIT Announces DRBD Support for Amazon Linux

The Digital Transformation

The concept of “Digital Transformation” for executive teams at Fortune-sized companies is no longer a new and flashy phrase. An important part of this Digital Transformation is how companies think about cloud computing. Where once organizations seemed to have only two choices, enter the cloud or keep everything on premises, the options are now a bit more “cloudy” (pun intended).

In the digital transformation age, Fortune companies are looking at multi-cloud strategies. They understand that siloing data into one cloud provider decreases their flexibility and ability to negotiate discounts while increasing the risk of a provider outage affecting production workloads. When Fortune 1000 companies think about their multi-cloud strategies, they basically have three options:

  1. Keep some data on-prem and put some in the cloud
  2. Put data in different regions or zones within a single cloud provider
  3. Place data in many separate cloud providers

What’s great about all three is that companies can be dynamic about how they solve business goals, allocate budget, and provision resources. With this multi-cloud shift, some of the traditional technologies used in businesses need to adapt and change.

One of our Fortune 500 clients, who develops and sells financial, accounting, and tax preparation software, came to us because they were switching an OS installation from RHEL to Amazon Linux. Clearly, they are deep into their Digital Transformation journey, as this workload was already in the cloud. Changing both the OS and the automation toolchain of a cloud deployment this large is no easy feat.

As a small team, we pride ourselves on jumping high at client requests, and so within two weeks the work was done. The answer is “Yes. LINBIT now supports DRBD 9.0 on Amazon Linux.” As client demand changes, as workloads migrate to the cloud, and as containers gain traction, we are doing our best to stay dynamic by listening to community and client feedback.

With millions of downloads, we rely on clients and the open-source community users to tell us what they want. If you haven’t been following our progress, this means we are thinking about how to improve performance for Linux High Availability Clusters and Disaster Recovery clusters for traditional workloads on hardware like NVMe and Optane, while also looking into kernel technology’s role in Kubernetes environments in conjunction with public and private cloud environments. What challenges exist here that didn’t before? What do users want? These are the questions that drive our development.

So DRBD users: we’re here. We’re listening. Feel free to chime in on the community IRC (#drbd on freenode) and mailing list, respond in the comments here, ask questions about our YouTube videos… and let’s ensure that open source continues to drive innovation as the commercial giants decide which technologies to choose in their 5-year technology goals.

Greg Eckert on Linkedin
Greg Eckert
In his role as the Director of Business Development for LINBIT America and Australia, Greg is responsible for building international relations, both in terms of technology and business collaboration. Since 2013, Greg has connected potential technology partners, collaborated with businesses in new territories, and explored opportunities for new joint ventures.

Kubernetes Operator: LINSTOR’s Little Helper

Before we describe what our LINSTOR Operator does, it is a good idea to discuss what a Kubernetes Operator actually is. If you are already familiar with Kubernetes Operators, feel free to skip the introduction.

Introduction

CoreOS describes Operators like this:

An Operator is a method of packaging, deploying and managing a Kubernetes application. A Kubernetes application is an application that is both deployed on Kubernetes and managed using the Kubernetes APIs and kubectl tooling.

That is quite a lot to grasp if you are new to the concept of Operators. Users want to get their work done without having to worry about setting up infrastructure. Still, there is more to a software lifecycle than just firing up the application in a cluster once. This is where an Operator comes into play. In my opinion, a good analogy is to think of a Kubernetes Operator as an actual human operator. So what would the responsibilities of such a human operator be?

A human operator would be an expert in the business logic of the software that she runs and in the dependencies that need to be fulfilled to run it. Additionally, the operator would be responsible for configuring the software: scaling it, upgrading it, making backups, and so on. This is also the responsibility of a Kubernetes Operator implemented in software. It is a software component built by experts for a particular piece of containerized software, and it is executed by an administrator who isn’t necessarily an expert in that particular software.

A Kubernetes Operator is executed in the Kubernetes cluster itself. In contrast to shell scripts or Ansible playbooks, which are pretty generic, an Operator has one specific purpose, as well as access to Kubernetes cluster information. Additionally, Operators are managed by standard Kubernetes tools, not by external configuration management tools, and an Operator has a lifecycle of its own, handled by the lifecycle manager.

LINSTOR Operator

The CoreOS FAQ contains the following sentence:

Experience has shown that the creation of an Operator typically starts by automating an application’s installation and self-service provisioning capabilities, and then evolves to take on more complex automation.

This pretty much sums up the current state of the linstor-operator. The project is still very young, and the focus so far has been on automating the setup of the LINSTOR cluster. If you are familiar with LINSTOR, you know that there is a central component, the LINSTOR controller, and workers, named LINSTOR satellites, which actually create the LVM volumes and configure DRBD to provide data redundancy.

Currently, the Operator can add new nodes/kubelets to the LINSTOR cluster by registering them, with their name and network interface configuration, with the LINSTOR controller. A second important task is ensuring that a LINSTOR satellite can actually provide storage to the cluster, so the linstor-operator can also register one or multiple storage pools. Metrics for the satellites’ storage pools are exposed by the Operator as well. This saves the system administrator a lot of time, because the LINSTOR Kubernetes Operator automates many standard tasks that previously had to be performed manually.

Future work

There is a lot of possible future work that can be done. Some tasks are obvious; some will be driven by actual input from our users. For example, we think that configuring storage is one of the pain points for our users, so having a sidecar container that can discover and prepare storage pools to be consumed by LINSTOR might be a good idea. In a dynamic environment such as Kubernetes it might be worthwhile to handle node failures in clever ways. We also have container images that can inject the DRBD kernel module into the running host kernel, which could help users get started with DRBD. High availability is always an important topic, and related to that is using etcd as the database backend. Further, we want to tackle one of the core tasks of Operators, which is managing upgrades from one LINSTOR controller version to the next.

Thanks to Hayley for reviewing this blog post while being busy doing the actual work. This is just a subset of the capabilities we plan for the linstor-operator. Stay tuned for more information!

 

Roland Kammerer
Software Engineer at Linbit
Roland Kammerer studied technical computer science at the Vienna University of Technology and graduated with distinction. Currently, he is a PhD candidate with a research focus on time-triggered realtime-systems and works for LINBIT in the DRBD development team.

NuoDB on LINSTOR-Provisioned Kubernetes Volumes

Introduction:

NuoDB and LINBIT put our technologies together to see just how well they performed, and we are both happy with the results. We decided to compare LINSTOR provisioned volumes against Kubernetes Hostpath (Direct Attached Storage) in a Kubernetes cluster hosted in Google’s cloud platform (GCP) to show that our on-prem testing results can also be proven in a popular cloud-computing environment.

Background:

NuoDB is an ANSI SQL standard and ACID transactional compliant container-native distributed OLTP database that provides responsive scalability and continuous availability. This makes it a great choice for your distributed applications running in cloud provider-managed and open source Kubernetes environments, such as GKE, EKS, AKS and Red Hat OpenShift. As you scale-out NuoDB Transaction Engines in your cluster, you’re scaling out the database’s capacity to process dramatically more SQL transactions per second, and at the same time, building in process redundancy to ensure the database — and applications — are always on.

As you scale your database, you also need to scale the storage that the database is using to persist its data. This is usually where things get sluggish. Highly-scalable storage isn’t always highly-performant, and it seems most of the time the opposite is true. Highly-scalable, highly-performant storage is the niche that LINSTOR aims to fill.

LINSTOR, LINBIT’s SDS software, can be used to deploy DRBD devices in large scale storage clusters. DRBD devices are expected to be about as fast as the backing disk they were carved from, or as fast as the network device DRBD is replicating over (if DRBD’s replication is enabled). At LINBIT we usually aim for a performance impact of less than 5% when using DRBD replication in synchronous mode.

The LINSTOR CSI (container storage interface) driver for Kubernetes allows you to dynamically provision LINSTOR block devices as persistent volumes for your container workloads… you see where I’m going… 🙂

Testing:

I spun up a 3-node GKE (Google Kubernetes Engine) cluster in GCP and customized the standard node type with 6 vCPUs and 22GB of memory for each node:

[Screenshot: gke-nodes]

When using GKE to spin up a Kubernetes cluster, you’re provided with a “standard” storage class by default. This “standard” storage class dynamically provisions and attaches GCE standard disks to your containers that need persistent volumes. Those GCE standard disks are the pseudo “hostpath” device we wanted to compare against, so we deployed NuoDB into the cluster, and ran a YCSB (Yahoo Cloud Serving Benchmark) SQL workload against it to generate our baseline:

[Screenshot: gke-insights-overview]

Using the NuoDB Insights visual monitoring tool (which comes as standard equipment with NuoDB), we can see in the chart above that we had 3 TE (Transaction Engine) pods feeding into 1 SM (Storage Manager) pod. We can also see that our Aggregate Transaction Rate (TPS) is hovering just over 15K transactions per second. As a side note, this deployment created 5 GCE standard disks in my Google Cloud Engine account.

LINSTOR provisions its storage from an established LINSTOR cluster, so for our LINSTOR comparison I had to stand up Kubernetes on GCE nodes “the hard way”, so that I could also stand up a LINSTOR cluster on the nodes (see LINBIT’s user’s guide or LINSTOR quickstart for more on those steps). I created 4 nodes as VM instances in GCE. 3 nodes were set up to mimic the GKE cluster, each with 6 vCPUs and 22GB of memory, plus 1 master node, with the master-node taint set in Kubernetes so no pods would be scheduled on it, with 2 vCPUs and 16GB of memory. Google recommended I scale these nodes back to save money, so I did, resulting in the following VM instances:

[Screenshot: gce-nodes]

After setting up the LINSTOR and Kubernetes cluster in the GCE VM Instances, I attached a single “standard” GCE disk to each node for LINSTOR to provision persistent volumes from, and deployed the same NuoDB distributed database stack and YCSB workload into the cluster:

[Screenshot: gce-insights-overview]

After letting the benchmarks run for some time, I could see that we were hovering just under 15K, which is within the expected 5% of our ~15K baseline!

Conclusion:

You might be thinking, “That’s good and all, but why not just use GKE with the GCE-backed ‘standard’ storage class?” The answer is features. Using LINSTOR to provide storage to your container platform enables you to:

  • Add replicas of volumes for resiliency at the storage layer – including remote replicas
  • Use replicas of your volumes in DRBD’s read balancing policies which could increase your read speeds beyond what’s possible from a single volume
  • Provide granular control of snapshots at either the Kubernetes or LINSTOR-level
  • Provide the ability to clone volumes from snapshots
  • Enable transparently encrypted volumes
  • Provide data-locality or accessibility policies
  • Lower managerial overhead in terms of the number of physical disks (comparing one GCE disk for each PV with GKE vs. one GCE disk for each storage node with LINSTOR).

Ultimately, the combination of NuoDB and LINSTOR enables clients to run high-performance persistent databases in the cloud or on premise with ease-of scale and “always-on” resiliency. So far, after testing both proprietary and open-source software, NuoDB has found that LINSTOR’s open-source SDS is a production-ready, high-performance, and highly reliable storage solution to provision persistent volumes.

 

Matt Kereczman on Linkedin
Matt Kereczman
Matt is a Linux Cluster Engineer at LINBIT with a long history of Linux System Administration and Linux System Engineering. Matt is a cornerstone in LINBIT’s support team, and plays an important role in making LINBIT’s support great. Matt was President of the GNU/Linux Club at Northampton Area Community College prior to graduating with Honors from Pennsylvania College of Technology with a BS in Information Security. Open Source Software and Hardware are at the core of most of Matt’s hobbies.

Coming Soon, a New DRBD Proxy Release

One major part of LINBIT Disaster Recovery is DRBD Proxy, which helps DRBD with long-distance real-time replication. DRBD Proxy mitigates bandwidth, latency, and distance issues by buffering writes into memory, ensuring that your WAN latency doesn’t become your disk throughput.

The upcoming release of DRBD Proxy will come with a few new tools to improve data replication with compression. Its LZ4 plugin has been updated to the latest version, 1.9.0, and the Zstandard algorithm has been added as a brand-new plugin.

Both offer a great balance of compression ratio and speed while delivering higher replication performance on the DRBD end. In our test cases, both performed considerably better in overall read and write operations than no compression at all.

Here’s a short synopsis of some of the tests we ran. For this setup, we built a two-node, geographically separated DRBD cluster. Both nodes ran the latest yet-to-be-released version of DRBD Proxy for various I/O tests. The compression level for Zstandard was 3, the default on its scale of 0 to 22. LZ4 was set to level 9, the maximum level.
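For orientation, enabling such a plugin could look roughly like this in the proxy section of a DRBD resource configuration. This is a hypothetical sketch modeled on DRBD Proxy’s existing plugin configuration syntax; check the release documentation for the exact option names:

resource r0 {
    net {
        protocol A;        # DRBD Proxy setups replicate asynchronously
    }
    proxy {
        memlimit 512M;     # buffer for absorbing write bursts
        plugin {
            zstd level 3;  # or, e.g.: lz4 level 9;
        }
    }
    # hosts, volumes, and the proxy 'inside'/'outside' addresses omitted
}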

MySQL Read Write Operations with Sysbench

In this scenario, we used sysbench to perform random reads and writes to a MySQL database replicated on both nodes with DRBD Proxy and DRBD. Sysbench created a random database mounted on a 200MB DRBD volume with Thin LVM backing. Then it performed random transactions for 100 seconds.

The improved number of writes and overall transactions with compression is pretty clear compared to the ‘Proxy Only’ numbers. Interestingly, LZ4 and Zstandard both performed quite similarly.

MySQL Average Latency on MySQL RW Tests

The average latency from the same MySQL tests showed another interesting fact. When using DRBD Proxy, DRBD uses protocol A, an asynchronous mode. In the test, this setup performed quite nicely compared to replicating with protocol C, the default synchronous mode. All three proxy configurations, regardless of compression, performed very well against synchronous mode. The different modes of DRBD transport are explained here.

Other random IO tests performed with sysbench on the file system as well as fio tests at the block level mirrored the results shown above, where compression with proxy helped greatly with reducing the network payload while increasing overall read/write performance.

This was a quick preview of the upcoming DRBD Proxy release highlighting its compression plugins. Please stay tuned for the release announcement, and for any questions or comments, feel free to reach me in the comments below.

P.S. The test nodes were configured relatively lightly. The local node was a 4-core VM with 1GB of RAM running Ubuntu 18.04 and DRBD v9.0.18. The remote node was a 4-core VM with 4GB of RAM running the same OS and DRBD version. The WAN link was restricted to 2MB/s. The relevant sysbench commands were:

sysbench /usr/share/sysbench/oltp_read_write.lua --mysql-db=sbtest --db-driver=mysql --tables=10 --table-size=1000 prepare
sysbench /usr/share/sysbench/oltp_read_write.lua --mysql-db=sbtest --db-driver=mysql --tables=10 --table-size=1000 --report-interval=10 --threads=4 --time=100 run
Woojay Poynter
IO Plumber
Woojay is working on data replication and software-defined storage with LINSTOR, built on DRBD @LINBIT. He has worked on web development, embedded firmware, professional culinary education, and power carving in ice and wood. He is a proud father and likes to play with Legos.

DRBD/LINSTOR vs Ceph – a technical comparison

INTRODUCTION

The aim of this article is to give you some insight into CEPH, DRBD and LINSTOR by outlining their basic functions. The following points should help you compare these products and understand which is the right solution for your system. Before we start, you should be aware that LINSTOR is made for DRBD, and it is highly recommended to use LINSTOR if you are using DRBD.

DRBD

DRBD works by inserting a thin layer in between the file system, the buffer cache, and the disk driver. The DRBD kernel module captures all requests from the file system and splits them into two paths. So, how does the actual communication occur? How do two separate servers optimize data protection?

DRBD facilitates communication by mirroring two separate servers: one server, although passive, is usually a direct copy of the other. Any data written to the primary server is simultaneously copied to the secondary server through a real-time communication system, so the passive server immediately replicates any changes made to the data.

DRBD 8.x works on two nodes at a time. One is given the role of the primary node while the other is given a secondary role. Reads and writes can only occur on the primary node.
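For illustration, a manual failover on such a two-node cluster boils down to a demote/promote pair (the resource name r0 and the mount point are placeholders):

# demote on the old primary, if it is still reachable
drbdadm secondary r0
# promote the peer; reads and writes are now allowed there
drbdadm primary r0
# the file system may only be mounted on the Primary
mount /dev/drbd0 /mnt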

THE BENEFITS OF DRBD 9

The features of DRBD 9.x are a vast improvement over the 8.x version. It is now possible to have up to 32 replicas, including the primary node. This gives you the ability to build your cluster setup with what we call diskless nodes, meaning you don’t have to use storage on your primary node. The primary node in diskless mode still has a DRBD block device, but the data is accessed on the secondary nodes over the network.

The secondary nodes must not mount the file system, not even in read-only mode. While it is true to say that the secondary nodes see all updates on the primary node, they can’t expose these updates to the file system, as DRBD is completely file system agnostic.

One write goes to the actual disk and another to the mirrored disk on a peer node. If the first node fails, the file system can be brought up on one of the opposing nodes and the data will be available for use.

DRBD has no precise knowledge of the file system and, as such, it has no way of communicating the changes upstream to the file system driver. The two-at-a-time rule does not actually limit DRBD from operating on more than two nodes.

Moreover, DRBD-9.x supports multiple peer nodes, meaning one peer might be a synchronous mirror in the local data-center while another secondary might be an asynchronous mirror in a remote site.

Again, the passive server only becomes functional when the primary one fails. When such a failure occurs, Pacemaker immediately recognizes the mishap and shifts to the secondary server. This shifting process is nevertheless optional; it can be either manual or automatic. Users who prefer manual failover are required to authorize the system to shift to the passive server when the primary one fails.

LINSTOR

In larger IT infrastructures, cluster management software is state of the art, which is why LINBIT developed LINSTOR, a software layer on top of DRBD. DRBD itself is a perfect tool to replicate and access your data, especially when it comes to performance, and LINSTOR makes configuring DRBD on a system with more than a few nodes an easy task. LINSTOR manages DRBD and gives you the ability to set it up on a large system.

LINSTOR uses a controller service for managing your cluster and a satellite service which runs on every node for deploying DRBD. The controller can be accessed from every node and enables you to monitor and configure your structure quickly. It can be controlled over REST from the outside and provides a very clear CLI. Furthermore, the LINSTOR REST-API gives you the ability to use LINSTOR volumes in Kubernetes, Proxmox VE, OpenNebula and Openstack.
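To give an idea of the workflow, a first setup could look roughly like this on the CLI (node names, addresses, pool and resource names are placeholders; check `linstor --help` for the exact syntax of your version):

# register two satellite nodes with the controller
linstor node create alpha 192.168.0.10
linstor node create bravo 192.168.0.11
# tell LINSTOR which LVM volume group to use on each node
linstor storage-pool create lvm alpha pool0 vg0
linstor storage-pool create lvm bravo pool0 vg0
# define a resource with one 10GiB volume and let LINSTOR place two replicas
linstor resource-definition create demo
linstor volume-definition create demo 10G
linstor resource create demo --auto-place 2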

LINSTOR has a feature that keeps the system maintainable in production: the separation of control plane and data plane. If you want to upgrade or maintain LINSTOR, there is no downtime for the volumes. In comparison with Ceph, DRBD and LINSTOR are easier to troubleshoot, recover, repair, debug, and manually intervene in if required, mainly due to their simplicity. For sysadmins, the better maintainability and a less complex environment can be crucial, and the higher availability also results in better reliability. For instance, DRBD can be started and stopped manually even if LINSTOR is offline, or, for recovery purposes, even without DRBD installed (simply mount the backend storage). Compared to that, trying to find any of your data on disks managed by Ceph can be quite a challenge if your Ceph system is down.

In summary, if you’re looking for increased performance, fast configuration, and filesystem-based storage for your applications, use LINSTOR and DRBD. If you’re looking to run LINSTOR with HA, however, you must use third-party software such as Pacemaker.

 

CEPH

CEPH is open source software intended to provide highly scalable object, block, and file-based storage in a unified system.

CEPH consists of a RADOS cluster and its interfaces. The RADOS cluster is a system with services for monitoring and storing data across many nodes. CEPH/RADOS is an object storage cluster with no single point of failure: an algorithm cuts the data into blocks and spreads them across the RADOS cluster using self-managing services. The CRUSH algorithm is used to spread the data on upload and to put the blocks back together when an object is requested. CEPH is able to use simple data replication as well as erasure coding for those striped blocks.

On top of the RADOS cluster, LIBRADOS is used to upload or request data from the cluster. CEPH uses LIBRADOS for interfaces CEPHFS, RBD and RADOSGW.

CEPHFS gives you the ability to create a filesystem on a host where the data is stored in the CEPH cluster. Additionally, for using CEPHFS, CEPH needs metadata servers which manage the metadata and balance the load for requests among each other.

RBD, or RADOS block device, is used for creating virtual block devices on hosts with a CEPH cluster, managing and storing the data in the background. Since RBD is built on LIBRADOS, RBD inherits LIBRADOS’s abilities, including read-only snapshots and revert-to-snapshot. By striping images across the cluster, CEPH improves read-access performance for large block device images. The block device can be virtualized, providing block storage to virtual machines in virtualization platforms such as Apache CloudStack, OpenStack, OpenNebula, Ganeti, and Proxmox Virtual Environment.

RADOSGW is the REST-API for communicating with CEPH/RADOS when uploading and requesting data from the cluster.

In general, CEPH is an object storage cluster with the advantage that you do not have to worry about failing nodes or storage drives, because CEPH recognizes failing devices and instantly replicates the data to another disk, where it will be accessed. This also leads to a heavy network load when devices fail.

Striping data comes with a disadvantage in that it is not possible to access the data on a storage drive by mounting it somewhere else or without a working CEPH cluster.

In conclusion, CEPH is the right solution if you are looking for object storage in your infrastructure. Due to its complexity, you have to expect less performance in comparison to DRBD, which is limited only by your network speed.

Daniel Kaltenböck on Email
Daniel Kaltenböck
Software Engineer at LINBIT HA Solutions GmbH
Daniel Kaltenböck studied technical computer science at the Vienna University of Technology. He is a software engineer at heart with a special focus on software-defined storage.

Cheap Votes: DRBD Diskless Quorum

One of the most important considerations when implementing clustered systems is ensuring that a cluster remains cohesive and stable given unexpected conditions. DRBD already has fencing mechanisms and even a system of quorum, which is now capable of using a diskless arbitrator to break ties without requiring additional storage beyond that of two nodes.

Quorum and Fencing With a Healthy Dose of Reality

DRBD’s quorum implementation allows resources to vote on availability, taking into account connection state and disk state. While a DRBD cluster without quorum will allow promotion and writes on any node with “UpToDate” data, DRBD with quorum enabled adds the requirement that this node must also be in contact with either a majority of healthy nodes in the cluster, or a minimum amount of nodes as defined statically. This requires at least three nodes, and works best with odd numbers of nodes. A DRBD cluster with quorum enabled cannot become split-brain.

Fencing on the other hand, employs a mechanism to ensure node state by isolating or powering off a node in some way so that unhealthy nodes can be guaranteed to not provide services (by virtue of being assuredly offline). While the use case for fencing and quorum overlap by a large degree, fencing can automatically eject or recover misbehaving nodes, while quorum simply ensures that they cannot modify data.

It is possible to utilize scripts that are triggered in response to changes in quorum as a simple but effective fencing system via a “suicide” method — configuring a node to automatically reset or power itself off upon loss of quorum (accomplished via the “on-quorum-loss” handler in DRBD’s configuration). However, fully-fledged fencing methods via Pacemaker have much more logic behind them, can work even when the node to be fenced is entirely unresponsive, and make Pacemaker clusters “aware” of fencing actions.
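As a sketch, such a self-fencing setup could look like this in drbd.conf. Note that recent drbd.conf man pages name the handler quorum-lost; verify the option names for your DRBD version:

resource r0 {
    options {
        quorum majority;    # enable DRBD's quorum mechanism
    }
    handlers {
        # reset the node immediately if it is Primary and loses quorum
        quorum-lost "echo b > /proc/sysrq-trigger";
    }
}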

The most important element to consider is that while both methods prevent split-brain conditions, quorum does not wholly and entirely replace out-of-band fencing. However, it comes extremely close; in fact, close enough to eschew Pacemaker-based fencing in many configurations in favor of quorum alone, where fencing via privileged APIs (as is common in clouds) or dedicated fencing hardware (such as network PDUs or IPMI cards) is impossible or not desired.

Arbitrators!

Before now, in order for a DRBD resource to have three votes across three nodes for quorum, it needed three replicas of data. This was cost-prohibitive in some scenarios, so additional logic was added to allow a diskless “arbitrator” node that does not participate in replication. Thus, the diskless DRBD arbitrator was born.

The concept is fairly simple; rather than require a minimum of three replicas in a DRBD resource to enable quorum functionality, one can now use two replicas (or “data” nodes) with a third DRBD node in a permanently and intentionally diskless state as an “arbitrator” for breaking ties.

The same concepts of traditional DRBD quorum apply, with one significant exception: In a replica 2+A cluster, one node can be lost or disconnected without losing quorum — just like a replica 3 cluster. However, that arbitrator node cannot (on its own) participate in restoring quorum after it is lost.

The reason for this exception is simple: the arbitrator node has no disk. Without a disk, there is no way to independently determine whether data is valid, inconsistent, or related to the cluster at all, because there is no data on that node to compare replicas with. While an arbitrator node cannot restore quorum to a single other inquorate data node, two data nodes may establish or re-establish quorum with each other. This is highly effective, and covers the vast majority of quorum decisions at roughly 66% of the cost of a replica 3 cluster.

Arbitrator Nodes in Action

I will not abide this level of grandstanding without a demonstration of this ability (and hopefully some revealing use cases), so below are some brief test results from a replica 2+A geo cluster. Behold:

root@geo-nfs-a:~# drbdadm status export-able
export-able role:Primary
  disk:UpToDate
  geo-nfs-b role:Secondary
    peer-disk:UpToDate
  geo-nfs-c role:Secondary
    peer-disk:Diskless

As you can see, everything is happy. All of these nodes are connected and up to date. Nodes “geo-nfs-a” and “geo-nfs-b” are data nodes with disks. The node “geo-nfs-c” is a diskless DRBD arbitrator as well as a Booth arbitrator, and quorum has been enabled in this geo cluster (though that’s not reflected in this output). Geo clusters can be tricky to manage the datapath of, since they often operate outside of the scope of rapid decision-making mechanisms and even more often don’t have a method of fencing “sites” adequately. Using DRBD quorum in this case allows split-brains to be entirely prevented globally, rather than depending on several disconnected cluster controllers to manage things. This is much more stable, but requiring three sites with at least one full data replica each is very bandwidth-intensive as well as expensive. This is a perfect fit for an arbitrator node.

If we take one of the two data nodes offline, the cluster will still run. We’re still in contact with the arbitrator, and as long as we don’t lose that contact, quorum will be held:

root@geo-nfs-a:~# drbdadm status export-able
export-able role:Primary
  disk:UpToDate
  geo-nfs-b connection:Connecting
  geo-nfs-c role:Secondary
    peer-disk:Diskless

So let’s make it unhappy. If we take the majority of nodes offline this cluster will freeze, suspending I/O and protecting data from split-brain:

root@geo-nfs-a:~# drbdadm status export-able
export-able role:Primary suspended:quorum
  disk:UpToDate quorum:no blocked:upper
  geo-nfs-b connection:Connecting
  geo-nfs-c connection:Connecting

Reconnecting only the arbitrator node will not result in a quorate cluster, as that arbitrator has no way of knowing whether that data node is actually valid:

root@geo-nfs-a:~# drbdadm status export-able
export-able role:Primary suspended:quorum
  disk:UpToDate quorum:no blocked:upper
  geo-nfs-b connection:Connecting
  geo-nfs-c role:Secondary
    peer-disk:Diskless

Connecting the peer data node will result in I/O resuming even if the arbitrator is still not functioning:

root@geo-nfs-a:~# drbdadm status export-able
export-able role:Primary
  disk:UpToDate
  geo-nfs-b-0 role:Secondary
    peer-disk:UpToDate
  geo-nfs-c connection:Connecting

Conclusion

I was able to use a Booth arbitrator node as a DRBD arbitrator node as well, both managing the cluster application state as well as securing the datapath against corruption with almost zero bandwidth usage beyond that of a 2N system. This is clearly a potent use-case and could not be more simple.

This new quorum mechanism could be applied identically to local high availability clusters, allowing reliable quorate systems to be established using a very low power third node. This can help to cheaply circumvent environmental problems that prevent adequate fencing, such as generic platform-agnostic deployment models, security-restricted environments, and even total lack of out-of-band fencing mechanisms (such as some public clouds or specialized hardware).

For posterity, the following DRBD configuration was used to accomplish this. Keep in mind, this was a geo cluster, so it’s using asynchronous replication (protocol A). Protocol C would be used for synchronous local replication:

# /etc/drbd.conf
global {
    usage-count yes;
}
common {
    options {
        auto-promote     yes;
        quorum           majority;
    }
}
resource export-able {
    volume 0 {
        device           minor 0;
        disk             /dev/drbdpool/export-able;
        meta-disk        internal;
    }
    on geo-nfs-a {
        node-id 0;
        address          ipv4 10.1.0.100:7000;
    }
    on geo-nfs-b {
        node-id 1;
        address          ipv4 10.2.0.100:7000;
    }
    on geo-nfs-c {
        node-id 2;
        volume 0 {
            device       minor 0;
            disk         none;
        }
        address          ipv4 10.3.0.100:7000;
    }
    connection-mesh {
        hosts geo-nfs-a geo-nfs-b geo-nfs-c;
        net {
            protocol A;
        }
    }
}

David Hay on Linkedin
David Hay
Cluster Daemon at LINBIT
A long-time Linux system engineer, David Hay finds FOSS solutions to global problems as a Cluster Engineer at LINBIT. David started out with open source software back in the Linux 2.4 days, since then having planned and implemented countless clustered systems, leveraging HA and cloud technologies to great effect. When not liberating the enterprise world with free and open software, he spends his time tinkering with electronics and metalworking.

Key/Value Store in LINSTOR

Recently we introduced a Key/Value store in LINSTOR and exposed it in a developer-friendly way in the Python API (python-linstor). The first question is why one would want such a Key/Value store in LINSTOR when there are many high-performance implementations such as etcd. The request for a K/V store was mainly driven by LINSTOR plugin developers. For example, many plugins need to store some kind of metadata, like a description for a resource. Existing, non-LINSTOR plugins sometimes store such information in a local JSON file, or in a file per resource. On one hand this is clumsy, and on the other hand, in a distributed system like DRBD/LINSTOR, the data needs to be available on all nodes.

In LINSTOR a K/V store has a unique name (e.g., one per plugin) and can store up to 510 bytes for a key and 4096 bytes for a value. The implementation in python-linstor provides an interface that mimics a Python 3 dictionary. In addition to the unique name, the K/V store as implemented in the Python library also provides so-called namespaces. One can think of a namespace as a UNIX directory structure, where the components of a path (i.e., the namespace) are separated by a /. The following shows an example using the Python library:

import linstor
kv = linstor.KV('myKV', namespace='/foo/bar/')
kv['key'] = 'val'
list(kv.items())       # -> [('key', 'val')]
kv.namespace = '/'
list(kv.items())       # -> [('/foo/bar/key', 'val')]
kv['foo/baz/key'] = 'valbaz'
kv.namespace = '/foo/bar'
list(kv.items())       # -> [('key', 'val')]  (keys in /foo/baz not visible)
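For a plugin, this maps naturally onto per-resource metadata. A small hypothetical example (the store name, namespace layout, and keys are invented for illustration):

import linstor

# one K/V store for the plugin, one namespace per LINSTOR resource
kv = linstor.KV('my-backup-plugin', namespace='/resources/res0/')
kv['description'] = 'PostgreSQL data volume'
kv['last-backup'] = '2019-07-29T02:00:00Z'

# later, possibly on another node, read the metadata back
kv = linstor.KV('my-backup-plugin', namespace='/resources/res0/')
print(kv['description'])  # PostgreSQL data volume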

Key/Value Store makes life easier

Developers already familiar with LINSTOR details might know there is a concept that sounds similar to what the K/V store can do, the so-called “AUX props”. One can attach metadata to basically every LINSTOR object. While they sound similar, there are noteworthy differences:

• An AUX prop is tied to the according object. When the object is gone, the meta data is gone. This might be desired and can be an advantage.

• The K/V store exists as long as the LINSTOR cluster exists. Data is not attached to another LINSTOR object. Depending on the situation this might be an advantage compared to a plain AUX property.

• The K/V store has a much nicer interface. It just behaves like a Python dictionary.

• The K/V store and its namespace implementation make it a lot easier to store hierarchical data.

• Searching AUX props can be difficult: for example, to find a specific AUX prop set on a volume definition, one would have to iterate over the AUX props of every volume definition.

All in all the K/V store makes the life of a plugin developer much easier. BTW: The text of this blog post easily fits into a single K/V pair 😀

Roland Kammerer
Software Engineer at Linbit
Roland Kammerer studied technical computer science at the Vienna University of Technology and graduated with distinction. Currently, he is a PhD candidate with a research focus on time-triggered realtime-systems and works for LINBIT in the DRBD development team.