DRBD/LINSTOR vs Ceph – a technical comparison

INTRODUCTION

The aim of this article is to give you some insight into CEPH, DRBD and LINSTOR by outlining their basic functions. The following points should help you compare these products and understand which is the right solution for your system. Before we start, you should be aware that LINSTOR is made for DRBD, and that using LINSTOR is highly recommended if you are also using DRBD.

DRBD

DRBD works by inserting a thin layer between the file system (and the buffer cache) and the disk driver. The DRBD kernel module captures all requests from the file system and splits them into two paths. So, how does the actual communication occur? How do two separate servers optimize data protection?

DRBD facilitates communication by mirroring two separate servers. One server, although passive, is usually a direct copy of the other. Any data written to the primary server is simultaneously copied to the secondary server through a real-time communication system. The passive server immediately replicates any changes made in the data.

DRBD 8.x works on two nodes at a time. One is given the role of the primary node while the other is given a secondary role. Reads and writes can only occur on the primary node.

THE BENEFITS OF DRBD 9

The features of DRBD 9.x are a vast improvement over the 8.x version. It is now possible to have up to 32 replicas, including the primary node. This gives you the ability to build your cluster setup with what we call diskless nodes, meaning you don’t have to use storage on your primary node. The primary node in diskless mode still has a DRBD block device, but the data is accessed on the secondary nodes over the network.

The secondary nodes must not mount the file system, not even in read-only mode. While it is true to say that the secondary nodes see all updates on the primary node, they can’t expose these updates to the file system, as DRBD is completely file system agnostic.

Of the two write paths, one goes to the local disk and the other to the mirrored disk on a peer node. If the primary node fails, the file system can be brought up on one of the peer nodes and the data will be available for use.

DRBD has no precise knowledge of the file system and, as such, it has no way of communicating the changes upstream to the file system driver. The two-at-a-time rule does not actually limit DRBD from operating on more than two nodes.

Moreover, DRBD 9.x supports multiple peer nodes, meaning one peer might be a synchronous mirror in the local data center while another secondary might be an asynchronous mirror in a remote site.
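The difference between a synchronous and an asynchronous peer can be illustrated with a toy model. This is only a sketch and not how DRBD is implemented: the classes Peer and Primary and all of their methods are hypothetical names invented for this illustration, and the real work happens inside the kernel module.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    """Hypothetical stand-in for a secondary node."""
    name: str
    synchronous: bool
    blocks: dict = field(default_factory=dict)      # data that reached the peer
    in_flight: list = field(default_factory=list)   # writes queued for an async peer

    def drain(self) -> None:
        """Apply queued writes in order, as an asynchronous link eventually would."""
        for offset, data in self.in_flight:
            self.blocks[offset] = data
        self.in_flight.clear()


@dataclass
class Primary:
    peers: list
    blocks: dict = field(default_factory=dict)

    def write(self, offset: int, data: bytes) -> None:
        """Write locally, then mirror to every peer according to its mode."""
        self.blocks[offset] = data
        for peer in self.peers:
            if peer.synchronous:
                # Synchronous mirror: the peer has the data before write() returns.
                peer.blocks[offset] = data
            else:
                # Asynchronous mirror: the data is only queued for later delivery.
                peer.in_flight.append((offset, data))


local = Peer("local-dc", synchronous=True)
remote = Peer("remote-site", synchronous=False)
primary = Primary(peers=[local, remote])
primary.write(0, b"hello")
print(local.blocks)   # {0: b'hello'}
print(remote.blocks)  # {} until remote.drain() is called
```

The point of the sketch is the trade-off: the synchronous peer never lags behind, while the asynchronous peer may miss the most recent writes if the primary site is lost before the queue is drained.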

Again, the passive server only becomes active when the primary one fails. When such a failure occurs, Pacemaker immediately recognizes the mishap and shifts to the secondary server. This failover can be either manual or automatic. Users who prefer manual failover must authorize the shift to the passive server themselves when the primary one fails.

LINSTOR

In larger IT infrastructures, cluster management software is state of the art. This is why LINBIT developed LINSTOR, software that sits on top of DRBD. DRBD itself is a perfect tool to replicate and access your data, especially when it comes to performance. LINSTOR makes configuring DRBD on a system with more than a few nodes an easy task. LINSTOR manages DRBD and gives you the ability to set it up on a large system.

LINSTOR uses a controller service for managing your cluster and a satellite service which runs on every node for deploying DRBD. The controller can be accessed from every node and enables you to monitor and configure your structure quickly. It can be controlled over REST from the outside and provides a very clear CLI. Furthermore, the LINSTOR REST API gives you the ability to use LINSTOR volumes in Kubernetes, Proxmox VE, OpenNebula and OpenStack.

In summary, if you’re looking for increased performance, fast configuration, and filesystem-based storage for your applications, use LINSTOR and DRBD. If you’re looking to run LINSTOR with HA, however, you must use third-party software such as Pacemaker.

 

CEPH

CEPH is an open source software intended to provide highly scalable object, block, and file-based storage in a unified system.

CEPH consists of a RADOS cluster and its interfaces. The RADOS cluster is a system with services for monitoring and storing data across many nodes. CEPH/RADOS is an object storage cluster with no single point of failure. This is achieved by an algorithm which cuts the data into blocks and spreads them across the RADOS cluster using self-managing services. The CRUSH algorithm is used to spread the data on upload and to put the blocks back together when an object is requested. CEPH is able to use simple data replication as well as erasure coding for those striped blocks.

On top of the RADOS cluster, LIBRADOS is used to upload or request data from the cluster. CEPH uses LIBRADOS for its interfaces: CEPHFS, RBD and RADOSGW.
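The idea of computing where data lives, rather than looking it up in a central table, can be sketched in a few lines. Note that this is not the CRUSH algorithm: the sketch swaps in rendezvous (highest-random-weight) hashing as a much simpler stand-in, and the function place as well as the storage daemon names are made up for illustration.

```python
import hashlib


def place(object_name: str, osds: list[str], replicas: int = 3) -> list[str]:
    """Pick `replicas` distinct storage daemons for an object by hashing alone."""
    def score(osd: str) -> int:
        # Every (object, daemon) pair gets a deterministic pseudo-random score.
        digest = hashlib.sha256(f"{object_name}:{osd}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # The daemons with the highest scores hold the copies.
    return sorted(osds, key=score, reverse=True)[:replicas]


osds = [f"osd.{i}" for i in range(6)]
targets = place("my-object", osds)
print(targets)
```

Any client that knows the list of daemons arrives at the same answer without asking a directory service, and removing a daemon that holds no copy of an object leaves that object's placement untouched.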

CEPHFS gives you the ability to create a filesystem on a host where the data is stored in the CEPH cluster. Additionally, for using CEPHFS, CEPH needs metadata servers which manage the metadata and balance the load for requests among each other.

RBD, or RADOS block device, is used for creating virtual block devices on hosts, with a CEPH cluster managing and storing the data in the background. Since RBD is built on LIBRADOS, RBD inherits LIBRADOS’s abilities, including read-only snapshots and reverting to a snapshot. By striping images across the cluster, CEPH improves read access performance for large block device images. The block device can be virtualized, providing block storage to virtual machines in virtualization platforms such as Apache CloudStack, OpenStack, OpenNebula, Ganeti, and Proxmox Virtual Environment.

RADOSGW is the REST-API for communicating with CEPH/RADOS when uploading and requesting data from the cluster.

In general, CEPH is an object storage cluster with the advantage that you do not have to worry about failing nodes or storage drives, because CEPH recognizes failing devices and replicates the data instantly to another disk where it will be accessed. This also leads to a heavy network load if the devices fail.

Striping data comes with a disadvantage in that it is not possible to access the data on a storage drive by mounting it somewhere else or without a working CEPH cluster.

In conclusion, CEPH is the right solution if you are looking for object storage in your infrastructure. Due to its complexity, you have to expect lower performance in comparison to DRBD, which is only limited by your network speed.

Cheap Votes: DRBD Diskless Quorum

One of the most important considerations when implementing clustered systems is ensuring that a cluster remains cohesive and stable given unexpected conditions. DRBD already has fencing mechanisms and even a system of quorum, which is now capable of using a diskless arbitrator to break ties without requiring additional storage beyond that of two nodes.

Quorum and Fencing With a Healthy Dose of Reality

DRBD’s quorum implementation allows resources to vote on availability, taking into account connection state and disk state. While a DRBD cluster without quorum will allow promotion and writes on any node with “UpToDate” data, DRBD with quorum enabled adds the requirement that this node must also be in contact with either a majority of healthy nodes in the cluster, or a minimum number of nodes as defined statically. This requires at least three nodes, and works best with odd numbers of nodes. A DRBD cluster with quorum enabled cannot become split-brain.

Fencing, on the other hand, employs a mechanism to ensure node state by isolating or powering off a node in some way so that unhealthy nodes can be guaranteed to not provide services (by virtue of being assuredly offline). While the use cases for fencing and quorum overlap to a large degree, fencing can automatically eject or recover misbehaving nodes, while quorum simply ensures that they cannot modify data.
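The majority rule described above comes down to simple arithmetic. The following is a minimal sketch of that arithmetic only, assuming every node has exactly one vote; the function name has_quorum is hypothetical, and the real implementation additionally considers disk states.

```python
def has_quorum(reachable_peers: int, total_nodes: int) -> bool:
    """Return True if this node plus its reachable peers form a strict majority."""
    votes = reachable_peers + 1  # the node always counts itself
    return votes * 2 > total_nodes


# Three nodes: losing one peer keeps quorum, losing both does not.
print(has_quorum(reachable_peers=1, total_nodes=3))  # True
print(has_quorum(reachable_peers=0, total_nodes=3))  # False

# Two nodes: a lone node can never hold a strict majority.
print(has_quorum(reachable_peers=0, total_nodes=2))  # False

# Four nodes: an even split leaves neither half with a strict majority.
print(has_quorum(reachable_peers=1, total_nodes=4))  # False
```

The last two cases show why at least three nodes are needed and why odd node counts work best under a plain majority rule.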

It is possible to utilize scripts that are triggered in response to changes in quorum as a simple but effective fencing system via a “suicide” method: configuring a node to automatically reset or power itself off upon loss of quorum (accomplished via the “quorum-lost” handler in DRBD’s configuration). However, fully-fledged fencing methods via Pacemaker have much more logic behind them, can work even when the node to be fenced is entirely unresponsive, and make Pacemaker clusters “aware” of fencing actions.

The most important element to consider is that while both methods prevent split-brain conditions, quorum does not wholly and entirely replace out-of-band fencing. However, it comes extremely close, and in fact, close enough to eschew Pacemaker-based fencing in many configurations in favor of only quorum where fencing via privileged APIs (as is common in clouds) or dedicated fencing hardware (such as network PDUs or IPMI cards) is less than possible or desired.

Arbitrators!

Before now, in order for a DRBD resource to have three votes across three nodes for quorum, it needed three replicas of data. This was cost prohibitive in some scenarios, so additional logic was added to allow a diskless “arbitrator” node that does not participate in replication. Thusly, the diskless DRBD arbitrator was born.

The concept is fairly simple; rather than require a minimum of three replicas in a DRBD resource to enable quorum functionality, one can now use two replicas (or “data” nodes) with a third DRBD node in a permanently and intentionally diskless state as an “arbitrator” for breaking ties.

The same concepts of traditional DRBD quorum apply, with one significant exception: In a replica 2+A cluster, one node can be lost or disconnected without losing quorum — just like a replica 3 cluster. However, that arbitrator node cannot (on its own) participate in restoring quorum after it is lost.

The reason for this exception is simple: The arbitrator node has no disk. Without a disk, there is no way to independently determine whether data is valid, inconsistent, or related to the cluster at all because there is no data on that node to compare replicas with. While an arbitrator node cannot restore quorum to a single other inquorate data node, two data nodes may establish or re-establish quorum with each other. This is highly effective, and conquers the vast majority of quorum decisions at roughly 66% the cost of a replica 3 cluster.
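The rule for a replica 2+A cluster can be written down as a small state machine, seen from one data node. This is a toy model of the behaviour described above, not DRBD's actual algorithm; the class DataNodeQuorum and its update method are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class DataNodeQuorum:
    """Quorum state of one data node in a two-replica cluster with an arbitrator."""
    quorate: bool = True

    def update(self, peer_data_node: bool, arbitrator: bool) -> bool:
        """Recompute quorum after a connection change and return the new state."""
        if peer_data_node:
            # Two data nodes can always establish or re-establish quorum.
            self.quorate = True
        elif arbitrator and self.quorate:
            # The arbitrator helps to keep quorum that already exists ...
            self.quorate = True
        else:
            # ... but it cannot restore quorum once it has been lost.
            self.quorate = False
        return self.quorate


node = DataNodeQuorum()
print(node.update(peer_data_node=False, arbitrator=True))   # True: one node lost
print(node.update(peer_data_node=False, arbitrator=False))  # False: majority lost
print(node.update(peer_data_node=False, arbitrator=True))   # False: arbitrator alone
print(node.update(peer_data_node=True, arbitrator=False))   # True: data peer is back
```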

Arbitrator Nodes in Action

I will not abide this level of grandstanding without a demonstration of this ability (and hopefully some revealing use cases), so below are some brief test results from a replica 2+A geo cluster. Behold:

root@geo-nfs-a:~# drbdadm status export-able
export-able role:Primary
  disk:UpToDate
  geo-nfs-b role:Secondary
    peer-disk:UpToDate
  geo-nfs-c role:Secondary
    peer-disk:Diskless

As you can see, everything is happy. All of these nodes are connected and up to date. Nodes “geo-nfs-a” and “geo-nfs-b” are data nodes with disks. The node “geo-nfs-c” is a diskless DRBD arbitrator as well as a Booth arbitrator, and quorum has been enabled in this geo cluster (though that’s not reflected in this output). Geo clusters can be tricky to manage the datapath of, since they often operate outside of the scope of rapid decision-making mechanisms and even more often don’t have a method of fencing “sites” adequately. Using DRBD quorum in this case allows split-brains to be entirely prevented globally, rather than depending on several disconnected cluster controllers to manage things. This is much more stable, but requiring three sites with at least one full data replica each is very bandwidth-intensive as well as expensive. This is a perfect fit for an arbitrator node.

If we take one of the two data nodes offline, the cluster will still run. We’re still in contact with the arbitrator, and as long as we don’t lose that contact, quorum will be held:

root@geo-nfs-a:~# drbdadm status export-able
export-able role:Primary
  disk:UpToDate
  geo-nfs-b connection:Connecting
  geo-nfs-c role:Secondary
    peer-disk:Diskless

So let’s make it unhappy. If we take the majority of nodes offline this cluster will freeze, suspending I/O and protecting data from split-brain:

root@geo-nfs-a:~# drbdadm status export-able
export-able role:Primary suspended:quorum
  disk:UpToDate quorum:no blocked:upper
  geo-nfs-b connection:Connecting
  geo-nfs-c connection:Connecting

Reconnecting only the arbitrator node will not result in a quorate cluster, as that arbitrator has no way of knowing whether that data node is actually valid:

root@geo-nfs-a:~# drbdadm status export-able
export-able role:Primary suspended:quorum
  disk:UpToDate quorum:no blocked:upper
  geo-nfs-b connection:Connecting
  geo-nfs-c role:Secondary
    peer-disk:Diskless

Connecting the peer data node will result in I/O resuming even if the arbitrator is still not functioning:

root@geo-nfs-a:~# drbdadm status export-able
export-able role:Primary
  disk:UpToDate
  geo-nfs-b-0 role:Secondary
    peer-disk:UpToDate
  geo-nfs-c connection:Connecting
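Status output like the above is easy to check from a monitoring script. The following is a minimal sketch that assumes the plain-text layout of a single resource exactly as shown in these test results; the function parse_status is a hypothetical helper and not part of drbd-utils.

```python
def parse_status(text: str) -> dict:
    """Parse the plain-text `drbdadm status` output of a single resource."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    resource, *local_fields = lines[0]
    status = {"resource": resource, "local": {}, "peers": {}}
    current = status["local"]
    for tokens in [local_fields] + lines[1:]:
        for token in tokens:
            if ":" in token:
                key, value = token.split(":", 1)
                current[key] = value
            else:
                # A bare word is a peer name and starts a new peer section.
                current = status["peers"].setdefault(token, {})
    return status


SAMPLE = """\
export-able role:Primary suspended:quorum
  disk:UpToDate quorum:no blocked:upper
  geo-nfs-b connection:Connecting
  geo-nfs-c role:Secondary
    peer-disk:Diskless
"""

status = parse_status(SAMPLE)
quorate = status["local"].get("quorum") != "no"
print(quorate)                       # False
print(sorted(status["peers"]))       # ['geo-nfs-b', 'geo-nfs-c']
```

In the sample, the local node reports quorum:no even though the diskless arbitrator is connected, which is exactly the situation described above.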

Conclusion

I was able to use a Booth arbitrator node as a DRBD arbitrator node as well, both managing the cluster application state as well as securing the datapath against corruption with almost zero bandwidth usage beyond that of a 2N system. This is clearly a potent use-case and could not be more simple.

This new quorum mechanism could be applied identically to local high availability clusters, allowing reliable quorate systems to be established using a very low power third node. This can help to cheaply circumvent environmental problems that prevent adequate fencing, such as generic platform-agnostic deployment models, security-restricted environments, and even total lack of out-of-band fencing mechanisms (such as some public clouds or specialized hardware).

For posterity, the following DRBD configuration was used to accomplish this. Keep in mind, this was a geo cluster, so it’s using asynchronous replication (protocol A). Protocol C would be used for synchronous local replication:

# /etc/drbd.conf
global {
    usage-count yes;
}

common {
    options {
        auto-promote     yes;
        quorum           majority;
    }
}

resource export-able {
    volume 0 {
        device           minor 0;
        disk             /dev/drbdpool/export-able;
        meta-disk        internal;
    }

    on geo-nfs-a {
        node-id 0;
        address          ipv4 10.1.0.100:7000;
    }

    on geo-nfs-b {
        node-id 1;
        address          ipv4 10.2.0.100:7000;
    }

    on geo-nfs-c {
        node-id 2;
        volume 0 {
            device       minor 0;
            disk         none;
        }
        address          ipv4 10.3.0.100:7000;
    }

    connection-mesh {
        hosts geo-nfs-a geo-nfs-b geo-nfs-c;
        net {
            protocol A;
        }
    }
}

David Hay
Cluster Daemon at LINBIT
A long-time Linux system engineer, David Hay finds FOSS solutions to global problems as a Cluster Engineer at LINBIT. David started out with open source software back in the Linux 2.4 days, since then having planned and implemented countless clustered systems, leveraging HA and cloud technologies to great effect. When not liberating the enterprise world with free and open software, he spends his time tinkering with electronics and metalworking.

Key/Value Store in LINSTOR

Recently we introduced a Key/Value store in LINSTOR and exposed it in a developer-friendly way in the Python API (python-linstor). The first question is why one would want such a Key/Value store in LINSTOR when there are many high-performance implementations such as etcd. The request for a K/V store was mainly driven by LINSTOR plugin developers. For example, many plugins need to store some kind of metadata, like a description for a resource. Existing, non-LINSTOR plugins sometimes store such information in a local JSON file or in a file per resource. This, on one hand, is clumsy, and on the other hand, in a distributed system like DRBD/LINSTOR, the data needs to be available on all nodes.

In LINSTOR a K/V store has a unique name (e.g., one per plugin) and it can store up to 510 bytes for a key, and 4096 bytes for the value. The implementation in python-linstor provides an interface that mimics a Python 3 dictionary. In addition to the unique name, the K/V store as implemented in the Python library also provides so-called namespaces. One can think of a namespace as a UNIX directory structure where components of a path (i.e., the namespace) are separated by a /. In the following we show an example using the Python library:

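import linstor

# Needs a reachable LINSTOR controller; opens the K/V store "myKV"
# scoped to the namespace /foo/bar/.
kv = linstor.KV('myKV', namespace='/foo/bar/')
kv['key'] = 'val'
print(list(kv.items()))  # [('key', 'val')]

# Seen from the root namespace, the key carries its full path.
kv.namespace = '/'
print(list(kv.items()))  # [('/foo/bar/key', 'val')]

kv['foo/baz/key'] = 'valbaz'
kv.namespace = '/foo/bar'
print(list(kv.items()))  # [('key', 'val')], keys in /foo/baz are not visible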
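To experiment with the namespace semantics without a LINSTOR cluster, the behaviour shown above can be imitated in memory. This is a rough sketch and not the python-linstor implementation: the class NamespacedKV is hypothetical, and applying the 510 byte limit to the full key including its namespace is an assumption made for this example.

```python
class NamespacedKV:
    """In-memory imitation of a namespaced K/V store (hypothetical)."""

    MAX_KEY_BYTES = 510     # limit quoted in the text; applied to the full key here
    MAX_VALUE_BYTES = 4096  # limit quoted in the text

    def __init__(self, namespace: str = "/") -> None:
        self._data: dict[str, str] = {}
        self.namespace = namespace

    @property
    def namespace(self) -> str:
        return self._namespace

    @namespace.setter
    def namespace(self, value: str) -> None:
        # Normalise to a leading and a trailing slash: "/foo/bar" -> "/foo/bar/".
        inner = value.strip("/")
        self._namespace = f"/{inner}/" if inner else "/"

    def _full_key(self, key: str) -> str:
        return self._namespace + key.lstrip("/")

    def __setitem__(self, key: str, value: str) -> None:
        full_key = self._full_key(key)
        if len(full_key.encode()) > self.MAX_KEY_BYTES:
            raise ValueError("key too long")
        if len(value.encode()) > self.MAX_VALUE_BYTES:
            raise ValueError("value too long")
        self._data[full_key] = value

    def __getitem__(self, key: str) -> str:
        return self._data[self._full_key(key)]

    def items(self) -> list:
        """Return the pairs visible in the current namespace."""
        at_root = self._namespace == "/"
        return [
            (full_key if at_root else full_key[len(self._namespace):], value)
            for full_key, value in self._data.items()
            if full_key.startswith(self._namespace)
        ]


kv = NamespacedKV(namespace="/foo/bar/")
kv["key"] = "val"
kv.namespace = "/"
kv["foo/baz/key"] = "valbaz"
kv.namespace = "/foo/bar"
print(kv.items())  # [('key', 'val')]
```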

Key/Value Store makes life easier

Developers already familiar with LINSTOR details might know there is a concept that sounds similar to what the K/V store can do, the so-called “AUX props”. One can attach metadata to basically every LINSTOR object. While they sound similar, there are noteworthy differences:

• An AUX prop is tied to the corresponding object. When the object is gone, the metadata is gone. This might be desired and can be an advantage.

• The K/V store exists as long as the LINSTOR cluster exists. Data is not attached to another LINSTOR object. Depending on the situation this might be an advantage compared to a plain AUX property.

• The K/V store has a much nicer interface. It just behaves like a Python dictionary.

• The K/V store and its namespace implementation make it a lot easier to store hierarchical data.

• Searching AUX props can be difficult: For example, to find a specific AUX prop set on a volume definition, one would have to iterate over the AUX props of every volume definition.

All in all the K/V store makes the life of a plugin developer much easier. BTW: The text of this blog post easily fits into a single K/V pair 😀

Roland Kammerer
Software Engineer at Linbit
Roland Kammerer studied technical computer science at the Vienna University of Technology and graduated with distinction. Currently, he is a PhD candidate with a research focus on time-triggered realtime-systems and works for LINBIT in the DRBD development team.

Containerize LINSTOR

LINBIT and its Software-Defined Storage (SDS) solution LINSTOR have provided integration with Linux containers for quite some time. These integrations range from a Docker volume plugin, to a Flexvolume plugin, and recently, a CSI plugin for Kubernetes. While we always provided excellent integration to the container world, most of our software itself was not available as a container/base image. Containerizing our services is a non-trivial task. As you probably know, the core of the DRBD software consists of a Linux kernel module and user space utilities that interact via netlink with this kernel module. Additionally, our software needs to create LVM devices and DRBD block devices within a container. These tasks are interesting and challenging to put into containers. For this article, we assume three nodes: one node that acts as a LINSTOR controller, and two that act as satellites. We tested this with recent CentOS 7 machines and with a current version of Docker.

Prerequisites

In this article, we assume access to our Docker registry hosted on drbd.io. On all hosts you should run the following commands:

docker login drbd.io
Username: YourUserName
Password: YourPassword
Login Succeeded

Installing the DRBD kernel modules

We need the DRBD kernel module and its dependencies on the LINSTOR satellites (the controller does not need access to DRBD). For that we provide a solution for the most common platforms, namely CentOS 7/RHEL 7 and Ubuntu Bionic.

docker run --privileged -it --rm \
  -v /lib/modules:/lib/modules drbd.io/drbd9:rhel7
DRBD modul sucessfully loaded 

What this does is check which kernel is actually running on the host, then find the most appropriate package in the container and install it. We ship the same, unmodified rpm/deb packages in the container as we provide in our customer repositories. If you are using Ubuntu Bionic, you should use the drbd.io/drbd9:bionic container.

Running a LINSTOR controller

docker run -d --name=linstor-controller \
  -p 3376:3376 -p 3377:3377 drbd.io/linstor-controller

The controller does not have any special requirements; it just needs to be accessible to the client via TCP/IP. Please note that in this configuration the controller’s database is not persisted. One possibility is to bind-mount the directory used for the controller’s database by adding the following option to the command:
-v /some/dir:/var/lib/linstor

Running a LINSTOR satellite

docker run -d --name=linstor-satellite --net=host \
 --privileged drbd.io/linstor-satellite 

The satellite is the component that creates the actual block devices: on one hand the backing devices (usually LVM), and on the other hand the actual DRBD block devices. Therefore this container needs access to /dev, and it needs to share the host networking. Host networking is required for the communication between drbdsetup and the actual kernel module.

Configuring the Cluster

We have to set up LINSTOR as usual, which fortunately, is an easy task and has to be done only once. In the spirit of this blog post, let’s use a containerized LINSTOR client as well. As the client obviously has to talk to the controller, we need to tell the client in the container where to find the controller. This is done by setting the environment variable LS_CONTROLLERS.

docker run -it --rm -e LS_CONTROLLERS=Controller \
  drbd.io/linstor-client interactive
  ...
- volume-definition (vd)
LINSTOR ==> node create Satellite1 172.42.42.10
LINSTOR ==> node create Satellite2 172.42.42.20
LINSTOR ==> storage-pool-definition create drbdpool
LINSTOR ==> storage-pool create lvm Satellite1 drbdpool drbdpool
LINSTOR ==> storage-pool create lvm Satellite2 drbdpool drbdpool
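If the setup should be repeatable, the same steps can be scripted instead of typed into the interactive shell. The following sketch only builds the command lines and does not run them. It assumes that the client image passes its arguments straight to the LINSTOR client (as it does with "interactive" above); the helper client_command is a hypothetical name.

```python
import shlex

IMAGE = "drbd.io/linstor-client"


def client_command(controller: str, *linstor_args: str) -> list[str]:
    """Build a docker command that runs one LINSTOR client command and exits."""
    return [
        "docker", "run", "--rm",
        "-e", f"LS_CONTROLLERS={controller}",
        IMAGE, *linstor_args,
    ]


commands = [
    client_command("Controller", "node", "create", "Satellite1", "172.42.42.10"),
    client_command("Controller", "node", "create", "Satellite2", "172.42.42.20"),
    client_command("Controller", "storage-pool-definition", "create", "drbdpool"),
    client_command("Controller", "storage-pool", "create", "lvm",
                   "Satellite1", "drbdpool", "drbdpool"),
    client_command("Controller", "storage-pool", "create", "lvm",
                   "Satellite2", "drbdpool", "drbdpool"),
]

for argv in commands:
    # Print a copy-and-paste friendly shell line for review.
    print(shlex.join(argv))
```

Each list could be handed to subprocess.run(argv, check=True) once the printed commands have been reviewed.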

Creating a replicated DRBD resource

So far we loaded the kernel module on the satellites, started the controller and satellite containers and configured the LINSTOR cluster. Now it is time to actually create resources.

docker run -it --rm -e LS_CONTROLLERS=Controller \
  drbd.io/linstor-client interactive
  ...
- volume-definition (vd)
LINSTOR ==> resource-definition create demo
LINSTOR ==> volume-definition create demo 1G
LINSTOR ==> resource create demo --storage-pool drbdpool --auto-place 2

If you have drbd-utils installed on the host, you can now see the DRBD resource as usual via drbdsetup status. But we can also use a container to do that. On one of the satellites you can run a throw-away linstor-satellite container which contains drbd-utils:

docker run -it --rm --net=host --privileged \
 --entrypoint=/bin/bash drbd.io/linstor-satellite
$ drbdsetup status
$ lvs

Note that by default you will not see the symbolic links for the backing devices created by LVM/udev in the LINSTOR satellite container. That is expected. In the container you will see something like /dev/drbdpool/demo_00000, while on the host you will only see /dev/dm-X, and lvs will not show the LVs. If you really want to see the LVs on the host, you could execute lvscan -a --cache, but there is no actual reason for that. One might also map the lvmetad socket to the container.

Summary

As you can see, LINBIT’s container story is now complete. It is now possible to deploy the whole stack via containers. This ranges from the lowest level of providing the kernel modules to the highest level of LINSTOR SDS including the client, the controller, and satellites.

Roland Kammerer
Software Engineer at Linbit

A new star is born: LIN:BEAT – The Band

LINBIT is opening up a new division to specifically address our community’s desire to turn the music up to 11! As you all know, LINBIT is famous for its DRBD, Software-Defined Storage (SDS) and Disaster Recovery (DR) solutions. In a paradigm-shifting turn of events by management, LINBIT has decided to expand into the music industry. Since there is so much business potential in playing live concerts, LINBIT has transformed five of their employees into LIN:BEAT – The Band. These concerts are of course made highly-available by utilizing LINBIT’s own DRBD software.


The Band will be touring all Cloud and Linux events around the globe in 2019. Band members use self-written code to produce their unique sound design, reminiscent of drum and bass and heavily influenced by folk punk. The urge to portray all the advantages of DRBD and LINSTOR is so strong they had to send a message to the world: Their songs tell the world about the ups and downs of administrators who handle big storage clusters. LIN:BEAT offers a variety of styles in their musical oeuvre: While “LINSTOR” is a heavy-driven rock song and “Snapshots” speaks to funk-loving people, even “Disaster Recovery,” a love ballad, made it into their repertoire. Lead singer, Phil Reisner, sings sotto voce about his lost love — a RAID rack called “SDS”. Reisner told the reporters, “Administrators are such underrated people. This is sadly unfortunate. We strive to give all administrators a voice. Even if it’s a musical one!”

Crowds will be jumping up and down in excitement when LIN:BEAT comes to town! Be there and code fair!

An Excerpt of the song “My first love is DRBD” written by Phil Reisner:

My first love is DRBD,

there has never been a fee

instead it serves proudly as open source

your replication has changed the course

It’s the crucial key for Linux’s destiny


LINSTOR grows beyond DRBD

For quite some time, LINSTOR has been able to use NVMe-oF storage targets via the Swordfish API. This was expressed in LINSTOR as a resource definition that contains a single resource with one backing disk (that is the NVMe-oF target) and one diskless resource (that is the NVMe-oF initiator).

Layers in the storage stack

In the last few months the team has been busy making LINSTOR more generic, adding support for resource templates. A resource template describes a storage stack in terms of layers for specific resources/volumes. Here are some examples of such storage stacks:

    • DRBD on top of logical volumes (LVM)
    • DRBD on top of zVols (ZFS)
    • Swordfish initiator & target on top of logical volumes (LVM)
    • DRBD on top of LUKS on top of logical volumes (LVM)
    • LVM only

The team came up with an elegant approach that introduces these additional resource templates in ways that allow existing LINSTOR configurations to keep their semantics as the default resource templates.

With this decoupling, we no longer need to have DRBD installed on LINSTOR clusters that do not require the replication functions of DRBD.
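One way to picture such layered stacks is as ordered tuples, top layer first. This is purely an illustrative sketch: the names STACKS, needs_drbd and cluster_needs_drbd are hypothetical and do not reflect how LINSTOR represents resource templates internally.

```python
# Each storage stack is listed from the top layer down to the backing storage.
STACKS = {
    "drbd-lvm": ("DRBD", "LVM"),
    "drbd-zfs": ("DRBD", "ZFS"),
    "swordfish-lvm": ("Swordfish", "LVM"),
    "drbd-luks-lvm": ("DRBD", "LUKS", "LVM"),
    "lvm-only": ("LVM",),
}


def needs_drbd(stack: tuple) -> bool:
    """A stack needs DRBD installed only if DRBD is one of its layers."""
    return "DRBD" in stack


def cluster_needs_drbd(stack_names: list) -> bool:
    """A cluster needs DRBD if at least one of the stacks in use contains it."""
    return any(needs_drbd(STACKS[name]) for name in stack_names)


print(cluster_needs_drbd(["lvm-only", "swordfish-lvm"]))  # False
print(cluster_needs_drbd(["lvm-only", "drbd-luks-lvm"]))  # True
```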

What does that mean for DRBD?

The interests of LINBIT’s customers vary widely. Some want to use LINSTOR without DRBD – which is now supported. A very prominent example of this is Intel, who uses LINSTOR in its Rack Scale Design effort to connect storage nodes and compute nodes with NVMe-oF. In this example, the storage is disaggregated from the other nodes.

Other customers see converged architectures as a better fit. For converged scenarios, DRBD has many advantages over a pure data access protocol such as NVMe-oF. LINSTOR is built from the ground up to manage DRBD, therefore, the need for DRBD support will remain.

Linux-native NVMe-oF and NVMe/TCP

SNIA’s Swordfish has clear benefits as a standard for managing storage targets: it allows optimized storage target implementations, a hardware-accelerated data path, and a non-Linux control path.

Because Swordfish is an extension of Redfish, which needs to be implemented in the Baseboard Management Controller (BMC), we have decided to extend LINSTOR’s driver set to configure NVMe-oF target and initiator software. We do this by utilizing existing tools found within the Linux operating system, eliminating the need for a Swordfish software stack.

Summary

LINSTOR now supports configurations without DRBD. It is now a unified storage orchestrator for replicated and non-replicated storage.

Philipp Reisner
Philipp Reisner is founder and CEO of LINBIT in Vienna/Austria. His professional career has been dominated by developing DRBD, a storage replication for Linux. Today he leads a company of about 30 employees with locations in Vienna/Austria and Portland/Oregon.


Speed Up! NVMe-oF for LINSTOR

What is NVMe?

The storage world has gained a number of new terms in the last few years. Let’s start with NVMe. The abbreviation stands for Non-Volatile Memory Express, which isn’t very self-explanatory. It all began a few years back when NAND flash started to make major inroads into the storage industry, and the new storage medium needed to be accessed through existing interfaces like SATA and Serial Attached SCSI (SAS).

Back at that time, FusionIO created a NAND flash-based SSD that was directly plugged into the PCIe slot of a server. This eliminated the bottleneck of the ATA or SCSI command sets and the interfaces coming from a time of rotating storage media.

The FusionIO products shipped with proprietary drivers, and the industry set out to create an open standard that suits the performance of NAND flash. One of the organizations where the players of the industry can meet, align and create standards is the Storage Networking Industry Association (SNIA).

The first NVMe standard was published in 2011, and it describes a PCIe-based interface and command set to access fast storage. This can be thought of as a cleaned-up version of the ATA or SCSI commands plus a PCIe interface.

What is NVMe-oF and NVMe/TCP?

Similar to what iSCSI is to SCSI, NVMe-oF and NVMe/TCP are standards that describe how to send NVMe commands over networks. NVMe-oF requires an RDMA-capable network (like InfiniBand or RoCE), and NVMe/TCP works on every network that can carry IP traffic.

There are two terms of which to be aware: 1) the initiator is where the applications run that want to access the dataset. Linux comes with a built-in initiator; other OSes already have initiators or will have them soon.

And, 2) the target is where the data is stored. Linux comes with a software target built into the kernel. It might not be obvious that any Linux block device can be made available as an NVMe-oF target using the Linux target software. It is not limited to NVMe devices.
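To make the last point more concrete, the following sketch lists the configfs steps that would export an arbitrary block device through the in-kernel target. It is a hedged illustration: the paths follow the nvmet configfs layout of the Linux kernel as far as we know it, the NQN, device path and address are placeholders, and the function plan_target is hypothetical. It only returns a plan and does not touch the system.

```python
from pathlib import PurePosixPath

NVMET = PurePosixPath("/sys/kernel/config/nvmet")


def plan_target(nqn: str, device: str, address: str,
                transport: str = "tcp", port_id: int = 1,
                service_id: str = "4420") -> list:
    """Return the configfs operations that would export `device` under `nqn`."""
    subsystem = NVMET / "subsystems" / nqn
    namespace = subsystem / "namespaces" / "1"
    port = NVMET / "ports" / str(port_id)
    return [
        ("mkdir", str(subsystem)),
        ("write", str(subsystem / "attr_allow_any_host"), "1"),
        ("mkdir", str(namespace)),
        # Any block device can be used here, not only an NVMe device.
        ("write", str(namespace / "device_path"), device),
        ("write", str(namespace / "enable"), "1"),
        ("mkdir", str(port)),
        ("write", str(port / "addr_trtype"), transport),
        ("write", str(port / "addr_adrfam"), "ipv4"),
        ("write", str(port / "addr_traddr"), address),
        ("write", str(port / "addr_trsvcid"), service_id),
        # Linking the subsystem into the port makes it reachable for initiators.
        ("symlink", str(subsystem), str(port / "subsystems" / nqn)),
    ]


steps = plan_target("nqn.2019-01.example:demo", "/dev/vg0/demo", "192.0.2.10")
for step in steps:
    print(*step)
```

Keeping the plan separate from its execution makes it possible to review or test the steps without root privileges.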

What does this have to do with Swordfish?

While the iSCSI or NVMe-oF standards describe how the READ, WRITE and other operations on block data are shipped from the initiator to the target, they do not describe how a target (volume) gets created or configured. For too many years, this was the realm of vendor specific APIs and GUIs.

SNIA’s Swordfish standard describes how to manage storage targets and make them accessible as NVMe-oF targets. It is a REST API with JSON data. As such, it is easy to understand and embrace.

The major drawback of Swordfish is that it is defined as an extension of Redfish. Redfish is a standard to manage servers over the network. It can be thought of as a modernized IPMI. As such, Redfish will usually be implemented on a Baseboard Management Controller (BMC). While Redfish has its advantages over IPMI, it does not provide something completely new.

On the other hand, Swordfish is something that was not there before, but as it is an extension to Redfish, an implementation of it usually means that the machine needs to have a Redfish-enabled BMC, which may hinder or slow down the adoption of Swordfish.

LINSTOR

Since version 0.7, LINSTOR has been capable of working with storage provided by Swordfish-compliant storage targets, as well as their initiator counterparts.

Summary

LINSTOR has gained the capability of managing storage on Swordfish/NVMe-oF targets, in addition to working with DRBD and direct-attached storage on Linux servers.

Philipp Reisner
Philipp Reisner is founder and CEO of LINBIT in Vienna/Austria. His professional career has been dominated by developing DRBD, a storage replication solution for Linux. Today he leads a company of about 30 employees with locations in Vienna/Austria and Portland/Oregon.

LINSTOR High Level Resource API

High Level Resource API – The simplicity of creating replicated volumes

In this blog post, we present one of our recent extensions to the LINSTOR ecosystem: A high-level, user-friendly Python API that allows simple DRBD resource management via LINSTOR.

Background: So far, LINSTOR components have communicated either via Protocol Buffers or via the Python API that is used in the linstor command line client. Protocol Buffers are a great way to transport serialized structured data between LINSTOR components, but by themselves they don’t provide the necessary abstraction for developers. That is not their job.

Since the early days, we have split the command line client into the client logic (parsing configuration files, parsing command line arguments…) and a Python library (python-linstor). This Python library provides all the bits and pieces to interact with LINSTOR. For example, it provides a MultiLinstor class that handles TCP/IP communication to the LINSTOR controller. Additionally, it allows all the operations that are possible with LINSTOR (e.g. creating nodes, creating storage pools…). For perfectly valid reasons, this API is very low level and pretty close to the actual Protocol Buffer messages sent to the LINSTOR controller.

While developing more and more plugins to integrate LINSTOR into other projects like OpenStack, OpenNebula, Docker Volumes, and many more, we saw the need for a higher-level abstraction.

Finding the Right Abstraction

The first dimension of abstraction is to abstract from LINSTOR internals. For example, it makes perfect sense that recreating an existing resource is an error on a low level (think of it as EEXIST). On a higher level, depending on the actual object, trying to recreate an object might be perfectly fine, and one simply wants to get the existing object back (i.e. idempotency).
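The idempotency idea can be sketched in a few lines of Python. The classes below are an in-memory stand-in written for this example; they are not part of python-linstor, and the names `AlreadyExists`, `LowLevelClient`, and `ensure_resource` are hypothetical.

```python
class AlreadyExists(Exception):
    """Hypothetical low-level error, comparable to EEXIST."""

class LowLevelClient:
    """In-memory stand-in for a low-level API (not python-linstor)."""

    def __init__(self):
        self._resources = {}

    def create(self, name):
        # Low level: creating something twice is an error.
        if name in self._resources:
            raise AlreadyExists(name)
        self._resources[name] = {"name": name}
        return self._resources[name]

    def get(self, name):
        return self._resources[name]

def ensure_resource(client, name):
    """High level: return the resource, creating it only if needed."""
    try:
        return client.create(name)
    except AlreadyExists:
        return client.get(name)

client = LowLevelClient()
first = ensure_resource(client, "foo")
second = ensure_resource(client, "foo")  # no error, same object
print(first is second)
```

The caller of the high-level function no longer has to care whether the object existed before, which is what plugin code usually wants.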

The second dimension of abstraction is from DRBD and LINSTOR as a whole. Developers dealing with storage already have a good knowledge of concepts like nodes, storage pools, resources, volumes, placement policies… This is the part where we can make LINSTOR and DRBD accessible to new developers.

The third goal was to provide only the set of objects that are important in the context of the user/developer. This, for example, means that we can assume that the LINSTOR cluster is already set up, so we do not need to provide a high-level API to add nodes. For the higher-level API we can focus on [LINSTOR] resources. This allows us to satisfy the KISS (keep-it-simple-stupid) principle. A fourth goal was to introduce new, higher-level concepts like placement policies. Placement policies/templates are concepts currently developed in core LINSTOR, but we can already provide the basics on a higher level.

Demo Time

We start by creating a replicated 10 GB LINSTOR/DRBD volume in a 3-node cluster. We want the volume to be stored with a redundancy of 2. Then we increase the size of the volume to 20 GB.

$ python
>>> import linstor
>>> foo = linstor.Resource('foo')
>>> foo.volumes[0] = linstor.Volume("10 GB")  # there are multiple ways to specify the size
>>> foo.placement.redundancy = 2
>>> foo.autoplace()
>>> foo.volumes[0].size += 10 * (2 ** 30)  # this line is enough to resize a replicated volume cluster-wide

We needed 5 lines of code to create a replicated DRBD volume in a cluster! Let that sink in for a moment and compare it to the steps that were necessary without LINSTOR: Creating backing devices on all nodes, writing and synchronizing DRBD res(ource) files, creating meta-data on all nodes, drbdadm up the resource and force one to the Primary role to start the initial sync.

For the next step we assume that the volume is replicated and that we are a storage plugin developer. Our goal is to make sure the volume is accessible on every node because the block device should be used in a VM. So, A) make sure we can access the block device, and B) find out what the name of the block device of the first volume actually is:

>>> import socket
>>> foo.activate(socket.gethostname())

>>> print(foo.volumes[0].device_path)

The activate method is a good example of the kind of abstraction we intended. Note that we autoplaced the resource 2 times in a 3-node cluster, so LINSTOR chose the nodes that fit best. But now we want the resource to be accessible on every node without increasing the redundancy to 3 (because that would need additional storage, and data that is replicated 2 times is good enough).
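To give a feeling for what "chose the nodes that fit best" could mean, here is a deliberately simple toy policy in Python: pick the nodes with the most free space. This is an assumption made for illustration and not LINSTOR's actual placement algorithm; the function name `pick_nodes` and the capacity numbers are hypothetical.

```python
def pick_nodes(free_space_by_node, redundancy):
    """Toy placement policy: choose the nodes with the most free space.

    Ties are broken by node name so the result is deterministic.
    """
    if redundancy > len(free_space_by_node):
        raise ValueError("not enough nodes for the requested redundancy")
    ranked = sorted(
        free_space_by_node,
        key=lambda node: (-free_space_by_node[node], node),
    )
    return ranked[:redundancy]

# Hypothetical free capacity per node, in GiB.
cluster = {"alpha": 500, "bravo": 120, "charlie": 300}
print(pick_nodes(cluster, 2))  # ['alpha', 'charlie']
```

In this example the third node, bravo, holds no replica, which is exactly the situation where a diskless assignment becomes useful.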

Diskless clients

Fortunately DRBD has us covered as it has the concept of diskless clients. These nodes provide a local block device as usual, but they read and write data from/to their peers only over the network (i.e. no local storage). Creating this diskless assignment is not necessary if the node was already part of the replication in the first place (then it already has access to the data locally).

This is exactly what activate does: If the node can already access the data – fine, if not, create a diskless assignment. Now assume we are done and we do not need access to the device anymore. We want to do some cleanup because we do not need a diskless assignment:

>>> foo.deactivate(socket.gethostname()) 

The semantics of this method are to remove the assignment if it is diskless (as it does not contribute to actual redundancy). If the node stores actual data, deactivate does nothing and keeps the data as redundant as it was. This is only a very small subset of the functionality the high-level API provides. There is a lot more to know, like creating snapshots, converting diskless assignments to diskful ones and vice versa, or managing DRBD Proxy. For more information, check the online documentation.
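The activate/deactivate semantics described above can be modelled in a few lines of Python. This is a simplified in-memory model written for this explanation, not the python-linstor implementation; the class name `ResourceModel` and the node names are hypothetical.

```python
class ResourceModel:
    """Toy model of a resource with diskful and diskless assignments."""

    def __init__(self, diskful_nodes):
        self.diskful = set(diskful_nodes)   # nodes storing a replica
        self.diskless = set()               # nodes accessing data over the network

    def activate(self, node):
        # A node that already stores the data needs nothing extra.
        if node not in self.diskful:
            self.diskless.add(node)

    def deactivate(self, node):
        # Only diskless assignments are removed; replicas stay untouched.
        self.diskless.discard(node)

    def redundancy(self):
        return len(self.diskful)

foo = ResourceModel(diskful_nodes=["alpha", "charlie"])
foo.activate("bravo")      # creates a diskless assignment
foo.activate("alpha")      # no change, alpha already has the data
foo.deactivate("bravo")    # diskless assignment is cleaned up
foo.deactivate("alpha")    # no change, redundancy is preserved
print(foo.redundancy())    # 2
```

Both calls are safe to repeat, which keeps plugin code free of special cases for nodes that happen to hold a replica.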

If you want to go deeper into the LINSTOR universe, please visit our YouTube channel.

Roland Kammerer
Software Engineer at LINBIT
Roland Kammerer studied technical computer science at the Vienna University of Technology and graduated with distinction. Currently, he is a PhD candidate with a research focus on time-triggered realtime-systems and works for LINBIT in the DRBD development team.

LINSTOR CSI Plugin for Kubernetes

A few weeks ago, LINBIT publicly released the LINSTOR CSI (Container Storage Interface) plugin. This means LINSTOR now has a standardized way of working with any container orchestration platform that supports CSI. Kubernetes is one of those platforms, so our developers put in the work to make LINSTOR integration with Kubernetes easy, and I’ll show you how!

You’ll need a couple things to get started:

  • Kubernetes Cluster (1.12.x or newer)
  • LINSTOR Cluster

LINSTOR’s CSI plugin requires certain Kubernetes feature gates be enabled on the kube-apiserver and each kubelet.

Enable the required feature gates on the kube-apiserver by adding --feature-gates=KubeletPluginsWatcher=true,CSINodeInfo=true to the list of arguments passed to the kube-apiserver system pod in the /etc/kubernetes/manifests/kube-apiserver.yaml manifest. It should look something like this:

# cat /etc/kubernetes/manifests/kube-apiserver.yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: null
  labels:
    component: kube-apiserver
    tier: control-plane
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-apiserver
    ... snip ...
    - --feature-gates=KubeletPluginsWatcher=true,CSINodeInfo=true
    ... snip ...

To enable the CSINodeInfo and CSIDriverRegistry feature gates on the kubelet, add the following argument to the KUBELET_EXTRA_ARGS variable in /etc/sysconfig/kubelet: --feature-gates=CSINodeInfo=true,CSIDriverRegistry=true. Your config should look something like this:

# cat /etc/sysconfig/kubelet 
KUBELET_EXTRA_ARGS="--feature-gates=CSINodeInfo=true,CSIDriverRegistry=true"
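If a --feature-gates argument is already present in your configuration, the new gates have to be merged into the existing comma-separated list instead of being added as a second argument. The following Python sketch shows one way to do that merge; the helper name `merge_feature_gates` is hypothetical, and the sketch only manipulates the string, it does not edit any file.

```python
def merge_feature_gates(existing, required):
    """Merge required gates into an existing 'Name=value,...' string.

    Gates already present keep their position; required values win.
    """
    gates = {}
    for item in filter(None, (part.strip() for part in existing.split(","))):
        name, _, value = item.partition("=")
        gates[name] = value
    gates.update(required)
    return ",".join(f"{name}={value}" for name, value in gates.items())

required = {"CSINodeInfo": "true", "CSIDriverRegistry": "true"}
print(merge_feature_gates("CSINodeInfo=false", required))
# CSINodeInfo=true,CSIDriverRegistry=true
```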

Once you’ve modified those two configurations, you can prepare your configuration for the CSI plugin’s sidecar containers. curl down the latest version of the plugin definition:

# curl -O \
https://raw.githubusercontent.com/LINBIT/linstor-csi/master/examples/k8s/deploy/linstor-csi.yaml

Set the value: of each instance of LINSTOR-IP in linstor-csi.yaml to the IP address of your LINSTOR Controller. The placeholder IP in the example yaml is 192.168.100.100, so you can update this address with the following command (or with an editor of your choice). Simply set CON_IP to your controller’s IP address:

# CON_IP="x.x.x.x"; sed -i.example "s/192\.168\.100\.100/$CON_IP/g" linstor-csi.yaml
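If you prefer to script this step, the same substitution can be sketched in Python. The advantage is that the address can be validated before the manifest is touched. The function name `set_controller_ip` and the sample manifest fragment are assumptions made for this example; only the placeholder address comes from the text above.

```python
import ipaddress

PLACEHOLDER_IP = "192.168.100.100"  # placeholder used in the example yaml

def set_controller_ip(manifest_text, controller_ip):
    """Replace the placeholder IP with the real controller address."""
    # Raises ValueError if controller_ip is not a valid IP address.
    ipaddress.ip_address(controller_ip)
    return manifest_text.replace(PLACEHOLDER_IP, controller_ip)

# Hypothetical fragment of a manifest, for illustration only.
sample = "- name: LINSTOR-IP\n  value: 192.168.100.100\n"
print(set_controller_ip(sample, "10.0.0.5"))
```

Reading linstor-csi.yaml, passing its contents through this function, and writing the result back gives the same outcome as the sed one-liner.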

Finally, apply the yaml to the Kubernetes cluster:

# kubectl apply -f linstor-csi.yaml

You should now see the linstor-csi sidecar pods running in the kube-system namespace:

# watch -n1 -d kubectl get pods --namespace=kube-system --output=wide

Once the pods are running, you can define storage classes in Kubernetes that point to your LINSTOR storage pools. From these storage classes you can then provision persistent volumes, optionally replicated by DRBD, for your containers.

Here is an example yaml definition of a storage class that uses the LINSTOR storage pool named thin-lvm in my cluster:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: linstor-autoplace-1-thin-lvm
provisioner: io.drbd.linstor-csi
parameters:
  autoPlace: "1"
  storagePool: "thin-lvm"

And here is an example yaml definition for a persistent volume claim carved out of the above storage class:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    volume.beta.kubernetes.io/storage-class: linstor-autoplace-1-thin-lvm
  name: linstor-csi-pvc-0
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
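When you need several storage classes (for example one per replica count), generating the manifests can be less error-prone than copying yaml by hand, because the claim must reference the storage class by its exact name. The following Python sketch builds both objects as dictionaries and prints them as JSON, which kubectl accepts as well as yaml. The field values mirror the two examples above; the helper names and the naming scheme for the class are assumptions.

```python
import json

def storage_class(storage_pool, replicas):
    """Build a StorageClass manifest for the LINSTOR CSI provisioner."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": f"linstor-autoplace-{replicas}-{storage_pool}"},
        "provisioner": "io.drbd.linstor-csi",
        "parameters": {"autoPlace": str(replicas), "storagePool": storage_pool},
    }

def volume_claim(claim_name, storage_class_name, size):
    """Build a PersistentVolumeClaim that refers to the given class."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {
            "name": claim_name,
            "annotations": {
                "volume.beta.kubernetes.io/storage-class": storage_class_name
            },
        },
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }

sc = storage_class("thin-lvm", 2)
pvc = volume_claim("linstor-csi-pvc-0", sc["metadata"]["name"], "10Gi")
print(json.dumps(sc, indent=2))
print(json.dumps(pvc, indent=2))
```

Each printed document can be saved to its own file and applied with kubectl apply -f, just like the yaml versions.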

Put it all together and you’ve got yourself an open source, high performance, block device provisioner for your persistent workloads in Kubernetes!

There are many ways to craft your storage class definitions for node selection, storage tiering, diskless attachments, or even off-site replicas. We’ll be working on our documentation surrounding new features, so stay tuned, and don’t hesitate to reach out for the most up-to-date information about LINBIT’s software!

Read more: CSI Plugin for LINSTOR Complete.

Matt Kereczman
Matt is a Linux Cluster Engineer at LINBIT with a long history of Linux System Administration and Linux System Engineering. Matt is a cornerstone in LINBIT’s support team, and plays an important role in making LINBIT’s support great. Matt was President of the GNU/Linux Club at Northampton Area Community College prior to graduating with Honors from Pennsylvania College of Technology with a BS in Information Security. Open Source Software and Hardware are at the core of most of Matt’s hobbies.


How to Set Up LINSTOR in OpenStack

This post walks through the installation and setup procedures for deploying LINSTOR as a persistent, replicated, and high-performance source of block storage within the DevStack version of OpenStack running on an Ubuntu host. We will refer to this Ubuntu host as the LINSTOR Controller. This setup also requires at least one additional Ubuntu node handling replicated data, and we will refer to this node as the LINSTOR Satellite. You may have more than one satellite node for increased redundancy.

Initial Requirement

The LINSTOR driver is a messenger between the underlying DRBD/LINSTOR and OpenStack. Therefore, both DRBD/LINSTOR and OpenStack must be pre-installed and configured. Once LINSTOR is installed, each node must be registered with LINSTOR and have a predefined storage pool on a thin LVM volume.

Install DRBD / LINSTOR on OpenStack Cinder node as a LINSTOR Controller node

# First, download and run a python script to enable LINBIT repo
curl -O 'https://my.linbit.com/linbit-manage-node.py'
chmod u+x linbit-manage-node.py
./linbit-manage-node.py

# Install the DRBD, LINSTOR, and LVM packages
sudo apt install -y drbd-dkms lvm2
sudo apt install -y linstor-controller linstor-satellite linstor-client
sudo apt install -y drbdtop

Configure the LINSTOR Controller

# Start both LINSTOR Controller and Satellite Services
systemctl enable linstor-controller.service
systemctl start linstor-controller.service
systemctl enable linstor-satellite.service
systemctl start linstor-satellite.service

# Create backend storage for DRBD/LINSTOR by creating a Volume Group 'drbdpool'
# Specify appropriate volume location (/dev/vdb)
sudo vgcreate drbdpool /dev/vdb

# Create a Logical Volume 'thinpool' within 'drbdpool'
# Specify appropriate thin volume size (64G)
sudo lvcreate -L 64G -T drbdpool/thinpool

Install DRBD / LINSTOR on all other LINSTOR Satellite node(s)

# First enable the LINBIT repo by running the Python script as on the
# controller node, then install the DRBD / LINSTOR packages
sudo apt install -y drbd-dkms lvm2
sudo apt install -y linstor-satellite
sudo apt install -y drbdtop

Configure the LINSTOR Satellite node(s)

# Start LINSTOR Satellite Service
systemctl enable linstor-satellite.service
systemctl start linstor-satellite.service

# Create backend storage for DRBD/LINSTOR by creating a Volume Group 'drbdpool'
# Specify appropriate volume location (/dev/vdb)
sudo vgcreate drbdpool /dev/vdb

# Create a Logical Volume 'thinpool' within 'drbdpool'
# Specify appropriate thin volume size (64G)
sudo lvcreate -L 64G -T drbdpool/thinpool

Configure LINSTOR cluster (nodes and storage pool definitions) from the Controller node

# Create the controller node as combined controller and satellite node
linstor node create cinder-node-name 192.168.1.100 --node-type Combined

# Create the satellite node(s)
linstor node create another-linstor-node 192.168.1.101
# repeat to add more satellite nodes in the LINSTOR cluster

# Create a LINSTOR storage pool on each node
# For each node, specify the volume type (lvmthin), the node name,
# the storage pool name (DfltStorPool) and the backing thin pool

# On the LINSTOR Controller 
linstor storage-pool create lvmthin cinder-node-name DfltStorPool \
    drbdpool/thinpool
# On the LINSTOR Satellite node(s)
linstor storage-pool create lvmthin another-linstor-node DfltStorPool \
    drbdpool/thinpool
# repeat to add a storage pool to each node in the LINSTOR cluster
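For clusters with more than a handful of nodes, the repeated commands above can be generated from a node list. The following Python sketch only produces the command strings shown in this section; it does not talk to LINSTOR. The function name `cluster_commands` and the tuple layout of the node list are assumptions made for this example.

```python
def cluster_commands(nodes, pool="DfltStorPool", backing="drbdpool/thinpool"):
    """Generate 'linstor' CLI commands for a list of nodes.

    Each node is a (name, ip, is_combined) tuple; is_combined marks the
    node that runs both the controller and a satellite.
    """
    commands = []
    # First register every node with the controller.
    for name, ip, is_combined in nodes:
        command = f"linstor node create {name} {ip}"
        if is_combined:
            command += " --node-type Combined"
        commands.append(command)
    # Then define the thin LVM storage pool on each node.
    for name, _ip, _is_combined in nodes:
        commands.append(
            f"linstor storage-pool create lvmthin {name} {pool} {backing}"
        )
    return commands

nodes = [
    ("cinder-node-name", "192.168.1.100", True),
    ("another-linstor-node", "192.168.1.101", False),
]
for line in cluster_commands(nodes):
    print(line)
```

Printing the commands first and reviewing them before running anything keeps the procedure as transparent as the manual steps.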

 

Cinder Driver Installation & Configuration

Download the latest driver (linstordrv.py)

wget https://raw.githubusercontent.com/LINBIT/openstack-cinder/stein-linstor/cinder/volume/drivers/linstordrv.py

Install the driver file in the proper destination

/opt/stack/cinder/cinder/volume/drivers/linstordrv.py

Configure OpenStack Cinder by editing /etc/cinder/cinder.conf to enable the LINSTOR driver: add ‘linstor’ to enabled_backends

[DEFAULT]
...
enabled_backends=lvm, linstor
...

Then, add a LINSTOR section at the bottom of the cinder.conf

[linstor]
volume_backend_name = linstor
volume_driver = cinder.volume.drivers.linstordrv.LinstorDrbdDriver
linstor_default_volume_group_name=drbdpool
linstor_default_uri=linstor://localhost
linstor_default_storage_pool_name=DfltStorPool
linstor_default_resource_size=1
linstor_volume_downsize_factor=4096
linstor_controller_diskless=False
iscsi_helper=tgtadm
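A common mistake at this step is to list a backend in enabled_backends without adding the matching section (or the other way around). The following Python sketch checks for that using the standard configparser module. The sample text is a shortened, hypothetical configuration, and a real cinder.conf may contain constructs that configparser handles differently from Cinder's own parser, so treat this as a quick sanity check only.

```python
import configparser

# Shortened, hypothetical configuration used for the demonstration.
SAMPLE_CONF = """
[DEFAULT]
enabled_backends = lvm, linstor

[linstor]
volume_backend_name = linstor
volume_driver = cinder.volume.drivers.linstordrv.LinstorDrbdDriver
"""

def missing_backends(conf_text):
    """Return enabled backends that have no section of the same name."""
    # Interpolation is disabled so that '%' characters are taken literally.
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read_string(conf_text)
    enabled = cfg.defaults().get("enabled_backends", "")
    names = [name.strip() for name in enabled.split(",") if name.strip()]
    return [name for name in names if not cfg.has_section(name)]

print(missing_backends(SAMPLE_CONF))  # ['lvm']
```

In the sample, the lvm backend is reported because its section was left out on purpose, while linstor passes the check.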

Update Python libraries

sudo pip install protobuf --upgrade
sudo pip install eventlet --upgrade

Register LINSTOR with Cinder

cinder type-create linstor
cinder type-key linstor set volume_backend_name=linstor

Lastly, restart Cinder services

sudo systemctl restart devstack@c-vol.service
sudo systemctl restart devstack@c-api.service
sudo systemctl restart devstack@c-sch.service

 

Verification of proper installation

Check system journal for any driver errors

# Check if there is a recurring error after restart
sudo journalctl -f -u devstack@c-* | grep error

Create a test volume with LINSTOR backend

# Create a 1GiB volume through Cinder and verify LINSTOR backing exists
openstack volume create --type linstor --size 1 --availability-zone nova \
    linstor-test-vol
openstack volume list
linstor resource list

Delete the test volume

# Delete the test volume and verify if LINSTOR removed resources correctly
openstack volume delete linstor-test-vol
linstor resource list

 

Final Comments

By now, the LINSTOR driver should have successfully created a Cinder volume with the matching LINSTOR resources on the backend and then removed them again. From this point on, managing LINSTOR volumes should be a breeze with OpenStack Horizon’s GUI.

Management of LINSTOR snapshots and creation of LINSTOR volumes from those snapshots are also possible. Once a LINSTOR volume becomes available, it can then be made accessible within a Nova instance by creating an attachment. Any LINSTOR-backed volume can then provide replicated and persistent storage.

Please direct any questions regarding the specifics of the driver to Woojay Poynter. For any inquiry regarding DRBD and LINSTOR technology, please contact the LINBIT sales team.


Woojay Poynter
IO Plumber
Woojay is working on data replication and software-defined storage with LINSTOR, built on DRBD @LINBIT. He has worked on web development, embedded firmware, professional culinary education, and power carving with ice and wood. He is a proud father and likes to play with LEGO.