Kubernetes Operator: LINSTOR’s Little Helper

Before we describe what our LINSTOR Operator does, it is a good idea to discuss what a Kubernetes Operator actually is. If you are already familiar with Kubernetes Operators, feel free to skip the introduction.

Introduction

CoreOS describes Operators like this:

An Operator is a method of packaging, deploying and managing a Kubernetes application. A Kubernetes application is an application that is both deployed on Kubernetes and managed using the Kubernetes APIs and kubectl tooling.

That is quite a lot to grasp if you are new to the concept of Operators. Users want to get their work done without having to worry about setting up infrastructure. Still, there is more to a software lifecycle than just firing up the application in a cluster once. This is where an Operator comes into play. In my opinion, a good analogy is to think of a Kubernetes Operator as an actual human operator. So what would the responsibilities of such a human operator be?

A human operator would be an expert in the business logic of the software that she runs and in the dependencies that need to be fulfilled to run it. Additionally, the operator would be responsible for configuring the software: scaling it, upgrading it, making backups, and so on. This is also the responsibility of a Kubernetes Operator implemented in software. It is a software component built by experts for a particular piece of containerized software, and it is executed by an administrator who isn’t necessarily an expert in that particular software.

A Kubernetes Operator is executed in the Kubernetes cluster itself. In contrast to shell scripts or Ansible playbooks, which are pretty generic, an Operator has one specific purpose, as well as access to Kubernetes cluster information. Additionally, Operators are managed by standard Kubernetes tools rather than by external configuration management tools. An Operator has a managed lifecycle of its own and is itself managed by the lifecycle manager.

LINSTOR Operator

The CoreOS FAQ contains the following sentence:

Experience has shown that the creation of an Operator typically starts by automating an application’s installation and self-service provisioning capabilities, and then evolves to take on more complex automation.

This pretty much sums up the current state of the linstor-operator. The project is still very young, and the current focus has been on automating the setup of the LINSTOR cluster. If you are familiar with LINSTOR, you know that there is a central component named the LINSTOR controller, and workers — named LINSTOR satellites — that actually create LVM volumes and configure DRBD to provide data redundancy.

Currently, the Operator can add new nodes/kubelets to the LINSTOR cluster by registering them, along with their name and network interface configuration, with the LINSTOR controller. A second important task is ensuring that a LINSTOR satellite can actually provide storage to the cluster, so the linstor-operator can also register one or more storage pools. The Operator also exposes metrics for the storage pools of the LINSTOR satellite set. This saves the system administrator a lot of time, because the LINSTOR Kubernetes Operator automates many standard tasks that were previously performed manually.
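
To give a feel for what the Operator automates, here is a rough sketch of the equivalent manual registration using the python-linstor library. The controller address, node name, IP address and storage pool names below are made up for the example, and the exact low-level method signatures may differ between python-linstor versions, so treat this as an illustration rather than the Operator’s actual code.

import linstor

# Illustrative sketch only -- not the Operator's actual code.
lin = linstor.Linstor('linstor://linstor-controller')  # assumed controller address
lin.connect()

# Register a satellite node with its name and network interface configuration
lin.node_create('worker-1', 'SATELLITE', '10.0.0.11')

# Register an LVM-backed storage pool on that node so it can provide storage
lin.storage_pool_create('worker-1', 'drbdpool', 'LVM', 'drbdpool')

lin.disconnect()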

Future work

There is a lot of possible future work that can be done. Some tasks are obvious; some will be driven by actual input from our users. For example, we think that configuring storage is one of the pain points for our users, so having a sidecar container that can discover and prepare storage pools to be consumed by LINSTOR might be a good idea. In a dynamic environment such as Kubernetes, it might be worthwhile to handle node failures in clever ways. We also have container images that can inject the DRBD kernel module into the running host kernel, which could help users get started with DRBD. High availability is always an important topic, and related to that, so is using etcd as the database back end. Further, we want to tackle one of the core tasks of Operators, which is managing upgrades from one LINSTOR controller version to the next.

Thanks to Hayley for reviewing this blog post even though she is busy doing the actual work. This is just a subset of the capabilities we plan for the linstor-operator. Stay tuned for more information!

 

Roland Kammerer
Software Engineer at Linbit
Roland Kammerer studied technical computer science at the Vienna University of Technology and graduated with distinction. Currently, he is a PhD candidate with a research focus on time-triggered realtime-systems and works for LINBIT in the DRBD development team.

NuoDB on LINSTOR-Provisioned Kubernetes Volumes

Introduction:

NuoDB and LINBIT put our technologies together to see just how well they performed, and we are both happy with the results. We decided to compare LINSTOR provisioned volumes against Kubernetes Hostpath (Direct Attached Storage) in a Kubernetes cluster hosted in Google’s cloud platform (GCP) to show that our on-prem testing results can also be proven in a popular cloud-computing environment.

Background:

NuoDB is an ANSI SQL-compliant, ACID-transactional, container-native distributed OLTP database that provides responsive scalability and continuous availability. This makes it a great choice for your distributed applications running in cloud provider-managed and open source Kubernetes environments, such as GKE, EKS, AKS and Red Hat OpenShift. As you scale out NuoDB Transaction Engines in your cluster, you’re scaling out the database’s capacity to process dramatically more SQL transactions per second and, at the same time, building in process redundancy to ensure the database — and applications — are always on.

As you scale your database, you also need to scale the storage that the database is using to persist its data. This is usually where things get sluggish. Highly-scalable storage isn’t always highly-performant, and it seems most of the time the opposite is true. Highly-scalable, highly-performant storage is the niche that LINSTOR aims to fill.

LINSTOR, LINBIT’s SDS software, can be used to deploy DRBD devices in large scale storage clusters. DRBD devices are expected to be about as fast as the backing disk they were carved from, or as fast as the network device DRBD is replicating over (if DRBD’s replication is enabled). At LINBIT we usually aim for a performance impact of less than 5% when using DRBD replication in synchronous mode.

The LINSTOR CSI (container storage interface) driver for Kubernetes allows you to dynamically provision LINSTOR provisioned block devices as persistent volumes for your container workloads… you see where I’m going… 🙂

Testing:

I spun up a 3-node GKE (Google Kubernetes Engine) cluster in GCP and customized the standard node type with 6 vCPUs and 22 GB of memory for each node:

[Screenshot: GKE nodes]

When using GKE to spin up a Kubernetes cluster, you’re provided with a “standard” storage class by default. This “standard” storage class dynamically provisions and attaches GCE standard disks to your containers that need persistent volumes. Those GCE standard disks are the pseudo “hostpath” device we wanted to compare against, so we deployed NuoDB into the cluster, and ran a YCSB (Yahoo Cloud Serving Benchmark) SQL workload against it to generate our baseline:

[Screenshot: NuoDB Insights overview for the GKE baseline]

Using the NuoDB Insights visual monitoring tool (which comes as standard equipment with NuoDB), we can see in the chart above that we had 3 TE (Transaction Engine) pods feeding into 1 SM (Storage Manager) pod. We can also see that our Aggregate Transaction Rate (TPS) is hovering just over 15K transactions per second. As a side note, this deployment created 5 GCE standard disks in my Google Compute Engine account.

LINSTOR provisions its storage from an established LINSTOR cluster, so for our LINSTOR comparison I had to stand up Kubernetes on GCE nodes “the hard way” so I could also stand up a LINSTOR cluster on the nodes (see LINBIT’s user’s guide or the LINSTOR quickstart for more on those steps). I created 4 nodes as VM instances for Kubernetes. 3 nodes were set up to mimic the GKE cluster, each with 6 vCPUs and 22 GB of memory, plus 1 master node – with a master node taint in Kubernetes so we will not schedule pods on it – with 2 vCPUs and 16 GB of memory. Google recommended I scale these nodes back to save money, so I did that, resulting in the following VM instances:

[Screenshot: GCE VM instances]

After setting up the LINSTOR and Kubernetes cluster in the GCE VM Instances, I attached a single “standard” GCE disk to each node for LINSTOR to provision persistent volumes from, and deployed the same NuoDB distributed database stack and YCSB workload into the cluster:

[Screenshot: NuoDB Insights overview for the GCE/LINSTOR cluster]

After letting the benchmarks run for some time, I could see that we were hovering just under 15k, which is within the expected 5% of our ~15k baseline!

Conclusion:

You might be thinking, “That’s good and all, but why not just use GKE with the GCE-backed ‘standard’ storage class?” The answer is features. Using LINSTOR to provide storage to your container platform enables you to:

  • Add replicas of volumes for resiliency at the storage layer – including remote replicas
  • Use replicas of your volumes in DRBD’s read balancing policies which could increase your read speeds beyond what’s possible from a single volume
  • Provide granular control of snapshots at either the Kubernetes or LINSTOR-level
  • Provide the ability to clone volumes from snapshots
  • Enable transparently encrypted volumes
  • Provide data-locality or accessibility policies
  • Lower managerial overhead in terms of the number of physical disks (comparing one GCE disk for each PV with GKE vs. one GCE disk for each storage node with LINSTOR).

Ultimately, the combination of NuoDB and LINSTOR enables clients to run high-performance persistent databases in the cloud or on premises with ease of scaling and “always-on” resiliency. So far, after testing both proprietary and open-source software, NuoDB has found that LINSTOR’s open-source SDS is a production-ready, high-performance, and highly reliable storage solution for provisioning persistent volumes.

 

Matt Kereczman on Linkedin
Matt Kereczman
Matt is a Linux Cluster Engineer at LINBIT with a long history of Linux System Administration and Linux System Engineering. Matt is a cornerstone in LINBIT’s support team, and plays an important role in making LINBIT’s support great. Matt was President of the GNU/Linux Club at Northampton Area Community College prior to graduating with Honors from Pennsylvania College of Technology with a BS in Information Security. Open Source Software and Hardware are at the core of most of Matt’s hobbies.

DRBD/LINSTOR vs Ceph – a technical comparison

INTRODUCTION

The aim of this article is to give you some insight into CEPH, DRBD and LINSTOR by outlining their basic functions. The following points should help you compare these products and understand which is the right solution for your system. Before we start, you should be aware that LINSTOR is made for DRBD and that using LINSTOR is highly recommended if you are also using DRBD.

DRBD

DRBD works by inserting a thin layer between the file system, the buffer cache, and the disk driver. The DRBD kernel module captures all requests from the file system and splits them into two paths. So, how does the actual communication occur? How do two separate servers optimize data protection?

DRBD facilitates communication by mirroring two separate servers. One server, although passive, is usually a direct copy of the other. Any data written to the primary server is simultaneously copied to the secondary server through a real-time communication system. The passive server immediately replicates any changes made to the data.

DRBD 8.x works on two nodes at a time. One is given the role of the primary node while the other is given a secondary role. Reads and writes can only occur on the primary node.

THE BENEFITS OF DRBD 9

The features of DRBD 9.x are a vast improvement over the 8.x version. It is now possible to have up to 32 replicas, including the primary node. This gives you the ability to build your cluster setup with what we call diskless nodes, meaning you don’t have to use storage on your primary node. The primary node in diskless mode still has a DRBD block device, but the data is accessed on the secondary nodes over the network.

The secondary nodes must not mount the file system, not even in read-only mode. While it is true to say that the secondary nodes see all updates on the primary node, they can’t expose these updates to the file system, as DRBD is completely file system agnostic.

One write goes to the actual disk and another to the mirrored disks on a peer node. If the first one fails, the file system can be brought up on one of the peer nodes and the data will be available for use.

DRBD has no precise knowledge of the file system and, as such, it has no way of communicating the changes upstream to the file system driver. The two-at-a-time rule does not actually limit DRBD from operating on more than two nodes.

Moreover, DRBD-9.x supports multiple peer nodes, meaning one peer might be a synchronous mirror in the local data-center while another secondary might be an asynchronous mirror in a remote site.

Again, the passive server only becomes functional when the primary one fails. When such a failure occurs, Pacemaker immediately recognizes the mishap and shifts services to the secondary server. This shifting process, nevertheless, is optional – it can be either manual or automatic. For users who prefer manual failover, an administrator is required to authorize the system to shift to the passive server when the primary one fails.

LINSTOR

In larger IT infrastructures, cluster management software is state of the art. This is why LINBIT developed LINSTOR, software on top of DRBD. DRBD itself is a perfect tool for replicating and accessing your data, especially when it comes to performance. LINSTOR makes configuring DRBD on a system with more than a few nodes an easy task. LINSTOR manages DRBD and gives you the ability to set it up on a large system.

LINSTOR uses a controller service for managing your cluster and a satellite service, which runs on every node, for deploying DRBD. The controller can be accessed from every node and enables you to monitor and configure your setup quickly. It can be controlled over REST from the outside and provides a very clear CLI. Furthermore, the LINSTOR REST API gives you the ability to use LINSTOR volumes in Kubernetes, Proxmox VE, OpenNebula and OpenStack.
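
To give a small taste of how little is needed to drive LINSTOR programmatically, here is a minimal sketch using the python-linstor high-level resource API (described in more detail in a later post in this collection). It assumes an already configured LINSTOR cluster with a reachable controller; the resource name and size are just examples.

import socket
import linstor

# Minimal sketch, assuming an already configured LINSTOR cluster.
rsc = linstor.Resource('demo')
rsc.volumes[0] = linstor.Volume('10 GB')   # define a 10 GB volume
rsc.placement.redundancy = 2               # keep two replicas of the data
rsc.autoplace()                            # let LINSTOR pick suitable satellites
rsc.activate(socket.gethostname())         # make the device usable on this node
print(rsc.volumes[0].device_path)          # the DRBD device to put a filesystem on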

LINSTOR has a feature to keep the system at work: there is a separation of control plane and data plane. If you want to upgrade or maintain LINSTOR, there is no downtime for the volumes. In comparison with Ceph, DRBD and LINSTOR are easier to troubleshoot, recover, repair, and debug, and it is easier to intervene manually if required, mainly due to their simplicity. For system administrators, the better maintainability and a less complex environment can be crucial. The higher availability also results in better reliability. For instance, DRBD can be started and stopped manually even if LINSTOR is offline, and for recovery purposes the data can even be accessed without DRBD installed (simply mount the backing storage). Compared to that, trying to find any of your data on disks managed by Ceph can be quite a challenge if your Ceph system is down.

In summary, if you’re looking for increased performance, fast configuration, and filesystem-based storage for your applications, use LINSTOR and DRBD. If you’re looking to run LINSTOR with HA, however, you must use third-party software such as Pacemaker.

 

CEPH

CEPH is open source software intended to provide highly scalable object, block, and file-based storage in a unified system.

CEPH consists of a RADOS cluster and its interfaces. The RADOS cluster is a system with services for monitoring and storing data across many nodes. CEPH/RADOS is an object storage cluster with no single point of failure. This is achieved by using an algorithm which cuts the data into blocks and spreads them across the RADOS cluster using self-managing services. The CRUSH algorithm is used to spread the data on upload and to put the blocks together when an object is requested. CEPH is able to use simple data replication as well as erasure coding for those striped blocks.

On top of the RADOS cluster, LIBRADOS is used to upload data to or request data from the cluster. CEPH uses LIBRADOS for its interfaces CEPHFS, RBD and RADOSGW.

CEPHFS gives you the ability to create a filesystem on a host where the data is stored in the CEPH cluster. Additionally, for using CEPHFS, CEPH needs metadata servers which manage the metadata and balance the load for requests among each other.

RBD or RADOS block device is used for creating virtual block devices on hosts with a CEPH cluster, managing and storing the data in the background. Since RBD is built on LIBRADOS, RBD inherits LIBRADOS’s abilities, including read only snapshots and reverts to snapshot. By striping images across the cluster, CEPH improves read access performance for large block device images. The block device can be virtualized, providing block storage to virtual machines in virtualization platforms such as Apache CloudStack, OpenStack, OpenNebula, Ganeti, and Proxmox Virtual Environment.

RADOSGW is the REST-API for communicating with CEPH/RADOS when uploading and requesting data from the cluster.

In general, CEPH is an object storage cluster with the advantage that you do not have to worry about failing nodes or storage drives, because CEPH recognizes failing devices and instantly replicates the data to another disk where it can be accessed. This also leads to heavy network load when devices fail.

Striping data comes with a disadvantage in that it is not possible to access the data on a storage drive by mounting it somewhere else or without a working CEPH cluster.

In conclusion, CEPH is the right solution if you are looking for object storage in your infrastructure. Due to its complexity, you have to expect less performance in comparison to DRBD, which is limited only by your network speed.

Daniel Kaltenböck on Email
Daniel Kaltenböck
Software Engineer at LINBIT HA Solutions GmbH
Daniel Kaltenböck studied technical computer science at the Vienna University of Technology. He is a software engineer by heart with a special focus on software defined storage.

Cheap Votes: DRBD Diskless Quorum

One of the most important considerations when implementing clustered systems is ensuring that a cluster remains cohesive and stable given unexpected conditions. DRBD already has fencing mechanisms and even a system of quorum, which is now capable of using a diskless arbitrator to break ties without requiring additional storage beyond that of two nodes.

Quorum and Fencing With a Healthy Dose of Reality

DRBD’s quorum implementation allows resources to vote on availability, taking into account connection state and disk state. While a DRBD cluster without quorum will allow promotion and writes on any node with “UpToDate” data, DRBD with quorum enabled adds the requirement that this node must also be in contact with either a majority of healthy nodes in the cluster, or a statically defined minimum number of nodes. This requires at least three nodes and works best with odd numbers of nodes. A DRBD cluster with quorum enabled cannot become split-brain.

Fencing, on the other hand, employs a mechanism to ensure node state by isolating or powering off a node in some way, so that unhealthy nodes can be guaranteed not to provide services (by virtue of being assuredly offline). While the use cases for fencing and quorum overlap to a large degree, fencing can automatically eject or recover misbehaving nodes, while quorum simply ensures that they cannot modify data.

It is possible to utilize scripts that are triggered in response to changes in quorum as a simple but effective fencing system via a “suicide” method — configuring a node to automatically reset or power itself off upon loss of quorum (accomplished via the “on-quorum-loss” handler in DRBD’s configuration). However, fully-fledged fencing methods via Pacemaker have much more logic behind them, can work even when the node to be fenced is entirely unresponsive, and make Pacemaker clusters “aware” of fencing actions.

The most important element to consider is that while both methods prevent split-brain conditions, quorum does not wholly and entirely replace out-of-band fencing. However, it comes extremely close, and in fact close enough to eschew Pacemaker-based fencing in many configurations in favor of quorum alone, where fencing via privileged APIs (as is common in clouds) or dedicated fencing hardware (such as network PDUs or IPMI cards) is not possible or not desired.

Arbitrators!

Before now, in order for a DRBD resource to have three votes across three nodes for quorum, it needed three replicas of data. This was cost prohibitive in some scenarios, so additional logic was added to allow a diskless “arbitrator” node that does not participate in replication. Thus, the diskless DRBD arbitrator was born.

The concept is fairly simple; rather than require a minimum of three replicas in a DRBD resource to enable quorum functionality, one can now use two replicas (or “data” nodes) with a third DRBD node in a permanently and intentionally diskless state as an “arbitrator” for breaking ties.

The same concepts of traditional DRBD quorum apply, with one significant exception: In a replica 2+A cluster, one node can be lost or disconnected without losing quorum — just like a replica 3 cluster. However, that arbitrator node cannot (on its own) participate in restoring quorum after it is lost.

The reason for this exception is simple: the arbitrator node has no disk. Without a disk, there is no way to independently determine whether data is valid, inconsistent, or related to the cluster at all, because there is no data on that node to compare replicas with. While an arbitrator node cannot restore quorum to a single other inquorate data node, two data nodes may establish or re-establish quorum with each other. This is highly effective, and handles the vast majority of quorum decisions at roughly 66% of the cost of a replica 3 cluster.

Arbitrator Nodes in Action

I will not abide this level of grandstanding without a demonstration of this ability (and hopefully some revealing use cases), so below are some brief test results from a replica 2+A geo cluster. Behold:

root@geo-nfs-a:~# drbdadm status export-able
export-able role:Primary
  disk:UpToDate
  geo-nfs-b role:Secondary
    peer-disk:UpToDate
  geo-nfs-c role:Secondary
    peer-disk:Diskless

As you can see, everything is happy. All of these nodes are connected and up to date. Nodes “geo-nfs-a” and “geo-nfs-b” are data nodes with disks. The node “geo-nfs-c” is a diskless DRBD arbitrator as well as a Booth arbitrator, and quorum has been enabled in this geo cluster (though that’s not reflected in this output). The datapath of geo clusters can be tricky to manage, since geo clusters often operate outside the scope of rapid decision-making mechanisms and even more often don’t have a method of adequately fencing “sites”. Using DRBD quorum in this case allows split-brain to be entirely prevented globally, rather than depending on several disconnected cluster controllers to manage things. This is much more stable, but requiring three sites with at least one full data replica each is very bandwidth-intensive as well as expensive. This is a perfect fit for an arbitrator node.

If we take one of the two data nodes offline, the cluster will still run. We’re still in contact with the arbitrator, and as long as we don’t lose that contact, quorum will be held:

root@geo-nfs-a:~# drbdadm status export-able
export-able role:Primary
  disk:UpToDate
  geo-nfs-b connection:Connecting
  geo-nfs-c role:Secondary
    peer-disk:Diskless

So let’s make it unhappy. If we take the majority of nodes offline this cluster will freeze, suspending I/O and protecting data from split-brain:

root@geo-nfs-a:~# drbdadm status export-able
export-able role:Primary suspended:quorum
  disk:UpToDate quorum:no blocked:upper
  geo-nfs-b connection:Connecting
  geo-nfs-c connection:Connecting

Reconnecting only the arbitrator node will not result in a quorate cluster, as that arbitrator has no way of knowing whether that data node is actually valid:

root@geo-nfs-a:~# drbdadm status export-able
export-able role:Primary suspended:quorum
  disk:UpToDate quorum:no blocked:upper
  geo-nfs-b connection:Connecting
  geo-nfs-c role:Secondary
    peer-disk:Diskless

Connecting the peer data node will result in I/O resuming even if the arbitrator is still not functioning:

root@geo-nfs-a:~# drbdadm status export-able
export-able role:Primary
  disk:UpToDate
  geo-nfs-b-0 role:Secondary
    peer-disk:UpToDate
  geo-nfs-c connection:Connecting

Conclusion

I was able to use a Booth arbitrator node as a DRBD arbitrator node as well, managing the cluster application state as well as securing the datapath against corruption with almost zero bandwidth usage beyond that of a 2N system. This is clearly a potent use case and could not be simpler.

This new quorum mechanism could be applied identically to local high availability clusters, allowing reliable quorate systems to be established using a very low power third node. This can help to cheaply circumvent environmental problems that prevent adequate fencing, such as generic platform-agnostic deployment models, security-restricted environments, and even total lack of out-of-band fencing mechanisms (such as some public clouds or specialized hardware).

For posterity, the following DRBD configuration was used to accomplish this. Keep in mind, this was a geo cluster, so it’s using asynchronous replication (protocol A). Protocol C would be used for synchronous local replication:

# /etc/drbd.conf
global {
    usage-count yes;
}

common {
    options {
           auto-promote     yes;
           quorum           majority;
    }
}

resource export-able {
    volume 0 {
           device           minor 0;
           disk             /dev/drbdpool/export-able;
           meta-disk        internal;
    }
    on geo-nfs-a {
           node-id 0;
           address          ipv4 10.1.0.100:7000;
    }
    on geo-nfs-b {
           node-id 1;
           address          ipv4 10.2.0.100:7000;
    }
    on geo-nfs-c {
           node-id 2;
           volume 0 {
                   device       minor 0;
                   disk         none;
           }
           address          ipv4 10.3.0.100:7000;
    }
    connection-mesh {
           hosts geo-nfs-a geo-nfs-b geo-nfs-c;
           net {
                   protocol A;
           }
    }
}

David Hay on Linkedin
David Hay
Cluster Daemon at LINBIT
A long-time Linux system engineer, David Hay finds FOSS solutions to global problems as a Cluster Engineer at LINBIT. David started out with open source software back in the Linux 2.4 days, since then having planned and implemented countless clustered systems, leveraging HA and cloud technologies to great effect. When not liberating the enterprise world with free and open software, he spends his time tinkering with electronics and metalworking.

Key/Value Store in LINSTOR

Recently we introduced a Key/Value store in LINSTOR and exposed it in a developer-friendly way in the Python API (python-linstor). The first question is why one would want such a Key/Value store in LINSTOR when there are many high-performance implementations such as etcd. The request for a K/V store was mainly driven by LINSTOR plugin developers. For example, many plugins need to store some kind of metadata, like a description for a resource. Existing, non-LINSTOR plugins sometimes store such information in a local JSON file or in a file per resource. This is clumsy on one hand, and on the other hand, in a distributed system like DRBD/LINSTOR, the data needs to be available on all nodes.

In LINSTOR a K/V store has a unique name (e.g., one per plugin) and it can store up to 510 bytes for a key, and 4096 bytes for the value. The implementation in python-linstor provides an interface that mimics a Python3 dictionary. In addition to the discussed unique name, the K/V store as implemented in the Python library also provides so called namespaces. One can think of a namespace as a UNIX directory structure where components of a path (i.e., the namespace) are separated by a /. In the following we show an example using the Python library:

import linstor
kv = linstor.KV('myKV', namespace='/foo/bar/')
kv['key'] = 'val'
list(kv.items()) -> [('key', 'val')]
kv.namespace = '/'
list(kv.items()) -> [('/foo/bar/key', 'val')]
kv['foo/baz/key'] = 'valbaz'
kv.namespace = '/foo/bar'
list(kv.items()) -> [('key', 'val')] # keys in /foo/baz not visible

Key/Value Store makes life easier

Developers already familiar with LINSTOR details might know there is a concept that sounds similar to what the K/V store can do: the so-called “AUX props”. One can attach metadata to basically every LINSTOR object. While they sound similar, there are noteworthy differences:

• An AUX prop is tied to the corresponding object. When the object is gone, the metadata is gone. This might be desired and can be an advantage.

• The K/V store exists as long as the LINSTOR cluster exists. Data is not attached to another LINSTOR object. Depending on the situation this might be an advantage compared to a plain AUX property.

• The K/V store has a much nicer interface. It just behaves like a Python dictionary.

• The K/V store and its namespace implementation make it a lot easier to store hierarchical data.

• Searching AUX props can be difficult: for example, to find a specific AUX prop set on a volume definition, one would have to iterate over the AUX props of every volume definition.

All in all the K/V store makes the life of a plugin developer much easier. BTW: The text of this blog post easily fits into a single K/V pair 😀
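
As a small illustration of how a plugin might use this, the following sketch stores a description per resource in its own namespace. The store name 'myplugin', the namespace and the keys are made up for the example; only the linstor.KV interface shown above is used, and the controller is assumed to be reachable with the library defaults.

import linstor

# Hypothetical plugin metadata store; names and keys are examples only.
kv = linstor.KV('myplugin', namespace='/resources/demo/')
kv['description'] = 'database volume for the demo VM'
kv['owner'] = 'openstack'

# Any other node running the plugin can read the same metadata back
# from the LINSTOR controller:
kv = linstor.KV('myplugin', namespace='/resources/demo/')
print(kv['description'])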

Roland Kammerer
Software Engineer at Linbit
Roland Kammerer studied technical computer science at the Vienna University of Technology and graduated with distinction. Currently, he is a PhD candidate with a research focus on time-triggered realtime-systems and works for LINBIT in the DRBD development team.

Containerize LINSTOR

LINBIT and its Software-Defined Storage (SDS) solution LINSTOR have provided integration with Linux containers for quite some time. These range from a Docker volume plugin to a Flexvolume plugin and, recently, a CSI plugin for Kubernetes. While we have always provided excellent integration with the container world, most of our software itself was not available as a container/base image. Containerizing our services is a non-trivial task. As you probably know, the core of the DRBD software consists of a Linux kernel module and user space utilities that interact with this kernel module via netlink. Additionally, our software needs to create LVM devices and DRBD block devices within a container. These tasks are interesting and challenging to put into containers. For this article, we assume 3 nodes: one node that acts as a LINSTOR controller and two that act as satellites. We tested this with recent CentOS 7 machines and a current version of Docker.

Prerequisites

In this article, we assume access to our Docker registry hosted on drbd.io. On all hosts you should run the following commands:

docker login drbd.io
Username: YourUserName
Password: YourPassword
Login Succeeded

Installing the DRBD kernel modules

We need the DRBD kernel module and its dependencies on the LINSTOR satellites (the controller does not need access to DRBD). For that we provide a solution for the most common platforms, namely CentOS 7/RHEL 7 and Ubuntu Bionic.

docker run --privileged -it --rm \
  -v /lib/modules:/lib/modules drbd.io/drbd9:rhel7
DRBD modul sucessfully loaded 

What this does is check which kernel is actually running on the host, then find the most appropriate package in the container and install it. We ship the same, unmodified rpm/deb packages in the container as we provide in our customer repositories. If you are using Ubuntu Bionic, you should use the drbd.io/drbd9:bionic container.

Running a LINSTOR controller

docker run -d --name=linstor-controller \
  -p 3376:3376 -p 3377:3377 drbd.io/linstor-controller

The controller does not have any special requirements; it just needs to be accessible to the client via TCP/IP. Please note that in this configuration the controller’s database is not persisted. One possibility is to bind-mount the directory used for the controller’s database by adding
-v /some/dir:/var/lib/linstor to the docker run command.

Running a LINSTOR satellite

docker run -d --name=linstor-satellite --net=host \
 --privileged drbd.io/linstor-satellite 

The satellite is the component that creates the actual block devices: on one hand the backing devices (usually LVM), and on the other the actual DRBD block devices. Therefore this container needs access to /dev, and it needs to share the host networking. Host networking is required for the communication between drbdsetup and the actual kernel module.

Configuring the Cluster

We have to set up LINSTOR as usual, which fortunately, is an easy task and has to be done only once. In the spirit of this blog post, let’s use a containerized LINSTOR client as well. As the client obviously has to talk to the controller, we need to tell the client in the container where to find the controller. This is done by setting the environment variable LS_CONTROLLERS.

docker run -it --rm -e LS_CONTROLLERS=Controller \ 
  drbd.io/linstor-client interactive
  ...
- volume-definition (vd)
LINSTOR ==> node create Satellite1 172.42.42.10
LINSTOR ==> node create Satellite2 172.42.42.20
LINSTOR ==> storage-pool-definition create drbdpool
LINSTOR ==> storage-pool create lvm Satellite1 drbdpool drbdpool
LINSTOR ==> storage-pool create lvm Satellite2 drbdpool drbdpool 

Creating a replicated DRBD resource

So far we loaded the kernel module on the satellites, started the controller and satellite containers and configured the LINSTOR cluster. Now it is time to actually create resources.

docker run -it --rm -e LS_CONTROLLERS=Controller \ 
  drbd.io/linstor-client interactive
  ... 
- volume-definition (vd)
 LINSTOR ==> resource-definition create demo
 LINSTOR ==> volume-definition create demo 1G
 LINSTOR ==> resource create demo --storage-pool drbdpool --auto-place 2 

If you have drbd-utils installed on the host, you can now see the DRBD resource as usual via drbdsetup status. But we can also use a container to do that. On one of the satellites you can run a throw-away linstor-satellite container which contains drbd-utils:

docker run -it --rm --net=host --privileged \
 --entrypoint=/bin/bash drbd.io/linstor-satellite
$ drbdsetup status
$ lvs

Note that by default you will not see the symbolic links for the backing devices created by LVM/udev in the LINSTOR satellite container. That is expected. In the container you will see something like /dev/drbdpool/demo_00000, while on the host you will only see /dev/dm-X, and lvs will not show the LVs. If you really want to see the LVs on the host, you could execute lvscan -a --cache, but there is no actual reason for that. One might also map the lvmetad socket to the container.

Summary

As you can see, LINBIT’s container story is now complete. It is now possible to deploy the whole stack via containers. This ranges from the lowest level of providing the kernel modules to the highest level of LINSTOR SDS including the client, the controller, and satellites.

Roland Kammerer
Software Engineer at Linbit
Roland Kammerer studied technical computer science at the Vienna University of Technology and graduated with distinction. Currently, he is a PhD candidate with a research focus on time-triggered realtime-systems and works for LINBIT in the DRBD development team.

A new star is born: LIN:BEAT – The Band

LINBIT is opening up a new division to specifically address our community’s desire to turn the music up to 11! As you all know, LINBIT is famous for its DRBD, Software-Defined Storage (SDS) and Disaster Recovery (DR) solutions. In a paradigm-shifting turn of events by management, LINBIT has decided to expand into the music industry. Since there is so much business potential in playing live concerts, LINBIT has transformed five of their employees into LIN:BEAT – The Band. These concerts are of course made highly-available by utilizing LINBIT’s own DRBD software.

[Photo: LIN:BEAT – The Band]

The Band will be touring all Cloud and Linux events around the globe in 2019. Band members use self-written code to produce their unique sound design, reminiscent of drum and bass and heavily influenced by folk punk. The urge to portray all the advantages of DRBD and LINSTOR is so strong they had to send a message to the world: Their songs tell the world about the ups and downs of administrators who handle big storage clusters. LIN:BEAT offers a variety of styles in their musical oeuvre: While “LINSTOR” is a heavy-driven rock song and “Snapshots” speaks to funk-loving people, even “Disaster Recovery,” a love ballad, made it into their repertoire. Lead singer, Phil Reisner, sings sotto voce about his lost love — a RAID rack called “SDS”. Reisner told the reporters, “Administrators are such underrated people. This is sadly unfortunate. We strive to give all administrators a voice. Even if it’s a musical one!”

Crowds will be jumping up and down in excitement when LIN:BEAT comes to town! Be there and code fair!

An Excerpt of the song “My first love is DRBD” written by Phil Reisner:

My first love is DRBD,

there has never been a fee

instead it serves proudly as open source

your replication has changed the course

It’s the crucial key for Linux’s destiny


LINSTOR grows beyond DRBD

For quite some time, LINSTOR has been able to use NVMe-oF storage targets via the Swordfish API. This was expressed in LINSTOR as a resource definition that contains a single resource with one backing disk (that is the NVMe-oF target) and one diskless resource (that is the NVMe-oF initiator).

Layers in the storage stack

In the last few months the team has been busy making LINSTOR more generic, adding support for resource templates. A resource template describes a storage stack in terms of layers for specific resources/volumes. Here are some examples of such storage stacks:

    • DRBD on top of logical volumes (LVM)
    • DRBD on top of zVols (ZFS)
    • Swordfish initiator & target on top of logical volumes (LVM)
    • DRBD on top of LUKS on top of logical volumes (LVM)
    • LVM only

The team came up with an elegant approach that introduces these additional resource templates in ways that allow existing LINSTOR configurations to keep their semantics as the default resource templates.

With this decoupling, we no longer need to have DRBD installed on LINSTOR clusters that do not require the replication functions of DRBD.

What does that mean for DRBD?

The interests of LINBIT’s customers vary widely. Some want to use LINSTOR without DRBD – which is now supported. A very prominent example of this is Intel, who uses LINSTOR in its Rack Scale Design effort to connect storage nodes and compute nodes with NVMe-oF. In this example, the storage is disaggregated from the other nodes.

Other customers see converged architectures as a better fit. For converged scenarios, DRBD has many advantages over a pure data access protocol such as NVMe-oF. LINSTOR is built from the ground up to manage DRBD, therefore, the need for DRBD support will remain.

Linux-native NVMe-oF and NVMe/TCP

SNIA’s Swordfish has clear benefits as a standard for managing storage targets: it allows for optimized storage target implementations, as well as a hardware-accelerated data path and a non-Linux control path.

Due to the fact that Swordfish is an extension of Redfish, which needs to be implemented in the Baseboard Management Controller (BMC), we have decided to extend LINSTOR’s driver set to configure NVMe-oF target and initiator software. We do this by utilizing existing tools found within the Linux operating system, eliminating the need for a Swordfish software stack.

Summary

LINSTOR now supports configurations without DRBD. It is now a unified storage orchestrator for replicated and non-replicated storage.

Philipp Reisner on Linkedin
Philipp Reisner
Philipp Reisner is founder and CEO of LINBIT in Vienna/Austria. His professional career has been dominated by developing DRBD, a storage replication for Linux. Today he leads a company of about 30 employees with locations in Vienna/Austria and Portland/Oregon.


Speed Up! NVMe-oF for LINSTOR

What is NVMe?

The storage world has gained a number of new terms in the last few years. Let’s start with NVMe. The abbreviation stands for Non-Volatile Memory Express, which isn’t very self-explanatory. It all began a few years back when NAND flash started to make major inroads into the storage industry, and the new storage medium needed to be accessed through existing interfaces like SATA and Serial Attached SCSI (SAS).

Back at that time, FusionIO created a NAND flash-based SSD that was directly plugged into the PCIe slot of a server. This eliminated the bottleneck of the ATA or SCSI command sets and the interfaces coming from a time of rotating storage media.

The FusionIO products shipped with proprietary drivers, and the industry set forth in creating an open standard that suits the performance of NAND flash. One of the organizations where the players of the industry can meet, align and create standards is the Storage Networking Industry Association (SNIA).

The first NVMe standard was published in 2013, and it describes a PCIe-based interface and command set to access fast storage. This can be thought of as a cleaned up version of the ATA or SCSI commands plus a PCIe interface.

What is NVMe-oF and NVMe/TCP?

Similar to what iSCSI is to SCSI, NVMe-oF and NVMe/TCP are standards that describe how to send NVMe commands over networks. NVMe-oF requires an RDMA-capable network (like InfiniBand or RoCE), and NVMe/TCP works on every network that can carry IP traffic.

There are two terms to be aware of: 1) the initiator is where the applications that want to access the data run. Linux comes with a built-in initiator; likewise, other OSes already have initiators or will have them soon.

And, 2) the target is where the data is stored. Linux comes with a software target built into the kernel. It might not be obvious that any Linux block device can be made available as an NVMe-oF target using the Linux target software. It is not limited to NVMe devices.

What does this have to do with Swordfish?

While the iSCSI or NVMe-oF standards describe how READ, WRITE and other operations on block data are shipped from the initiator to the target, they do not describe how a target (volume) gets created or configured. For too many years, this was the realm of vendor-specific APIs and GUIs.

SNIA’s Swordfish standard describes how to manage storage targets and make it accessible as NVMe-oF targets. It is a REST API with JSON data. As such, it is easy to understand and embrace.

The major drawback of Swordfish is that it is defined as an extension of Redfish. Redfish is a standard for managing servers over the network. It can be thought of as a modernized IPMI. As such, Redfish will usually be implemented on a Baseboard Management Controller (BMC). While Redfish has its advantages over IPMI, it does not provide something completely new.

On the other hand, Swordfish is something that was not there before, but as it is an extension of Redfish, implementing it usually means that the machine needs to have a Redfish-enabled BMC, which may hinder or slow down the adoption of Swordfish.

LINSTOR

Since version 0.7, LINSTOR has been capable of working with storage provided by Swordfish-compliant storage targets, as well as their initiator counterparts.

Summary

LINSTOR has gained the capability of managing storage on Swordfish/NVMe-oF targets, in addition to working with DRBD and direct-attached storage on Linux servers.

Philipp Reisner on Linkedin
Philipp Reisner
Philipp Reisner is founder and CEO of LINBIT in Vienna/Austria. His professional career has been dominated by developing DRBD, a storage replication for Linux. Today he leads a company of about 30 employees with locations in Vienna/Austria and Portland/Oregon.

LINSTOR High Level Resource API

High Level Resource API – The simplicity of creating replicated volumes

In this blog post, we present one of our recent extensions to the LINSTOR ecosystem: A high-level, user-friendly Python API that allows simple DRBD resource management via LINSTOR.

Background: So far, LINSTOR components have communicated by the following means: via Protocol Buffers, or via the Python API that is used in the linstor command line client. Protocol Buffers are a great way to transport serialized structured data between LINSTOR components, but by themselves they don’t provide the necessary abstraction for developers.

That is not the job of Protocol Buffers. Since the early days we have split the command line client into the client logic (parsing configuration files, parsing command line arguments…) and a Python library (python-linstor). This Python library provides all the bits and pieces to interact with LINSTOR. For example, it provides a MultiLinstor class that handles TCP/IP communication with the LINSTOR controller. Additionally, it allows all the operations that are possible with LINSTOR (e.g. creating nodes, creating storage pools…). For perfectly valid reasons this API is very low level and pretty close to the actual Protocol Buffer messages sent to the LINSTOR controller.
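
To give a feel for what “low level” means here, the following is a rough sketch of creating a replicated resource through that lower-level library. The controller URI and the resource name are examples, and the method names are recalled from the python-linstor client code and may differ between versions, so consider this an illustration rather than a reference.

import linstor

# Low-level sketch; method names and signatures may vary between versions.
lin = linstor.Linstor('linstor://localhost')   # assumed controller URI
lin.connect()
for reply in (lin.resource_dfn_create('demo-lowlevel'),              # resource definition
              lin.volume_dfn_create('demo-lowlevel', 10 * 1024**2),  # size in KiB (10 GiB)
              lin.resource_auto_place('demo-lowlevel', 2)):          # place two replicas
    print(reply)   # each reply mirrors the Protocol Buffer response from the controller
lin.disconnect()

Compare this with the handful of high-level calls in the demo below, which hide the individual definition and placement steps behind a single Resource object.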

By developing more and more plugins to integrate LINSTOR into other projects like OpenStack, OpenNebula, Docker Volumes, and many more, we saw that there is need for a higher level abstraction.

Finding the Right Abstraction

The first dimension of abstraction is to abstract away from LINSTOR internals. For example, it makes perfect sense that recreating an existing resource is an error on a low level (think of it as EEXIST). On a higher level, depending on the actual object, trying to recreate an object might be perfectly fine, and one wants to get back the existing object (i.e., idempotency).

The second dimension of abstraction is from DRBD and LINSTOR as a whole. Developers dealing with storage already have a good knowledge of concepts like nodes, storage pools, resources, volumes, placement policies… This is the part where we can make LINSTOR and DRBD accessible to new developers.

The third goal was to only provide a set of objects that are important in the context of the user/developer. This, for example, means that we can assume that the LINSTOR cluster is already set up, so we do not need to provide a high-level API to add nodes. For the higher-level API we can focus on [LINSTOR] resources. This allows us to satisfy the KISS (keep it simple, stupid) principle. A fourth goal was to introduce new, higher-level concepts like placement policies. Placement policies/templates are concepts currently being developed in core LINSTOR, but we can already provide the basics on a higher level.

Demo Time

We start by creating a 10 GB replicated LINSTOR/DRBD volume in a 3-node cluster. We want the volume to be 2 times redundant. Then we increase the size of the volume to 20 GB.

$ python
>>> import linstor
>>> foo = linstor.Resource('foo')
>>> foo.volumes[0] = linstor.Volume("10 GB")

There are multiple ways to specify the size.

>>> foo.placement.redundancy = 2
>>> foo.autoplace()
>>> foo.volumes[0].size += 10 * (2 ** 30)

This line is enough to resize a replicated volume cluster wide.

We needed 5 lines of code to create a replicated DRBD volume in a cluster! Let that sink in for a moment and compare it to the steps that were necessary without LINSTOR: creating backing devices on all nodes, writing and synchronizing DRBD res(ource) files, creating meta-data on all nodes, drbdadm up the resource, and forcing one node to the Primary role to start the initial sync.

For the next step we assume that the volume is replicated and that we are a storage plugin developer. Our goal is to make sure the volume is accessible on every node because the block device should be used in a VM. So, A) make sure we can access the block device, and B) find out what the name of the block device of the first volume actually is:

>>> foo.activate(socket.gethostname())

>>> print(foo.volumes[0].device_path)

The activate method is one of those methods that shows how we intend the abstraction to work. Note that we autoplaced the resource 2 times in a 3-node cluster, so LINSTOR chose the nodes that fit best. But now we want the resource to be accessible on every node without increasing the redundancy to 3 (because that would need additional storage, and 2 times replicated data is good enough).

Diskless clients

Fortunately DRBD has us covered as it has the concept of diskless clients. These nodes provide a local block device as usual, but they read and write data from/to their peers only over the network (i.e. no local storage). Creating this diskless assignment is not necessary if the node was already part of the replication in the first place (then it already has access to the data locally).

This is exactly what activate does: If the node can already access the data – fine, if not, create a diskless assignment. Now assume we are done and we do not need access to the device anymore. We want to do some cleanup because we do not need a diskless assignment:

>>> foo.deactivate(socket.gethostname()) 

The semantics of this method are to remove the assignment if it is diskless (as it does not contribute to actual redundancy), but if it is a node that stores actual data, deactivate does nothing and keeps the data as redundant as it was. This is only a very small subset of the functionality the high-level API provides; there is a lot more to know, like creating snapshots, converting diskless assignments to diskful ones and vice versa, or managing DRBD Proxy. For more information check the online documentation.

If you want to go deeper into the LINSTOR universe, please visit our YouTube channel.

Roland Kammerer
Software Engineer at Linbit
Roland Kammerer studied technical computer science at the Vienna University of Technology and graduated with distinction. Currently, he is a PhD candidate with a research focus on time-triggered realtime-systems and works for LINBIT in the DRBD development team.