
DRBD and the sync-rate controller, part 3

This is an update to our previous two blog posts here and here. The goal of this post is to simplify even further the steps needed to tune the sync-rate controller. If you compare this post against the previous two posts, you’ll see that I omit a few options and simply pick some arbitrary starting values that work well in most deployments we’ve encountered.

I would also like to point out again that this is all about initial device synchronization and recovery resynchronization. It has no effect on the replication speed during normal operation when everything is in a healthy state.


Purpose of the sync rate controller

The dynamic sync-rate controller for DRBD was introduced way back in version 8.3.9 as a way to slow down DRBD resynchronization. The idea is that if you have a write-intensive application running atop the DRBD device, it may already be close to filling up your I/O bandwidth. The dynamic rate limiter makes sure that a recovery resync does not compete for bandwidth with the ongoing write replication. To ensure that the resync does not compete with application IO, the defaults lean towards the conservative side.

If the defaults seem too slow for your use case, you can speed things up with a little bit of tuning in the DRBD configuration.

Tuning the sync rate controller

It is nearly impossible for DRBD to know just how much activity your storage and network backend can handle. It is fairly easy for DRBD to know how much activity it generates itself, which is why we tune how much network activity we allow DRBD to generate.

  • Set c-max-rate to 100% of (or slightly more than) what your hardware can handle.
    • For example: if you know your network is capable of 10Gb/s, but your disk throughput is only 800MiB/s, then set this value to 800M.
  • Increase max-buffers to 40k.
    • 40k is usually a good starting point, but we’ve seen good results with anywhere between 20k to 80k.
  • Set c-fill-target to 1M.
    • Just trust us on this, and simply set it to ‘1M’. A sample configuration combining all three settings follows this list.
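
Putting those three settings together, a minimal configuration sketch for the net and disk sections might look like this — the 800M value assumes the 800MiB/s disk example above, so substitute whatever matches your hardware:

net {
    max-buffers    40k;
}
disk {
    c-max-rate     800M;  # roughly what the slowest part of the IO path can sustain
    c-fill-target  1M;
}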

This should be enough to get the resync rate going well beyond the defaults. Many people tune the “c-*” sync rate controller settings but never increase the max-buffers value. This may be partly our fault as we never mentioned it in the previous blog post, which is one reason I am revisiting this topic today.

Tuning the sync rate controller even further

Obviously, there is even further tuning we can do. Some of these settings, if tuned improperly, may negatively impact the performance of applications writing to the DRBD device, so use caution. I would suggest starting with smaller values and working your way up if performing this tuning on production systems.

  • Set the resync-rate to ⅓ of the c-max-rate.
    • With the dynamic resync-rate controller, this value is only used as a starting point. Changing it has only a slight effect, but it helps the resync ramp up more quickly.
  • Increase the c-min-rate to ⅓ of the c-max-rate.
    • It is usually advised to leave this value alone as the idea behind the dynamic sync rate controller is to “step aside” and allow application IO to take priority. If you really want to ensure things always move along at a minimum speed, then feel free to tune this a bit. As I mentioned earlier, you may want to start with a lower value and work up if doing this on a production system.
  • Set sndbuf-size and rcvbuf-size to 10M.
    • These are generally auto-tuned by the kernel, but cranking them up may help move the recovery resync along. There is also a possibility that this will lead to bufferbloat, so tune these with caution. Again, on a production system, start with a value just a little over 4M and increase it slowly while observing the systems. A sketch combining these additional settings follows this list.
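
As a sketch only, and again assuming the 800M c-max-rate example from the first list (so roughly 260M for the one-third values):

disk {
    resync-rate  260M;  # starting point only; about 1/3 of c-max-rate
    c-min-rate   260M;  # optional; see the caution above before raising this
}
net {
    sndbuf-size  10M;   # on production systems, start just above 4M and increase slowly
    rcvbuf-size  10M;
}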

It is our hope that the information above will prove useful to some of our users and help possibly clear up some confusion regarding the resync tunables we have discussed in the past. As always, please feel free to drop us a comment below if you have any questions or anything you’d like to share.

Devin Vance
First introduced to Linux back in 1996, and using Linux almost exclusively by 2005, Devin has years of Linux administration and systems engineering under his belt. He has been deploying and improving clusters with LINBIT since 2011. When not at the keyboard, you can usually find Devin wrenching on an American motorcycle or down at one of the local bowling alleys.

What’s the Difference Between Off-Site Data Backup and Disaster Recovery (DR)?

Many businesses make the mistake of confusing “Disaster Recovery” with “off-site data backup.” Confusing the two may lead to paying a high price in the form of data loss and downtime when a disaster hits. Off-site backups are a necessary component of business continuity in the face of disaster for nearly any organization, but DR goes a step beyond – addressing service availability, not only data durability.

Off-Site Data Backup

Data backup describes capturing a point-in-time set of data to a local device, or to a device at another location. There are some online services that provide general data backups. Tape drives, USB drives, and network attached storage are able to save your data as well. One can typically access these backups whenever they wish, optionally restoring the data wherever they want. Usually, the backups are taken repeatedly at regular intervals to keep up with data changes.

However, protecting your business from downtime is also very important. For example, if a critical server fails and you have only a backup, services cannot be restored quickly. To get back up and running, the server would need to be replaced, the data and software re-installed, and finally additional configuration applied. This could take anywhere from a few hours to a few days.

It takes a long time to restore data from off-site backups — even if employing high-throughput cloud services like AWS S3, it can still take an inordinate amount of time to copy the entire dataset over most WAN connections. Can you afford to lose days’ worth of business presence?

Disaster Recovery

Backing up your data is always imperative. The question, then, is whether you also need a higher form of business continuity via Disaster Recovery. That depends on the availability requirements of your business: if you need your services and business functionality up and running as quickly as possible after a significant technical failure, or need to tolerate the complete failure of a location, then you need Disaster Recovery. Disaster Recovery protects your company not only from significant data loss, but also from extended downtime.

It’s difficult to describe Disaster Recovery without first defining business continuity: a collection of policies, technologies, and practices which can be applied to survive significant destruction or unavailability of resources — often at a geographically significant scale. Disaster Recovery focuses on the technical aspects of business continuity.

Disaster Recovery mitigates the effects of significant loss from events such as natural disaster, a power grid failure, or human error.

In 2003, a high-voltage power line in Ohio brushed against some overgrown trees and shut down. This sort of problem usually triggers an alarm, but the alarm system had failed. As system operators were trying to diagnose the problem, three other lines also brushed into trees and switched off. The nearby lines shouldering the extra burden were overtaxed and caused a cascade of failures throughout southeast Canada and eight northeastern states.

This event – concluded to be the result of human error and equipment failures – caused 50 million people to lose power for up to two days and is recorded as the largest blackout in North American history. It cost roughly $6 billion in economic losses.

Although this was a rare event, and by no means the only kind of event that can cause data loss, current statistics indicate a blackout of this level will occur roughly every 25 years. Companies using DR were able to cut services over to an already-established system and use the up-to-date data that had been actively replicated from the failed site.

You can learn more about Disaster Recovery (DR) in this ten minute whiteboard explanation here:

In short, Disaster Recovery enables your team of employees to get right back to work managing your customers’ data with very little downtime. DR keeps redundant sites as up-to-date as possible with each other, yet also available in a worst-case scenario.  

Overview

At some level, you need both data backup and Disaster Recovery. It’s important to understand what amount of downtime your business can tolerate. A few minutes? A few days? Are you prepared to suffer data losses? Significant data losses happen under many different circumstances. It’s up to you to define the amount of loss you can accept in a worst-case scenario. For the applications you find most critical, it might make the most sense to hold live replicas off-site, and even have a mechanism to automatically fail over your primary site to a secondary site to ensure continuity in the face of disaster.

Learn More

Would you like to hear a more technical explanation of Disaster Recovery and High Availability? Check out LINBIT’s Disaster Recovery page or our videos and other detailed pages. We’re always happy to provide a sanity check on your DR strategy. Contact us to learn how we can help.

Disaster Recovery (DR) Explained Video

High Availability (HA) Explained Video

LINBIT’s Disaster Recovery

LINBIT HA

Brian Hellman
Since 2008, Brian has been the Chief Operating Officer of LINBIT and the head of the LINBIT USA team, where he and his team have led continual double-digit growth over his tenure. Brian is committed to bringing High Availability, Disaster Recovery and Software-Defined Storage technologies to the Open Source community. Outside of LINBIT, Brian is a dedicated philanthropist through Oregon Freemasonry and The Shriners Hospital for Children.

Containerize LINSTOR

LINBIT and its Software-Defined Storage (SDS) solution LINSTOR have provided integration with Linux containers for quite some time. These integrations range from a Docker volume plugin, to a Flexvolume plugin, and recently, a CSI plugin for Kubernetes. While we have always provided excellent integration with the container world, most of our software itself was not available as a container/base image. Containerizing our services is a non-trivial task. As you probably know, the core of the DRBD software consists of a Linux kernel module and user space utilities that interact with this kernel module via netlink. Additionally, our software needs to create LVM devices and DRBD block devices from within a container. These tasks are interesting and challenging to put into containers. For this article, we assume 3 nodes: one node that acts as a LINSTOR controller, and two that act as satellites. We tested this with recent CentOS 7 machines and a current version of Docker.

Prerequisites

In this article, we assume access to our Docker registry hosted on drbd.io. On all hosts you should run the following commands:

docker login drbd.io
Username: YourUserName
Password: YourPassword
Login Succeeded

Installing the DRBD kernel modules

We need the DRBD kernel module and its dependencies on the LINSTOR satellites (the controller does not need access to DRBD). For that we provide a solution for the most common platforms, namely CentOS 7/RHEL 7 and Ubuntu Bionic.

docker run --privileged -it --rm \
  -v /lib/modules:/lib/modules drbd.io/drbd9:rhel7
DRBD module successfully loaded

This checks which kernel is actually running on the host, finds the most appropriate package shipped in the container, and installs it. We ship the same, unmodified rpm/deb packages in the container as we provide in our customer repositories. If you are using Ubuntu Bionic, you should use the drbd.io/drbd9:bionic container.

Running a LINSTOR controller

docker run -d --name=linstor-controller \
  -p 3376:3376 -p 3377:3377 drbd.io/linstor-controller

The controller does not have any special requirements; it just needs to be accessible to the client via TCP/IP. Please note that in this configuration the controller’s database is not persisted. One possibility is to bind-mount the directory used for the controller’s database by adding
-v /some/dir:/var/lib/linstor to the command.
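
For example, a complete invocation with a persisted database (using the /some/dir placeholder from above) could look like this:

docker run -d --name=linstor-controller \
  -p 3376:3376 -p 3377:3377 \
  -v /some/dir:/var/lib/linstor drbd.io/linstor-controller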

Running a LINSTOR satellite

docker run -d --name=linstor-satellite --net=host \
 --privileged drbd.io/linstor-satellite 

The satellite is the component that creates the actual block devices: on the one hand the backing devices (usually LVM), and on the other the actual DRBD block devices. Therefore this container needs access to /dev, and it needs to share the host networking. Host networking is required for the communication between drbdsetup and the actual kernel module.

Configuring the Cluster

We have to set up LINSTOR as usual, which, fortunately, is an easy task and has to be done only once. In the spirit of this blog post, let’s use a containerized LINSTOR client as well. As the client obviously has to talk to the controller, we need to tell the client in the container where to find the controller. This is done by setting the environment variable LS_CONTROLLERS.

docker run -it --rm -e LS_CONTROLLERS=Controller \ 
  drbd.io/linstor-client interactive
  ...
- volume-definition (vd)
LINSTOR ==> node create Satellite1 172.42.42.10
LINSTOR ==> node create Satellite2 172.42.42.20
LINSTOR ==> storage-pool-definition create drbdpool
LINSTOR ==> storage-pool create lvm Satellite1 drbdpool drbdpool
LINSTOR ==> storage-pool create lvm Satellite2 drbdpool drbdpool 

Creating a replicated DRBD resource

So far we have loaded the kernel module on the satellites, started the controller and satellite containers, and configured the LINSTOR cluster. Now it is time to actually create resources.

docker run -it --rm -e LS_CONTROLLERS=Controller \ 
  drbd.io/linstor-client interactive
  ... 
- volume-definition (vd)
 LINSTOR ==> resource-definition create demo
 LINSTOR ==> volume-definition create demo 1G
 LINSTOR ==> resource create demo --storage-pool drbdpool --auto-place 2 

If you have drbd-utils installed on the host, you can now see the DRBD resource as usual via drbdsetup status. But we can also use a container to do that. On one of the satellites you can run a throw-away linstor-satellite container which contains drbd-utils:

docker run -it --rm --net=host --privileged \
 --entrypoint=/bin/bash drbd.io/linstor-satellite
$ drbdsetup status
$ lvs

Note that by default you will not see the symbolic links for the backing devices created by LVM/udev in the LINSTOR satellite container. That is expected. In the container you will see something like /dev/drbdpool/demo_00000, while on the host you will only see /dev/dm-X, and lvs will not show the LVs. If you really want to see the LVs on the host, you could execute lvscan -a --cache, but there is no actual reason for that. One might also map the lvmetad socket to the container.

Summary

As you can see, LINBIT’s container story is now complete. It is now possible to deploy the whole stack via containers. This ranges from the lowest level of providing the kernel modules to the highest level of LINSTOR SDS including the client, the controller, and satellites.

Roland Kammerer
Software Engineer at LINBIT
Roland Kammerer studied technical computer science at the Vienna University of Technology and graduated with distinction. Currently, he is a PhD candidate with a research focus on time-triggered real-time systems and works for LINBIT in the DRBD development team.

How LINBIT Improved the DRBD Sync-Rate Logic

About nine years ago, LINBIT introduced the dynamic sync-rate controller in DRBD 8.3.9. The goal behind this was to tune the amount of resync traffic so as to not interfere or compete with application IO to the DRBD device. We did this by examining the current state of the resync, application IO, and network, ten times a second and then deciding on how many resync requests to generate.

We knew from experimentation that we could achieve higher resync throughput if we polled the situation during a resync more than ten times a second. However, we didn’t want to shorten this interval by default, as that would uselessly consume CPU cycles for DRBD installations on slower hardware. With DRBD 9.0.17, we now have some additional logic: the resync-rate controller re-evaluates the situation both every 100ms AND as soon as all outstanding resync requests complete, even before the 100ms timer expires.

We estimate that these improvements will only be beneficial to storage and networks that are capable of going faster than 800MiB/s. For our benchmarks, I used two identical systems running CentOS 7.6.1810. Both are equipped with 3x Samsung 960 Pro M.2 NVMe drives. The drives are configured in RAID 0 via Linux software RAID. My initial baseline test found that in this configuration, the disks can achieve a throughput of around 3.1GiB/s (4k sequential writes). For the replication network, we have dual-port 40Gb/s Mellanox ConnectX-5 devices. They’re configured as a bonded interface using mode 2 (balance-xor). This crossover network was benchmarked at 37.7Gb/s (4.3GiB/s) using iperf3. These are actually the same systems used in my NVMe-oF vs. iSER test here.

Then, I ran through multiple resyncs as I worked out how best to tune the DRBD configuration for this environment. It was no surprise that DRBD 9.0.17 looked to be the clear winner. However, this was fully tuned. I wanted to find out what would happen if we scaled back the tuning slightly, so I also tested while using the default values for sndbuf-size and rcvbuf-size. The results were fairly similar, but surprisingly 9.0.16 did a little better. For my “fully tuned” test, my configuration was the default except for the following tuning:

net {
    max-buffers     80k;
    sndbuf-size     10M;
    rcvbuf-size     10M;
}
disk {
    resync-rate     3123M; # bytes/second, default
    c-plan-ahead    20;    # 1/10 seconds, default
    c-delay-target  10;    # 1/10 seconds, default
    c-fill-target   1M;    # bytes, default
    c-max-rate      3123M; # bytes/second, default
    c-min-rate      250k;  # bytes/second, default
}

Three tests were run. All tests were done with a 500GiB LVM volume using the /dev/md0 device as the physical volume. A blkdiscard was run against the LVM volume between each test. The first test was with no application IO to the /dev/drbd0 device and without the sndbuf-size and rcvbuf-size tuning. The second was with no application IO, and fully tuned as per the configuration above. The third test was again fully tuned with the configuration above; however, immediately after the resource was promoted to primary, I began a loop that used ‘dd’ to write 1M chunks of zeros to the disk sequentially.
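
The exact script is not included here, but a minimal sketch of such a write loop against the /dev/drbd0 device mentioned above might look like this (the dd flags are illustrative):

# keep writing 1M chunks of zeros sequentially to the promoted DRBD device
while true; do
    dd if=/dev/zero of=/dev/drbd0 bs=1M
done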

From the graph above, we can see that – when idle – DRBD 9.0.17 can resync at the speeds the physical disk will allow, whereas 9.0.16 seems to top out around 2000MiB/s. DRBD 8.4.11 can’t seem to push past 1500MiB/s. However, when we introduce IO to the DRBD virtual disk, both DRBD 9 versions scale back the resync speeds to roughly the same level. Surprisingly, DRBD 8.4 doesn’t throttle down as much and hovers around 700MiB/s. This is most likely due to an increase in IO lockout granularity between versions 8 and 9. However, faster here is not necessarily desired: it is usually favorable to have the resync speeds throttled down in order to “step aside” and allow application IO to take priority.

Have questions? Submit them below. We’re happy to help!

Devin Vance
First introduced to Linux back in 1996, and using Linux almost exclusively by 2005, Devin has years of Linux administration and systems engineering under his belt. He has been deploying and improving clusters with LINBIT since 2011. When not at the keyboard, you can usually find Devin wrenching on an American motorcycle or down at one of the local bowling alleys.

Request Your Highly Available NFS Ansible Playbook!

LINBIT wants to make your testing of our software easy! We’ve started creating Ansible playbooks that automate the deployment of our most commonly clustered software stacks.

The HA NFS playbook will automate the deployment of an HA NFS cluster, using DRBD 9 for replicated storage and Pacemaker as the cluster resource manager; quicker than you can brew a pot of coffee.

Email us for your free trial!

Be sure to write “Ansible” in the subject line.

Build an HA NFS Cluster using Ansible with packages from LINBIT.

System requirements:

  • An account at https://my.linbit.com (contact [email protected]).
  • Deployment environment must have Ansible 2.7.0+ and python-netaddr.
  • All target systems must have passwordless SSH access.
  • All hostnames used in the inventory file must be resolvable (better to use IP addresses).
  • Target systems are CentOS/RHEL 7.

More information is available: LINBIT Ansible NFS Cluster on GitHub.


A new star is born: LIN:BEAT – The Band

LINBIT is opening up a new division to specifically address our community’s desire to turn the music up to 11! As you all know, LINBIT is famous for its DRBD, Software-Defined Storage (SDS) and Disaster Recovery (DR) solutions. In a paradigm-shifting turn of events by management, LINBIT has decided to expand into the music industry. Since there is so much business potential in playing live concerts, LINBIT has transformed five of their employees into LIN:BEAT – The Band. These concerts are of course made highly-available by utilizing LINBIT’s own DRBD software.


The Band will be touring all Cloud and Linux events around the globe in 2019. Band members use self-written code to produce their unique sound design, reminiscent of drum and bass and heavily influenced by folk punk. The urge to portray all the advantages of DRBD and LINSTOR is so strong they had to send a message to the world: Their songs tell the world about the ups and downs of administrators who handle big storage clusters. LIN:BEAT offers a variety of styles in their musical oeuvre: While “LINSTOR” is a heavy-driven rock song and “Snapshots” speaks to funk-loving people, even “Disaster Recovery,” a love ballad, made it into their repertoire. Lead singer, Phil Reisner, sings sotto voce about his lost love — a RAID rack called “SDS”. Reisner told the reporters, “Administrators are such underrated people. This is sadly unfortunate. We strive to give all administrators a voice. Even if it’s a musical one!”

Crowds will be jumping up and down in excitement when LIN:BEAT comes to town! Be there and code fair!

An Excerpt of the song “My first love is DRBD” written by Phil Reisner:

My first love is DRBD,

there has never been a fee

instead it serves proudly as open source

your replication has changed the course

It’s the crucial key for Linux’s destiny

Rest Easy

Once upon a time, in an office far far away, LINSTOR was at v0.2 when I started using our own python API to write an OpenStack volume driver for LINSTOR. This LINSTOR API allows a python script to provision and manage LINSTOR volumes. For example, Linstor.resource_create(rsc_name=”mine”, node_name=”not_yours”) would create a LINSTOR resource called “mine” on a computer “not_yours.” Similarly, Linstor.node_list() would return the list of storage nodes in the current LINSTOR cluster.

After a healthy amount of coffee and snacks, my driver started making progress creating volumes and snapshots in OpenStack. As the project progressed and I became more comfortable using the API, I wrote a small GUI prototype to mimic the functionality of LINSTOR volume provisioning in OpenStack. I found a python-based GUI library called REMI which offers rich GUI features within a single python library. REMI allowed fast prototyping with minimal overhead and even cross-platform deployment.

I wrote a small proof-of-concept GUI script called LINSTOR View that manages LINSTOR volumes with a graphical interface. REMI provides the UI while my script uses the API calls to manage the backend storage. The prototype could list, create, and delete LINSTOR volumes.

Fast forward to 2019: LINSTOR v0.9.2 was just released along with DRBD v9.0.17. One of the many new features of LINSTOR is a REST API. Just like any typical REST implementation, a POST request creates a LINSTOR asset while DELETE does the opposite; similarly, a PUT request modifies an asset, and so on. Software-defined storage (SDS) with LINSTOR gets even easier with this API. I believe this REST API will allow for easier development with LINSTOR and faster integration with other platforms.
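
As a quick, hedged illustration only — the port and paths below follow LINBIT’s published REST API documentation (HTTP on the controller’s port 3370 and the /v1/... paths) rather than anything stated in this announcement:

curl -s http://localhost:3370/v1/nodes                                # GET: list the cluster's nodes
curl -s -X DELETE http://localhost:3370/v1/resource-definitions/demo  # DELETE: remove the "demo" resource definition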

The release notes are available here, along with a few other goodies. But without further stealing the thunder from Rene Peinthor and the rest of the Viennese development team, I bid auf Wiedersehen. I look forward to new developments as LINSTOR nears v1.0.0.

For any questions or comments, please feel free to reach me in the comments below or at [email protected].

Woojay Poynter
IO Plumber
Woojay is working on data replication and software-defined storage with LINSTOR, built on DRBD @LINBIT. He has worked on web development, embedded firmware, professional culinary education, and power carving in ice and wood. He is a proud father and likes to play with legos.


NVMe-oF and iSCSI (iSER)

As a Linux storage company, LINBIT is always excited for an opportunity to work with the latest developments in storage. One of these new technologies is NVMe over Fabrics (NVMe-oF). NVMe is a device specification for non-volatile memory that utilizes the PCIe bus, and NVMe-oF allows us to use this specification over a network. You can think of this in terms similar to SCSI and iSCSI. Also, much like iSCSI, it isn’t actually required that you use an NVMe device as the backing storage for an NVMe-oF device. This makes NVMe-oF a great way to attach DRBD-backed storage clusters to hypervisors, container hosts, or applications of numerous types.

The parallels between NVMe-oF and iSCSI were obvious, so I naturally wanted to do some testing to see how the two compared. I had originally intended to compare iSCSI to NVMe over TCP, but soon found out those patches were not yet merged upstream. As I was still intending to test using Ethernet interfaces, I quickly steered towards RoCE (RDMA over Converged Ethernet). Then, in order to make a fairer comparison, I used the iSER (iSCSI Extensions for RDMA) transport for iSCSI.

The systems in use are relatively new Intel i7 machines (i7-7820X). The single CPU has 16 threads and a clock speed of 3.6GHz. The systems both have 64GiB of 2133 DDR4 memory. The storage is 3x 512GiB Samsung 970 PRO drives configured in RAID 0 via Linux software RAID. The network between the initiator and target was two directly connected Mellanox ConnectX-5 interfaces bonded using mode 2 (balance-xor).

The tests were all focused on IO operations per second with 4k block sizes. Backing disks were configured to use the mq-deadline scheduler. All tests were performed using fio version 3.13. The direct IO test ran for 30 seconds using 16 jobs and an iodepth of 32. The libaio ioengine was used for all tests. The exact fio command can be found in the footnotes. [1]
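
For readability, here is a sketch of how that footnoted command might be driven; the device path and the list of IO patterns are assumptions, not the exact script used for these tests:

BLOCKDEVICE=/dev/nvme0n1   # the attached NVMe-oF (or iSER) device on the initiator
for IOPATTERN in write randwrite read randread; do
    fio --name test-$IOPATTERN --filename $BLOCKDEVICE --ioengine libaio \
        --direct 1 --rw $IOPATTERN --bs=4k --runtime 30s --numjobs 16 \
        --iodepth 32 --group_reporting
done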

Much to my surprise, it seems that iSCSI with iSER outperformed NVMe-oF in sequential writes. However, iSCSI really struggled with random IO, both in reads and writes. In regard to random IO, NVMe-oF outperformed iSCSI by roughly 550%. With the exception of the iSCSI random IO and the NVMe-oF sequential writes, most tests performed nearly on par with the raw hardware when tested locally, coming in at well over 1 million IOPS! If you have any random-IO-intensive workloads, it might be time to consider implementing NVMe-oF.

Let us know in the comments below if you’re looking to make the jump to NVMe-oF, if you’ve already made the jump and the differences you’ve seen, or if you have any questions regarding our test environment.

Footnotes

1. /usr/local/bin/fio --name test$i --filename $BLOCKDEVICE --ioengine libaio --direct 1 --rw $IOPATTERN --bs=4k --runtime 30s --numjobs 16 --iodepth 32 --group_reporting --append-terse

Devin Vance
First introduced to Linux back in 1996, and using Linux almost exclusively by 2005, Devin has years of Linux administration and systems engineering under his belt. He has been deploying and improving clusters with LINBIT since 2011. When not at the keyboard, you can usually find Devin wrenching on an American motorcycle or down at one of the local bowling alleys.


LINSTOR grows beyond DRBD

For quite some time, LINSTOR has been able to use NVMe-oF storage targets via the Swordfish API. This was expressed in LINSTOR as a resource definition that contains a single resource with one backing disk (that is the NVMe-oF target) and one diskless resource (that is the NVMe-oF initiator).

Layers in the storage stack

In the last few months the team has been busy making LINSTOR more generic, adding support for resource templates. A resource template describes a storage stack in terms of layers for specific resources/volumes. Here are some examples of such storage stacks (a sketch of requesting one with the LINSTOR client follows the list):

    • DRBD on top of logical volumes (LVM)
    • DRBD on top of zvols (ZFS)
    • Swordfish initiator & target on top of logical volumes (LVM)
    • DRBD on top of LUKS on top of logical volumes (LVM)
    • LVM only
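
In today’s LINSTOR client such a stack is requested as a layer list at resource creation time. The following is a sketch only — the --layer-list spelling reflects later client releases, and the node, resource, and pool names are placeholders rather than anything from this announcement:

linstor resource-definition create demo
linstor volume-definition create demo 1G
linstor resource create node-a demo --storage-pool thinpool --layer-list drbd,luks,storage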

The team came up with an elegant approach that introduces these additional resource templates while allowing existing LINSTOR configurations to keep their semantics as the default resource templates.

With this decoupling, we no longer need to have DRBD installed on LINSTOR clusters that do not require the replication functions of DRBD.

What does that mean for DRBD?

The interests of LINBIT’s customers vary widely. Some want to use LINSTOR without DRBD – which is now supported. A very prominent example of this is Intel, who uses LINSTOR in its Rack Scale Design effort to connect storage nodes and compute nodes with NVMe-oF. In this example, the storage is disaggregated from the other nodes.

Other customers see converged architectures as a better fit. For converged scenarios, DRBD has many advantages over a pure data access protocol such as NVMe-oF. LINSTOR is built from the ground up to manage DRBD, therefore, the need for DRBD support will remain.

Linux-native NVMe-oF and NVMe/TCP

SNIA’s Swordfish has clear benefits as a standard for managing storage targets: it allows for optimized storage target implementations, as well as a hardware-accelerated data path with a non-Linux control path.

Because Swordfish is an extension of Redfish, which needs to be implemented in the Baseboard Management Controller (BMC), we have decided to extend LINSTOR’s driver set to configure NVMe-oF target and initiator software directly. We do this by utilizing existing tools found within the Linux operating system, eliminating the need for a Swordfish software stack.

Summary

LINSTOR now supports configurations without DRBD. It is now a unified storage orchestrator for replicated and non-replicated storage.

Philipp Reisner
Philipp Reisner is founder and CEO of LINBIT in Vienna/Austria. His professional career has been dominated by developing DRBD, a storage replication for Linux. Today he leads a company of about 30 employees with locations in Vienna/Austria and Portland/Oregon.


Speed Up! NVMe-oF for LINSTOR

What is NVMe?

The storage world has gained a number of new terms in the last few years. Let’s start with NVMe. The abbreviation stands for Non-Volatile Memory Express, which isn’t very self-explanatory. It all began a few years back when NAND flash started to make major inroads into the storage industry, and the new storage medium needed to be accessed through existing interfaces like SATA and Serial Attached SCSI (SAS).

Back at that time, FusionIO created a NAND flash-based SSD that was directly plugged into the PCIe slot of a server. This eliminated the bottleneck of the ATA or SCSI command sets and the interfaces coming from a time of rotating storage media.

The FusionIO products shipped with proprietary drivers, and the industry set forth in creating an open standard suited to the performance of NAND flash. One of the organizations where the players of the industry can meet, align, and create standards is the Storage Networking Industry Association (SNIA).

The first NVMe standard was published in 2013, and it describes a PCIe-based interface and command set to access fast storage. This can be thought of as a cleaned up version of the ATA or SCSI commands plus a PCIe interface.

What is NVMe-oF and NVMe/TCP?

Similar to what iSCSI is to SCSI, NVMe-oF and NVMe/TCP are standards that describe how to send NVMe commands over networks. NVMe-oF requires an RDMA-capable network (like InfiniBand or RoCE), while NVMe/TCP works on every network that can carry IP traffic.

There are two terms of which to be aware: 1) the initiator is where the applications run that want to access the dataset. Linux comes with a built-in initiator; likewise, other OSes already have initiators or will have them soon.

And, 2) the target is where the data is stored. Linux comes with a software target built into the kernel. It might not be obvious that any Linux block device can be made available as an NVMe-oF target using the Linux target software; it is not limited to NVMe devices.
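
To make that concrete, here is a hedged sketch of exporting an arbitrary block device with the in-kernel target over RDMA and attaching it from an initiator. The NQN, backing device, and addresses are placeholders, and this is not the LINSTOR-driven setup discussed elsewhere on this blog:

# target side: export /dev/sdb through the in-kernel nvmet target
modprobe nvmet
modprobe nvmet-rdma
SUBSYS=/sys/kernel/config/nvmet/subsystems/nqn.2019-02.io.drbd:demo
mkdir $SUBSYS
echo 1 > $SUBSYS/attr_allow_any_host
mkdir $SUBSYS/namespaces/1
echo /dev/sdb > $SUBSYS/namespaces/1/device_path
echo 1 > $SUBSYS/namespaces/1/enable
PORT=/sys/kernel/config/nvmet/ports/1
mkdir $PORT
echo rdma > $PORT/addr_trtype
echo ipv4 > $PORT/addr_adrfam
echo 192.168.0.1 > $PORT/addr_traddr
echo 4420 > $PORT/addr_trsvcid
ln -s $SUBSYS $PORT/subsystems/

# initiator side: discover and connect with nvme-cli
modprobe nvme-rdma
nvme discover -t rdma -a 192.168.0.1 -s 4420
nvme connect -t rdma -n nqn.2019-02.io.drbd:demo -a 192.168.0.1 -s 4420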

What does this have to do with Swordfish?

While the iSCSI or NVMe-oF standards describe how READ, WRITE, and other operations on block data are shipped from the initiator to the target, they do not describe how a target (volume) gets created or configured. For too many years, this was the realm of vendor-specific APIs and GUIs.

SNIA’s Swordfish standard describes how to manage storage targets and make them accessible as NVMe-oF targets. It is a REST API with JSON data. As such, it is easy to understand and embrace.

The major drawback of Swordfish is that it is defined as an extension of Redfish. Redfish is a standard to manage servers over the network; it can be thought of as a modernized IPMI. As such, Redfish will usually be implemented on a Baseboard Management Controller (BMC). While Redfish has its advantages over IPMI, it does not provide something completely new.

On the other hand, Swordfish is something that was not there before, but as it is an extension to Redfish, implementing it usually means that the machine needs a Redfish-enabled BMC, which may hinder or slow down the adoption of Swordfish.

LINSTOR

Since version 0.7, LINSTOR has been capable of working with storage provided by Swordfish-compliant storage targets, as well as their initiator counterparts.

Summary

LINSTOR has gained the capability of managing storage on Swordfish/NVMe-oF targets besides working with DRBD and direct attached storage on Linux servers.

Philipp Reisner
Philipp Reisner is founder and CEO of LINBIT in Vienna/Austria. His professional career has been dominated by developing DRBD, a storage replication for Linux. Today he leads a company of about 30 employees with locations in Vienna/Austria and Portland/Oregon.