
Meet us at Cloudfest in Rust, Germany!

Come and meet LINBIT at CloudFest 2019 in Europa-Park, Rust, Germany (23rd – 29th of March 2019).


CloudFest has made a name for itself over the last few years as one of the best cloud-focused industry events in which to network and have a good time. This year more than 7,000 people are attending the event. Attendees will hear from leaders in the business and get the latest industry buzz.

The speaker line-up includes names like Dr. Ye Huang, Head of Solution Architects at Alibaba; Will Pemble, CEO at Goal Boss; Bhavin Turakhia, CEO at Flock; and Brian Behlendorf, inventor of the Apache web server.

Visit us at Booth H24!

LINBIT is announcing some exciting news at CloudFest: NVMe-oF support in LINSTOR! This means LINSTOR can now be used as a standalone product, independent of DRBD. NVMe-oF supports InfiniBand with RDMA and allows ultra-fast performance, easily handling workloads for big data analytics or artificial intelligence. Come say hello at CloudFest and visit us at Booth H24.

We look forward to seeing you!

Booth visitors will be rewarded with a surprise that even your family will love! 🙂 

March 2019 Newsletter

NVMe Over Fabrics

NVMe-oF and iSCSI (iSER)

As a Linux storage company, LINBIT is always excited for an opportunity to work with the latest developments in storage. One of these new technologies is NVMe over Fabrics (NVMe-oF). NVMe is a device specification for non-volatile memory that utilizes the PCIe bus, and NVMe-oF allows us to use these specifications over a network. You can think of the relationship in terms similar to SCSI and iSCSI. Also, much like iSCSI, it isn’t actually required that you use an NVMe device as the backing storage for an NVMe-oF device. This makes NVMe-oF a great way to attach DRBD-backed storage clusters to hypervisors, container hosts, or applications of numerous types.

The parallels between NVMe-oF and iSCSI were obvious, so I naturally wanted to do some testing to see how it compared with iSCSI. I had originally intended to compare iSCSI to NVMe over TCP, but soon found out those patches were not yet merged upstream. As I was still intending to test using Ethernet interfaces, I quickly steered towards RoCE (RDMA over Converged Ethernet). Then, in order to make a fairer comparison, I used the iSER (iSCSI Extensions for RDMA) transport for iSCSI.

The systems in use are relatively new Intel i7 machines (i7-7820X). The single CPU has 16 threads and a clock speed of 3.6 GHz. Both systems have 64 GiB of DDR4-2133 memory. The storage is 3x 512 GB Samsung 970 PRO drives configured in RAID 0 via Linux software RAID. The network between the initiator and target was two directly connected Mellanox ConnectX-5 interfaces bonded using mode 2 (balance-xor).

The tests were all focused on IO operations per second (IOPS) with a 4k block size. Backing disks were configured to use the mq-deadline scheduler. All tests were performed using fio version 3.13. The direct IO test ran for 30 seconds using 16 jobs and an iodepth of 32. The libaio ioengine was used for all tests. The exact fio command can be found in footnote 1.

 

Much to my surprise, it seems that iSCSI with iSER outperformed NVMe-oF in sequential writes. However, iSCSI really struggled with random IO, both in reads and writes. In regard to random IO, NVMe-oF outperformed iSCSI by roughly 550%. With the exception of the iSCSI random IO and the NVMe-oF sequential writes, most tests performed nearly on par with the raw hardware when tested locally, coming in at well over 1 million IOPS! If you have any random-IO-intensive workloads, it might be time to consider implementing NVMe-oF.

Let us know in the comments below if you’re looking to make the jump to NVMe-oF, if you’ve already made the jump and the differences you’ve seen, or if you have any questions regarding our test environment.

Footnotes

1. /usr/local/bin/fio --name test$i --filename $BLOCKDEVICE --ioengine libaio --direct 1 --rw $IOPATTERN --bs=4k --runtime 30s --numjobs 16 --iodepth 32 --group_reporting --append-terse

Devin Vance
First introduced to Linux back in 1996, and using Linux almost exclusively by 2005, Devin has years of Linux administration and systems engineering under his belt. He has been deploying and improving clusters with LINBIT since 2011. When not at the keyboard, you can usually find Devin wrenching on an American motorcycle or down at one of the local bowling alleys.


LINSTOR grows beyond DRBD

For some time, LINSTOR has been able to use NVMe-oF storage targets via the Swordfish API. This was expressed in LINSTOR as a resource definition containing one resource with a backing disk (the NVMe-oF target) and one diskless resource (the NVMe-oF initiator).

Layers in the storage stack

In the last few months the team has been busy making LINSTOR more generic, adding support for resource templates. A resource template describes a storage stack in terms of layers for specific resources/volumes. Here are some examples of such storage stacks:

    • DRBD on top of logical volumes (LVM)
    • DRBD on top of zvols (ZFS)
    • Swordfish initiator & target on top of logical volumes (LVM)
    • DRBD on top of LUKS on top of logical volumes (LVM)
    • LVM only

The team came up with an elegant approach that introduces these additional resource templates in a way that lets existing LINSTOR configurations keep their semantics: they simply map to the default resource templates.

With this decoupling, we no longer need to have DRBD installed on LINSTOR clusters that do not require the replication functions of DRBD.

What does that mean for DRBD?

The interests of LINBIT’s customers vary widely. Some want to use LINSTOR without DRBD – this is now supported. A very prominent example of this category is Intel, which uses LINSTOR in its Rack Scale Design effort to connect storage nodes and compute nodes with NVMe-oF. In this example, the storage is disaggregated from the compute nodes.

Other customers see converged architectures as a better fit. For converged scenarios, DRBD has many advantages over a pure data access protocol such as NVMe-oF. LINSTOR was built from the ground up to manage DRBD, so DRBD support will remain.

Linux native NVMe-oF and NVMe/TCP

SNIA’s Swordfish has clear benefits as a standard for managing storage targets: it allows optimized storage target implementations as well as a hardware-accelerated data path with a non-Linux control path.

Because Swordfish is an extension of Redfish, which needs to be implemented in the BMC (Baseboard Management Controller), we have decided to extend LINSTOR’s driver set to configure NVMe-oF target and initiator software directly. We do this by utilizing existing tools found within the Linux operating system, eliminating the need for a Swordfish software stack.

Summary

LINSTOR now supports configurations without DRBD. It is now a unified storage orchestrator for replicated and non-replicated storage.

Philipp Reisner
Philipp Reisner is founder and CEO of LINBIT in Vienna, Austria. His professional career has been dominated by developing DRBD, a storage replication technology for Linux. Today he leads a company of about 30 employees with locations in Vienna, Austria and Portland, Oregon.


Speed Up! NVMe-oF for LINSTOR

What is NVMe?

The storage world has gained a number of new terms in the last few years. Let’s start with NVMe. The abbreviation stands for Non-Volatile Memory Express, which isn’t very self-explanatory. It all began a few years back when NAND flash started to make major inroads into the storage industry, and the new storage medium initially had to be accessed through existing interfaces like SATA and SAS (Serial Attached SCSI).

Back at that time, FusionIO created a NAND flash-based SSD that was directly plugged into the PCIe slot of a server. This eliminated the bottleneck of the ATA or SCSI command sets and the interfaces coming from a time of rotating storage media.

The FusionIO products shipped with proprietary drivers, and the industry set out to create an open standard suited to the performance of NAND flash. One of the organizations where the players of the industry can meet, align and create standards is SNIA (the Storage Networking Industry Association).

The first NVMe standard was published in 2011, and it describes a PCIe-based interface and command set to access fast storage. It can be thought of as a cleaned-up version of the ATA or SCSI command sets plus a PCIe interface.

What is NVMe-oF and NVMe/TCP?

Similar to what iSCSI is to SCSI, NVMe-oF and NVMe/TCP are standards that describe how to send NVMe commands over networks. NVMe-oF requires an RDMA-capable network (like InfiniBand or RoCE), while NVMe/TCP works on every network that can carry IP traffic.

There are two terms to be aware of. The initiator is where the applications that want to access the data run. Linux comes with a built-in initiator; likewise, other OSes already have initiators or will have them soon.

The target is where the data is stored. Linux comes with a software target built into the kernel. It might not be obvious, but any Linux block device can be made available as an NVMe-oF target using the Linux target software; it is not limited to NVMe devices.
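To make that concrete, here is a minimal sketch of exporting an arbitrary block device through the kernel’s nvmet configfs interface, written in Python purely for illustration. It assumes the nvmet and nvmet-rdma modules are loaded and that it runs as root; the NQN, IP address, and device path are placeholders, and this is not necessarily how LINSTOR’s own drivers do it.

# Hypothetical example: export a block device as an NVMe-oF target via configfs.
# Assumptions: nvmet and nvmet-rdma are loaded, we run as root, and the NQN,
# address and device below are placeholders for your environment.
from pathlib import Path

NVMET = Path("/sys/kernel/config/nvmet")
NQN = "nqn.2019-03.com.example:demo"      # placeholder subsystem NQN
DEVICE = "/dev/drbd1000"                  # any Linux block device will do
ADDR, PORT_ID = "192.168.1.10", "1"       # address of an RDMA-capable NIC

# Create the subsystem and allow any host to connect (fine for a lab test).
subsys = NVMET / "subsystems" / NQN
subsys.mkdir(parents=True)
(subsys / "attr_allow_any_host").write_text("1\n")

# Add a namespace backed by the block device and enable it.
ns = subsys / "namespaces" / "1"
ns.mkdir(parents=True)
(ns / "device_path").write_text(DEVICE + "\n")
(ns / "enable").write_text("1\n")

# Create an RDMA port and link the subsystem to it.
port = NVMET / "ports" / PORT_ID
port.mkdir(parents=True)
(port / "addr_adrfam").write_text("ipv4\n")
(port / "addr_trtype").write_text("rdma\n")
(port / "addr_traddr").write_text(ADDR + "\n")
(port / "addr_trsvcid").write_text("4420\n")
(port / "subsystems" / NQN).symlink_to(subsys)

# An initiator can then attach the device with nvme-cli, for example:
#   nvme connect -t rdma -n nqn.2019-03.com.example:demo -a 192.168.1.10 -s 4420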

What does this have to do with Swordfish?

While the iSCSI and NVMe-oF standards describe how READ, WRITE and other operations on block data are shipped from the initiator to the target, they do not describe how a target (volume) gets created or configured. For too many years, this was the realm of vendor-specific APIs and GUIs.

SNIA’s Swordfish standard describes how to manage storage and make it accessible as NVMe-oF targets. It is a REST API with JSON data and, as such, easy to understand and embrace.
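As a taste of how approachable that is, here is a hedged sketch that fetches the Redfish service root and lists any Swordfish storage services it advertises. The BMC address and credentials are placeholders, and the StorageServices collection name follows the early Swordfish revisions, so your system may differ.

# Hypothetical example: browse a Swordfish endpoint with plain HTTPS and JSON.
# The BMC address, credentials, and the presence of a StorageServices
# collection are assumptions about the managed system.
import requests

BASE = "https://bmc.example.com"     # placeholder BMC / Swordfish service
AUTH = ("admin", "password")         # placeholder credentials

root = requests.get(BASE + "/redfish/v1/", auth=AUTH, verify=False).json()
print(root.get("Name"), root.get("RedfishVersion"))

# Swordfish extends the Redfish service root with storage collections.
services = root.get("StorageServices", {}).get("@odata.id")
if services:
    listing = requests.get(BASE + services, auth=AUTH, verify=False).json()
    for member in listing.get("Members", []):
        print("storage service:", member["@odata.id"])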

The major drawback of Swordfish is probably that it is defined as an extension of Redfish. Redfish is a standard for managing servers over the network and can be thought of as a modernized IPMI. As such, Redfish will usually be implemented on a BMC (Baseboard Management Controller). While Redfish has its advantages over IPMI, it does not provide something completely new.

Swordfish, on the other hand, is something that was not there before. But as it is an extension to Redfish, implementing it usually means that the machine needs a Redfish-enabled BMC. That may hinder or slow down the adoption of Swordfish.

LINSTOR

Since version 0.7, LINSTOR has been capable of working with storage provided by Swordfish-compliant storage targets as well as their initiator counterparts.

Summary

LINSTOR has gained the capability of managing storage on Swordfish/NVMe-oF targets, in addition to working with DRBD and direct-attached storage on Linux servers.

Philipp Reisner
Philipp Reisner is founder and CEO of LINBIT in Vienna, Austria. His professional career has been dominated by developing DRBD, a storage replication technology for Linux. Today he leads a company of about 30 employees with locations in Vienna, Austria and Portland, Oregon.


High Level Resource API – The simplicity of creating replicated volumes

In this blog post we present one of our recent extensions to the LINSTOR ecosystem: A high-level, user-friendly Python API that allows simple DRBD resource management via LINSTOR.

Background: so far, LINSTOR components have communicated by the following means: via Protocol Buffers, or via the Python API that is used in the linstor command line client. Protocol Buffers are a great way to transport serialized, structured data between LINSTOR components, but by themselves they don’t provide the necessary abstraction for developers.

That is not the job of Protocol Buffers. Since the early days, the command line client has been split into the client logic (parsing configuration files, parsing command line arguments, …) and a Python library (python-linstor). This Python library provides all the bits and pieces to interact with LINSTOR. For example, it provides a MultiLinstor class that handles TCP/IP communication to the LINSTOR controller. Additionally, it allows all the operations that are possible with LINSTOR (e.g. creating nodes, creating storage pools, …). For perfectly valid reasons, this API is very low-level and pretty close to the actual Protocol Buffer messages sent to the LINSTOR controller.
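For a feeling of that lower level, a minimal sketch might look like the following. The controller URI is an assumption about where your controller runs, and the method names reflect the python-linstor releases current at the time of writing, so they may differ in your version.

# Hedged example of the low-level python-linstor API.
# Assumption: a LINSTOR controller is reachable at linstor://localhost.
import linstor

lin = linstor.MultiLinstor(["linstor://localhost"])
lin.connect()
try:
    # Replies stay close to the underlying Protocol Buffer messages,
    # e.g. listing the cluster's nodes returns raw API reply objects.
    for reply in lin.node_list():
        print(reply)
finally:
    lin.disconnect()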

By developing more and more plugins to integrate LINSTOR into other projects like OpenStack, OpenNebula, Docker Volumes, and many more, we saw that there is a need for a higher-level abstraction.

Finding the Right Abstraction

The first dimension of abstraction is to abstract away from LINSTOR internals. For example, it makes perfect sense that recreating an existing resource is an error on a low level (think of it as EEXIST). On a higher level, depending on the actual object, trying to recreate an object might be perfectly fine, and one simply wants to get the existing object back (i.e. idempotency).
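Purely as an illustration of that idea (this is not LINSTOR code), the pattern looks roughly like this:

# Toy example: a low-level "create" treats an existing object as an error,
# while the higher-level wrapper turns the same call into get-or-create.
class AlreadyExists(Exception):
    """Low-level error, comparable to EEXIST."""

_registry = {}

def low_level_create(name):
    if name in _registry:
        raise AlreadyExists(name)
    _registry[name] = {"name": name}
    return _registry[name]

def get_or_create(name):
    try:
        return low_level_create(name)
    except AlreadyExists:
        return _registry[name]   # recreating is fine; hand back the existing object

get_or_create("foo")
get_or_create("foo")             # idempotent: no error the second time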

The second dimension is to abstract from DRBD and LINSTOR as a whole. Developers dealing with storage already have a good knowledge of concepts like nodes, storage pools, resources, volumes, placement policies… This is the part where we can make LINSTOR and DRBD accessible for new developers.

The third goal was to only provide the set of objects that is important in the context of the user/developer. This, for example, means that we can assume that the LINSTOR cluster is already set up, so we do not need to provide a high-level API to add nodes. For the higher-level API we can focus on [LINSTOR] resources. This allows us to satisfy the KISS (keep it simple, stupid) principle. A fourth goal was to introduce new, higher-level concepts like placement policies. Placement policies/templates are concepts currently being developed in core LINSTOR, but we can already provide the basics on a higher level.

Demo Time

We start by creating a 10 GB replicated LINSTOR/DRBD volume in a 3-node cluster. We want the volume to be stored redundantly on 2 nodes. Then we increase the size of the volume to 20 GB.

$ python
>>> import linstor

>>> foo = linstor.Resource('foo')

>>> foo.volumes[0] = linstor.Volume("10 GB")

There are multiple ways to specify the size.

>>> foo.placement.redundancy = 2

>>> foo.autoplace()

>>> foo.volumes[0].size += 10 * (2 ** 30)

This line is enough to resize a replicated volume cluster-wide.

We needed 5 lines of code to create a replicated DRBD volume in a cluster! Let that sink in for a moment and compare it to the steps that were necessary without LINSTOR: creating backing devices on all nodes, writing and synchronizing DRBD res(ource) files, creating metadata on all nodes, running drbdadm up for the resource, and forcing one node to the Primary role to start the initial sync.

For the next step we assume that the volume is replicated and that we are a storage plugin developer. Our goal is to make sure the volume is accessible on every node because the block device should be used in a VM. So we need to A) make sure we can access the block device, and B) find out what the name of the block device of the first volume actually is:

 

>>> import socket

>>> foo.activate(socket.gethostname())

>>> print(foo.volumes[0].device_path)

 

The activate method is one of those methods that shows what we mean by abstraction. Note that we autoplaced the resource twice in a 3-node cluster, so LINSTOR chose the two nodes that fit best. But now we want the resource to be accessible on every node without increasing the redundancy to 3 (because that would require additional storage, and two-times-replicated data is good enough).

Diskless clients

Fortunately, DRBD has us covered, as it has the concept of diskless clients. These nodes provide a local block device as usual, but they read and write data from/to their peers over the network only (i.e. no local storage). Creating this diskless assignment is not necessary if the node was already part of the replication in the first place (in that case it already has access to the data locally).

This is exactly what activate does: if the node can already access the data, fine; if not, it creates a diskless assignment. Now assume we are done and we do not need access to the device anymore. We want to do some cleanup because we no longer need the diskless assignment:

 

>>> foo.deactivate(socket.gethostname()) 

 

The semantics of this method are to remove the assignment if it is diskless (as it does not contribute to actual redundancy); if the node stores actual data, deactivate does nothing and keeps the data as redundant as it was.

This is only a very small subset of the functionality the high-level API provides. There is a lot more to know, like creating snapshots, converting diskless assignments to diskful ones and vice versa, or managing DRBD Proxy. For more information, check the online documentation.
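To tie the demo together, here is the same flow collected into one short script. It uses only the calls shown above; the node name and sizes are of course placeholders.

# The demo as a single script: create, autoplace, grow, use, and release a volume.
import socket

import linstor

foo = linstor.Resource('foo')
foo.volumes[0] = linstor.Volume("10 GB")   # one 10 GB volume
foo.placement.redundancy = 2               # store it on 2 of the cluster's nodes
foo.autoplace()

foo.volumes[0].size += 10 * (2 ** 30)      # grow it cluster-wide

foo.activate(socket.gethostname())         # diskless assignment if needed
print(foo.volumes[0].device_path)          # hand this path to the consumer (e.g. a VM)
foo.deactivate(socket.gethostname())       # drop the diskless assignment again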

If you want to go deeper into the LINSTOR universe, please visit our YouTube channel.

 

Roland Kammerer
Software Engineer at LINBIT
Roland Kammerer studied technical computer science at the Vienna University of Technology and graduated with distinction. Currently, he is a PhD candidate with a research focus on time-triggered real-time systems, and he works for LINBIT in the DRBD development team.

February 2019 Newsletter