
Performance Gains with DRBD 10

A key factor in evaluating storage systems is their performance. LINBIT has been working to further improve the performance of DRBD. The recent DRBD 10 alpha release demonstrates significant gains.

The performance gains particularly help with highly concurrent workloads. This is an area that has been steadily rising in importance and looks set to continue to rise: improvements in single-core speed appear to be stagnating, while ever larger numbers of cores are becoming available. Hence software systems need to utilize concurrency effectively to make the most of the computing resources.

We tested DRBD 10 with 4K random writes and various concurrency levels. In this test, the data is being replicated synchronously (“protocol C”) between two nodes. These numbers are for a single volume, not an aggregate over many volumes. I/O was generated by 8 processes. The tests show improvements in raw random write performance of up to 68%.
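For reference, protocol C is selected in the DRBD resource configuration file. Here is a minimal two-node sketch; the host names, addresses and backing device are invented for illustration:

resource r500 {
    net {
        protocol C;   # fully synchronous: a write completes only after
                      # both nodes have the data
    }
    device    /dev/drbd500;
    disk      /dev/nvme0n1;
    meta-disk internal;
    on node-a {
        node-id 0;
        address 192.0.2.1:7789;
    }
    on node-b {
        node-id 1;
        address 192.0.2.2:7789;
    }
    connection-mesh {
        hosts node-a node-b;
    }
}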

[Chart: DRBD 10 performance gains over DRBD 9 at various concurrency levels]

These improvements were achieved by using a finer-grained locking scheme. This allows, for instance, one core to be sending a request while a second core is submitting the next request. The result is better utilization of the available cores and overall higher throughput.

Technical details

The above tests were carried out on a pair of 16-core servers equipped with NVMe storage and a direct Ethernet connection. The software versions used were DRBD 10.0.0a1 and its most recent ancestor from the DRBD 9 branch (8e93a5d93b62). I/O was generated using the fio tool with the following parameters:

fio --name=test --rw=randwrite --direct=1 \
    --numjobs=8 --ioengine=libaio --iodepth=$IODEPTH \
    --bs=4k --time_based=1 --runtime=60 \
    --size=48G --filename=/dev/drbd500

Ongoing development on DRBD 10

LINBIT is working on a number of exciting major features for DRBD 10.

  • Request forwarding. DRBD will send data to a geographically distant site only once; the data is then replicated onwards at that site.
  • PMEM journaling. DRBD can already access its metadata in a PMEM-optimized fashion. That will be extended to using a PMEM device as a write-back cache, resulting in improved performance in latency-sensitive scenarios.
  • Erasure coding. DRBD will be able to erasure code and distribute its data. This provides the same functionality as RAID5/6, but with an arbitrary number of parity nodes. The result is lower disk usage with similar redundancy characteristics, as the example below illustrates.
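To put rough numbers on that: a hypothetical 4+2 scheme splits each block of data across four data nodes and two parity nodes, consuming 1.5 times the payload size while surviving the loss of any two nodes. Matching that failure tolerance with full replicas would require three copies, consuming 3 times the payload size.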

Stable releases of DRBD 10 are planned for 2020 – until then stay tuned for upcoming updates!

Joel Colledge
Joel is a software developer at LINBIT with a background in mathematics. A polyglot programmer, Joel enjoys working with many different languages and technologies. At LINBIT, he has been involved in the development of LINSTOR and DRBD. Originally from England, Joel is now based in Vienna, Austria.

LINSTOR grows beyond DRBD

For quite some time, LINSTOR has been able to use NVMe-oF storage targets via the Swordfish API. This was expressed in LINSTOR as a resource definition containing one resource with a backing disk (the NVMe-oF target) and one diskless resource (the NVMe-oF initiator).

Layers in the storage stack

In the last few months, the team has been busy making LINSTOR more generic by adding support for resource templates. A resource template describes a storage stack as a set of layers for specific resources/volumes. Here are some examples of such storage stacks:

    • DRBD on top of logical volumes (LVM)
    • DRBD on top of zvols (ZFS)
    • Swordfish initiator & target on top of logical volumes (LVM)
    • DRBD on top of LUKS on top of logical volumes (LVM)
    • LVM only

The team came up with an elegant approach: the additional resource templates are introduced in a way that lets existing LINSTOR configurations keep their semantics, which simply become the default resource templates.
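As an illustration, assuming the layer-list syntax used by recent LINSTOR releases (the resource, node and storage pool names are invented), a DRBD-on-LUKS-on-LVM stack could be requested like this:

# Define a resource whose volumes stack DRBD on top of LUKS on top of
# LVM-backed storage
linstor resource-definition create secure_r0 --layer-list drbd,luks,storage
linstor volume-definition create secure_r0 100G

# Place the resource on two nodes, backed by an LVM storage pool
linstor resource create node-a secure_r0 --storage-pool lvmpool
linstor resource create node-b secure_r0 --storage-pool lvmpool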

With this decoupling, we no longer need to have DRBD installed on LINSTOR clusters that do not require the replication functions of DRBD.

What does that mean for DRBD?

The interests of LINBIT’s customers vary widely. Some want to use LINSTOR without DRBD – which is now supported. A very prominent example of this is Intel, which uses LINSTOR in its Rack Scale Design effort to connect storage nodes and compute nodes with NVMe-oF. In this design, the storage is disaggregated from the compute nodes.

Other customers see converged architectures as a better fit. For converged scenarios, DRBD has many advantages over a pure data-access protocol such as NVMe-oF. LINSTOR was built from the ground up to manage DRBD, so the need for DRBD support will remain.

Linux-native NVMe-oF and NVMe/TCP

SNIA’s Swordfish has clear benefits as a standard for managing storage targets: it allows for optimized storage target implementations, for example with a hardware-accelerated data path or a non-Linux control path.

Because Swordfish is an extension of Redfish, which typically needs to be implemented in the Baseboard Management Controller (BMC), we have decided to extend LINSTOR’s driver set to configure NVMe-oF target and initiator software directly. We do this by utilizing existing tools found in the Linux operating system, eliminating the need for a Swordfish software stack.
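As a rough sketch of the plumbing involved (the address and NQN are invented for illustration): the in-kernel target is configured through the nvmet configfs tree, while the initiator side boils down to a couple of nvme-cli calls:

# Initiator side: load the RDMA transport, then discover and connect
modprobe nvme-rdma
nvme discover -t rdma -a 192.0.2.20 -s 4420
nvme connect -t rdma -a 192.0.2.20 -s 4420 -n nqn.2019-06.com.example:vol0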

Summary

LINSTOR now supports configurations without DRBD, making it a unified storage orchestrator for replicated and non-replicated storage.

Philipp Reisner
Philipp Reisner is founder and CEO of LINBIT in Vienna/Austria. His professional career has been dominated by developing DRBD, a storage replication for Linux. Today he leads a company of about 30 employees with locations in Vienna/Austria and Portland/Oregon.

Speed Up! NVMe-oF for LINSTOR

What is NVMe?

The storage world has gained a number of new terms in the last few years. Let’s start with NVMe. The abbreviation stands for Non-Volatile Memory Express, which isn’t very self-explanatory. It all began a few years back, when NAND flash started to make major inroads into the storage industry and the new storage medium initially had to be accessed through existing interfaces like SATA and Serial Attached SCSI (SAS).

Around that time, FusionIO created a NAND flash-based SSD that plugged directly into the PCIe slot of a server. This eliminated the bottleneck of the ATA and SCSI command sets and of interfaces dating from the era of rotating storage media.

The FusionIO products shipped with proprietary drivers, so the industry set about creating an open standard suited to the performance of NAND flash. One of the organizations where the players of the industry meet, align and create standards is the Storage Networking Industry Association (SNIA).

The first NVMe specification was published in 2011, and it describes a PCIe-based interface and command set to access fast storage. It can be thought of as a cleaned-up version of the ATA or SCSI command sets plus a PCIe interface.

What is NVMe-oF and NVMe/TCP?

Similar to what iSCSI is to SCSI, NVMe-oF and NVMe/TCP are standards that describe how to send NVMe commands over networks. NVMe-oF requires an RDMA-capable network (such as InfiniBand or RoCE), while NVMe/TCP works on every network that can carry IP traffic.

There are two terms to be aware of: 1) the initiator is where the applications that want to access the data run. Linux comes with a built-in initiator; other OSes already have initiators or will gain them soon.

And, 2) the target is where the data is stored. Linux comes with a software target built into the kernel. It might not be obvious, but any Linux block device can be made available as an NVMe-oF target using the Linux target software; it is not limited to NVMe devices.
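As a sketch of what that looks like in practice (the NQN, address and backing device are invented, and a real setup would restrict which hosts may connect), the target is configured through configfs and the initiator connects with nvme-cli:

# Target: export an arbitrary block device (not just an NVMe drive)
# over NVMe/TCP using the kernel's nvmet configfs interface
modprobe nvmet
modprobe nvmet-tcp
cd /sys/kernel/config/nvmet
mkdir subsystems/nqn.2019-07.com.example:vol0
echo 1 > subsystems/nqn.2019-07.com.example:vol0/attr_allow_any_host
mkdir subsystems/nqn.2019-07.com.example:vol0/namespaces/1
echo /dev/sdb > subsystems/nqn.2019-07.com.example:vol0/namespaces/1/device_path
echo 1 > subsystems/nqn.2019-07.com.example:vol0/namespaces/1/enable
mkdir ports/1
echo tcp > ports/1/addr_trtype
echo ipv4 > ports/1/addr_adrfam
echo 192.0.2.10 > ports/1/addr_traddr
echo 4420 > ports/1/addr_trsvcid
ln -s /sys/kernel/config/nvmet/subsystems/nqn.2019-07.com.example:vol0 ports/1/subsystems/

# Initiator: connect to the exported namespace
modprobe nvme-tcp
nvme connect -t tcp -a 192.0.2.10 -s 4420 -n nqn.2019-07.com.example:vol0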

What does this have to do with Swordfish?

While the iSCSI and NVMe-oF standards describe how READ, WRITE and other operations on block data are shipped from the initiator to the target, they do not describe how a target (volume) gets created or configured. For too many years, this was the realm of vendor-specific APIs and GUIs.

SNIA’s Swordfish standard describes how to manage storage targets and make them accessible as NVMe-oF targets. It is a REST API with JSON data and, as such, easy to understand and embrace.
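To give a flavor of the API (the endpoint path, credentials and payload are illustrative, not taken from a specific implementation), listing and creating volumes boils down to simple HTTP calls:

# List the volumes of a Swordfish storage service
curl -sk -u admin:secret https://bmc.example.com/redfish/v1/StorageServices/1/Volumes

# Create a 100 GiB volume with a JSON POST
curl -sk -u admin:secret -H 'Content-Type: application/json' \
     -X POST https://bmc.example.com/redfish/v1/StorageServices/1/Volumes \
     -d '{"Name": "vol0", "CapacityBytes": 107374182400}'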

The major drawback of Swordfish is that it is defined as an extension of Redfish. Redfish is a standard for managing servers over the network; it can be thought of as a modernized IPMI. As such, Redfish is usually implemented on a Baseboard Management Controller (BMC). While Redfish has its advantages over IPMI, it does not provide something completely new.

Swordfish, on the other hand, is something that was not there before. But since it is an extension of Redfish, implementing it usually requires the machine to have a Redfish-enabled BMC, which may hinder or slow down the adoption of Swordfish.

LINSTOR

Since version 0.7, LINSTOR has been capable of working with storage provided by Swordfish-compliant storage targets, as well as with their initiator counterparts.

Summary

LINSTOR has gained the capability of managing storage on Swordfish/NVMe-oF targets, in addition to working with DRBD and direct-attached storage on Linux servers.

Philipp Reisner