

Optimizing DRBD for Persistent Memory

Persistent Memory (PMEM) is an exciting storage tier with much lower latency than SSDs. LINBIT has optimized DRBD for the case where its metadata is stored on a PMEM/NVDIMM device.

This article relates to both:

  • Traditional NVDIMM-N: Some DRAM is accompanied by NAND-flash. On power failure, a backup power source (supercap, battery) is used to save the contents of the DRAM to the flash storage. When the main power is restored, the contents of the DRAM are restored. These components have exactly the same timing characteristics as DRAM and are available in sizes of 8GB, 16GB or 32GB per DIMM.
  • Intel’s new Optane DC Persistent Memory: These DIMMs are built using a new technology called 3D XPoint. It is inherently non-volatile and has only slightly higher access times than pure DRAM. It comes in much higher capacities than traditional NVDIMMs: 128GB, 256GB and 512GB.

DRBD requires metadata to keep track of which blocks are in sync with its peers. This metadata consists of two main parts. One part is the bitmap, which tracks exactly which 4KiB blocks may be out of sync. It is used when a peer is disconnected to minimize the amount of data that must be synced when the nodes reconnect. The other part is the activity log, which tracks which data regions have ongoing I/O or had I/O activity recently. It is used after a crash to ensure that the nodes are fully in sync. It consists of 4MiB extents which, by default, cover about 5GiB of the volume.

Since the DRBD metadata is small and frequently accessed, it is a perfect candidate to be placed on PMEM. A single 8GiB NVDIMM can store enough metadata for 100 volumes of 1TiB each, allowing for replication between 3 nodes.
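As a rough sanity check of that figure: the bitmap uses one bit per 4KiB block per peer, so a 1TiB volume needs about 32MiB of bitmap per peer, or roughly 64MiB when replicating to two peers; 100 such volumes therefore need well under 8GiB, even with the activity log and metadata headers included. Placing the metadata on the NVDIMM is then a matter of pointing meta-disk at the PMEM device in the resource configuration. The following is a minimal sketch; the host names, addresses, and device names (including the /dev/pmem0p1 partition) are placeholders, and with many volumes you would typically carve the NVDIMM into one small partition or logical volume per DRBD volume:

resource r500 {
    device      /dev/drbd500;
    disk        /dev/nvme0n1p1;   # data on NVMe
    meta-disk   /dev/pmem0p1;     # external metadata on a partition of the NVDIMM
    on alpha {
        node-id 0;
        address 192.168.0.10:7789;
    }
    on bravo {
        node-id 1;
        address 192.168.0.11:7789;
    }
}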

PMEM outperforms

DRBD 9 has been optimized to access metadata on PMEM directly using memory operations. This approach is extremely efficient and leads to significant performance gains. The improvement is most dramatic when the metadata is updated most often, which occurs when writes are performed serially, that is, with an I/O depth of 1. In that case, scattered I/O forces the activity log to be updated on every write. Here we compare the performance between metadata on a separate NVMe device and metadata on PMEM, with and without the optimizations.
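For reference, this serial-write workload corresponds to the fio command from the technical details below with $IODEPTH set to 1:

fio --name=test --rw=randwrite --direct=1 --numjobs=8 --ioengine=libaio --iodepth=1 --bs=4k --time_based=1 --runtime=60 --size=48G --filename=/dev/drbd500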

As can be seen, placing the DRBD metadata on a PMEM device results in a massive performance boost for this kind of workload.

[Chart: serial (I/O depth 1) write performance with DRBD metadata on NVMe vs. PMEM, with and without the PMEM optimizations]

Impact with concurrent I/O

When I/O is submitted concurrently, DRBD does not have to access the metadata as often. Hence we do not expect the performance impact to be quite as dramatic. Nevertheless, there is still a significant performance boost, as can be seen.

If you have a workload with very high I/O depth, you may wish to trial DRBD 10, which performs especially well in such a situation. See https://www.linbit.com/en/drbd10-vs-drbd9-performance/.

[Chart: write performance at higher I/O depths with DRBD metadata on NVMe vs. PMEM]

Technical details

The above tests were carried out on a pair of 16-core servers equipped with NVMe storage and a direct Ethernet connection. Each server had an 8GiB DDR4 NVDIMM from Viking installed. DRBD 9.0.17 was used to perform the tests without the PMEM optimizations and DRBD 9.0.20 for the remainder. I/O was generated using the fio tool with the following parameters:

fio --name=test --rw=randwrite --direct=1 --numjobs=8 --ioengine=libaio --iodepth=$IODEPTH --bs=4k --time_based=1 --runtime=60 --size=48G --filename=/dev/drbd500

 

If you have technical questions, don't hesitate to subscribe to our email list.

Joel Colledge
Joel is a software developer at LINBIT with a background in mathematics. A polyglot programmer, Joel enjoys working with many different languages and technologies. At LINBIT, he has been involved in the development of LINSTOR and DRBD. Originally from England, Joel is now based in Vienna, Austria.

Performance Gains with DRBD 10

A key factor in evaluating storage systems is their performance. LINBIT has been working to further improve the performance of DRBD. The recent DRBD 10 alpha release demonstrates significant gains.

The performance gains particularly help with highly concurrent workloads. This is an area that has been steadily rising in importance and looks set to continue to rise: improvements in single-core speed appear to be stagnating, while ever larger numbers of cores are becoming available. Hence software systems need to utilize concurrency effectively to make the most of the available computing resources.

We tested DRBD 10 with 4K random writes and various concurrency levels. In this test, the data is being replicated synchronously (“protocol C”) between two nodes. These numbers are for a single volume, not an aggregate over many volumes. I/O was generated by 8 processes. The tests show improvements in raw random write performance of up to 68%.

[Chart: DRBD 9 vs. DRBD 10 4K random write performance at various I/O depths]

These improvements were achieved by using a finer-grained locking scheme. This allows, for instance, one core to be sending a request while a second core is submitting the next request. The result is better utilization of the available cores and overall higher throughput.

Technical details

The above tests were carried out on a pair of 16-core servers equipped with NVMe storage and a direct Ethernet connection. The software versions used were DRBD 10.0.0a1 and its most recent ancestor from the DRBD 9 branch (8e93a5d93b62). I/O was generated using the fio tool with the following parameters:

fio --name=test --rw=randwrite --direct=1 --numjobs=8 --ioengine=libaio --iodepth=$IODEPTH --bs=4k --time_based=1 --runtime=60 --size=48G --filename=/dev/drbd500

Ongoing development on DRBD 10

LINBIT is working on a number of exciting major features for DRBD 10.

  • Request forwarding. DRBD will send data to a geographically distant site only once; it will then be replicated onwards within that site.
  • PMEM journaling. DRBD can already access its metadata in a PMEM optimized fashion. That will be extended to using a PMEM device as a write-back cache, resulting in improved performance in latency-sensitive scenarios.
  • Erasure coding. DRBD will be able to erasure code and distribute its data. This provides the same functionality as RAID5/6, but with an arbitrary number of parity nodes. The result is lower disk usage with similar redundancy characteristics.

Stable releases of DRBD 10 are planned for 2020 – until then stay tuned for upcoming updates!

 

Joel Colledge

Coming Soon, a New DRBD Proxy Release

One major part of LINBIT Disaster Recovery is DRBD Proxy, which helps DRBD with long-distance real-time replication. DRBD Proxy mitigates bandwidth, latency, and distance issues by buffering writes into memory, ensuring that your WAN latency doesn’t become your disk throughput.

The upcoming release of DRBD Proxy will come with a few new tools to improve data replication with compression. Its LZ4 plugin has been updated to the latest version, 1.9.0, and the Zstandard algorithm has been added as a brand-new plugin.

Both offer a great balance of compression ratio and speed while delivering higher replication performance on the DRBD end. In our test cases, both performed considerably better than no compression in overall read and write operations.

Here’s a short synopsis of some of the tests we ran. For this setup, we built a two-node, geographically separated DRBD cluster. Both nodes ran the latest, yet-to-be-released version of DRBD Proxy for various I/O tests. The compression level for Zstandard was 3, which is the default (levels range from 0 to 22). LZ4 was set to level 9, which is the maximum level.
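For illustration, the compression plugin is selected in the proxy section of the DRBD resource configuration. The snippet below is a sketch following the syntax used by the existing plugins (zlib, lzma); the exact keywords and level ranges for the new lz4 and zstd plugins, as well as the memlimit value, are assumptions here and should be checked against the documentation shipped with the release:

resource r0 {
    net {
        protocol A;          # asynchronous replication, as used with DRBD Proxy
    }
    proxy {
        memlimit 512M;       # RAM buffer for writes in flight over the WAN
        plugin {
            zstd level 3;    # or: lz4 level 9;
        }
    }
    # ... hosts and "proxy on" sections with inside/outside addresses omitted
}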

MySQL Read Write Operations with Sysbench

In this scenario, we used sysbench to perform random reads and writes to a MySQL database replicated between the nodes with DRBD and DRBD Proxy. Sysbench created a test database on a 200MB DRBD volume backed by thin LVM and then performed random transactions for 100 seconds.

The improved number of writes and overall transactions with compression is pretty clear compared to the ‘Proxy Only’ numbers. Interestingly, LZ4 and Zstandard both performed quite similarly.

MySQL Average Latency on MySQL RW Tests

The average latency from the same MySQL tests showed another interesting fact. When using DRBD Proxy, DRBD uses protocol A, an asynchronous replication mode. In this test, that setup performed quite nicely compared to replicating with protocol C, the default synchronous mode. All three proxy configurations, regardless of compression, performed very well against synchronous mode. The different modes of DRBD transport are explained here.
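For context, the replication mode is selected with the protocol keyword in the resource's net section; a minimal sketch:

resource r0 {
    net {
        protocol A;    # asynchronous: a write completes once it reaches local storage and the TCP send buffer (or proxy)
        # protocol C;  # synchronous (default): a write completes only after the peer confirms it on stable storage
    }
    # ...
}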

Other random IO tests performed with sysbench on the file system as well as fio tests at the block level mirrored the results shown above, where compression with proxy helped greatly with reducing the network payload while increasing overall read/write performance.

This was a quick preview of the upcoming DRBD Proxy release highlighting its compression plugins. Please stay tuned for the release announcement, and for any questions or comments, feel free to reach me in the comments below.

P.S. The test nodes were configured relatively lightly. The local node was a 4-core VM with 1GB of RAM running Ubuntu 18.04 and DRBD v9.0.18. The remote node was a 4-core VM with 4GB of RAM running the same OS and DRBD version. The WAN link was restricted to 2MB/s. The relevant sysbench commands used were:

sysbench /usr/share/sysbench/oltp_read_write.lua --mysql-db=sbtest --db-driver=mysql --tables=10 --table-size=1000 prepare
sysbench /usr/share/sysbench/oltp_read_write.lua --mysql-db=sbtest --db-driver=mysql --tables=10 --table-size=1000 --report-interval=10 --threads=4 --time=100 run
Woojay Poynter
IO Plumber
Woojay is working on data replication and software-defined storage with LINSTOR, built on DRBD @LINBIT. He has worked on web development, embedded firmware, professional culinary education, and power carving with ice and wood. He is a proud father and likes to play with Lego.

DRBD/LINSTOR vs Ceph – a technical comparison

INTRODUCTION

The aim of this article is to give you some insight into CEPH, DRBD, and LINSTOR by outlining their basic functions. The following points should help you compare these products and understand which is the right solution for your system. Before we start, you should be aware that LINSTOR is made for DRBD, and it is highly recommended that you use LINSTOR if you are also using DRBD.

DRBD

DRBD works by inserting a thin layer between the file system, the buffer cache, and the disk driver. The DRBD kernel module captures all requests from the file system and splits them into two paths. So, how does the actual communication occur? How do two separate servers optimize data protection?

DRBD facilitates this by mirroring data between two separate servers. One server is usually passive and holds a direct copy of the other. Any data written to the primary server is simultaneously copied to the secondary server through a real-time communication system; the passive server immediately replicates any changes made to the data.

DRBD 8.x works on two nodes at a time. One is given the role of the primary node while the other is given a secondary role. Reads and writes can only occur on the primary node.
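A minimal two-node resource configuration illustrates this layering; host names, addresses, and device names are placeholders:

resource r0 {
    protocol  C;             # synchronous replication
    device    /dev/drbd0;    # the replicated block device applications use
    disk      /dev/sdb1;     # local backing disk
    meta-disk internal;
    on alpha {
        address 192.168.0.10:7789;
    }
    on bravo {
        address 192.168.0.11:7789;
    }
}

The resource is brought up on both nodes with drbdadm up r0, and one node is promoted with drbdadm primary r0.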

THE BENEFITS OF DRBD 9

The features of DRBD 9.x are a vast improvement over the 8.x version. It is now possible to have up to 32 replicas, including the primary node. This gives you the ability to build your cluster setup with what we call diskless nodes, meaning you don’t have to use storage on your primary node. The primary node in diskless mode still has a DRBD block device, but the data is accessed on the secondary nodes over the network.

The secondary nodes must not mount the file system, not even in read-only mode. While it is true to say that the secondary nodes see all updates on the primary node, they can’t expose these updates to the file system, as DRBD is completely file system agnostic.

One write goes to the actual disk and another to the mirrored disk on a peer node. If the first one fails, the file system can be brought up on one of the other nodes and the data will be available for use.

DRBD has no precise knowledge of the file system and, as such, it has no way of communicating the changes upstream to the file system driver. The two-at-a-time rule does not actually limit DRBD from operating on more than two nodes.

Moreover, DRBD 9.x supports multiple peer nodes, meaning one peer might be a synchronous mirror in the local data center while another might be an asynchronous mirror in a remote site.

Again, the passive server only becomes functional when the primary one fails. When such a failure occurs, Pacemaker immediately recognizes it and shifts services to the secondary server. This shifting process, nevertheless, is optional: it can be either manual or automatic. For users who prefer manual failover, an administrator must authorize the switch to the passive server when the primary one fails.
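As an illustration of automatic failover, a Pacemaker configuration along the following lines (crm shell syntax; resource names are examples) lets the cluster promote DRBD on one node and move the primary role when that node fails:

primitive p_drbd_r0 ocf:linbit:drbd \
    params drbd_resource=r0 \
    op monitor interval=29s role=Master \
    op monitor interval=31s role=Slave
ms ms_drbd_r0 p_drbd_r0 \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

File system and service resources are then colocated with, and ordered after, the Master role of ms_drbd_r0.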

LINSTOR

In larger IT infrastructures, cluster management software is state of the art. This is why LINBIT developed LINSTOR, software that sits on top of DRBD. DRBD itself is a perfect tool to replicate and access your data, especially when it comes to performance. LINSTOR makes configuring DRBD on a system with more than a few nodes an easy task: it manages DRBD and gives you the ability to set it up on a large system.

LINSTOR uses a controller service for managing your cluster and a satellite service, which runs on every node, for deploying DRBD. The controller can be accessed from every node and enables you to monitor and configure your setup quickly. It can be controlled over REST from the outside and provides a very clear CLI. Furthermore, the LINSTOR REST API gives you the ability to use LINSTOR volumes in Kubernetes, Proxmox VE, OpenNebula, and OpenStack.
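As a small example of that workflow (node names, addresses, storage pool, and sizes are placeholders), deploying a volume replicated across three nodes from the controller looks roughly like this:

# register the cluster nodes with the controller
linstor node create alpha 192.168.0.10
linstor node create bravo 192.168.0.11
linstor node create charlie 192.168.0.12

# define an LVM-backed storage pool on each node
linstor storage-pool create lvm alpha pool_ssd vg_ssd
linstor storage-pool create lvm bravo pool_ssd vg_ssd
linstor storage-pool create lvm charlie pool_ssd vg_ssd

# define a resource with one 100GiB volume and let LINSTOR place it on three nodes
linstor resource-definition create res0
linstor volume-definition create res0 100G
linstor resource create res0 --auto-place 3 --storage-pool pool_ssd

LINSTOR then generates the DRBD configuration, creates the backing volumes, and brings the resource up on the selected nodes.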

LINSTOR also has a feature that keeps the system available during maintenance: the control plane is separated from the data plane, so if you want to upgrade or maintain LINSTOR, there is no downtime for the volumes. In comparison with Ceph, DRBD and LINSTOR are easier to troubleshoot, recover, repair, and debug, and easier to intervene in manually if required, largely due to their simplicity. For system administrators, the better maintainability and less complex environment can be crucial, and the higher availability also results in better reliability. For instance, DRBD can be started and stopped manually even if LINSTOR is offline, or, for recovery purposes, the data can even be accessed without DRBD installed (simply mount the backend storage). By comparison, trying to find any of your data on disks managed by Ceph can be quite a challenge if your Ceph system is down.

In summary, if you’re looking for increased performance, fast configuration, and filesystem-based storage for your applications, use LINSTOR and DRBD. If you’re looking to run LINSTOR with HA, however, you must use a third-party software such as Pacemaker.

 

CEPH

CEPH is an open source software intended to provide highly scalable object, block, and file-based storage in a unified system.

CEPH consists of a RADOS cluster and its interfaces. The RADOS cluster is a system with services for monitoring and storing data across many nodes. CEPH/RADOS is an object storage cluster with no single point of failure. This is achieved by using an algorithm which cuts the data into blocks and spreads them across the RADOS cluster using self-managing services. The CRUSH algorithm is used to spread the data on upload and to put the blocks back together when an object is requested. CEPH is able to use simple data replication as well as erasure coding for those striped blocks.

On top of the RADOS cluster, LIBRADOS is used to upload data to or request data from the cluster. CEPH uses LIBRADOS for its interfaces CEPHFS, RBD, and RADOSGW.

CEPHFS gives you the ability to create a file system on a host where the data is stored in the CEPH cluster. Additionally, to use CEPHFS, CEPH needs metadata servers, which manage the metadata and balance the request load among each other.

RBD, the RADOS block device, is used to create virtual block devices on hosts, with a CEPH cluster managing and storing the data in the background. Since RBD is built on LIBRADOS, it inherits LIBRADOS's abilities, including read-only snapshots and revert-to-snapshot. By striping images across the cluster, CEPH improves read-access performance for large block device images. The block device can be virtualized, providing block storage to virtual machines in virtualization platforms such as Apache CloudStack, OpenStack, OpenNebula, Ganeti, and Proxmox Virtual Environment.
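For comparison with the DRBD examples above, creating and mapping an RBD image on a host looks roughly like this (pool and image names are placeholders):

# create a replicated pool and a 10GiB image in it
ceph osd pool create rbdpool 128
rbd create rbdpool/vm-disk-1 --size 10G
# map the image to a local block device (e.g. /dev/rbd0) via the kernel RBD client
rbd map rbdpool/vm-disk-1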

RADOSGW is the REST-API for communicating with CEPH/RADOS when uploading and requesting data from the cluster.

In general, CEPH is an object storage cluster with the advantage that you do not have to worry about failing nodes or storage drives: CEPH recognizes failing devices and instantly replicates the data to another disk, where it will be accessed. However, this also leads to heavy network load when devices fail.

Striping data comes with a disadvantage: it is not possible to access the data on a storage drive by mounting it somewhere else, or without a working CEPH cluster.

In conclusion, CEPH is the right solution if you are looking for object storage in your infrastructure. Due to its complexity, you have to expect lower performance in comparison to DRBD, which is limited only by your network speed.

Daniel Kaltenböck
Software Engineer at LINBIT HA Solutions GmbH
Daniel Kaltenböck studied technical computer science at the Vienna University of Technology. He is a software engineer by heart with a special focus on software defined storage.

RDMA Performance

As promised in the previous RDMA post, we gathered some performance data for the RDMA transport. Read and enjoy! Read more

What is RDMA, and why should we care?

DRBD 9 has a new transport abstraction layer designed for speed; apart from SSOCKS and TCP, the next-generation link will be RDMA.

So, what is RDMA, and how is it different from TCP? Read more

Benchmarking DRBD

We often see people on #drbd or on drbd-user trying to measure the performance of their setup. Here are a few best practices to do this.  Read more

DRBD and SSD: I was made for loving you

When DRBD 8.4.4 integrated TRIM/Discard support, a lot of things got much better… for example, 700MB/sec over a 1GBit/sec connection. Read more

DRBD Proxy 3.1: Performance improvements

The threading model in DRBD Proxy 3.1 received a complete overhaul; below you can see the performance implications of these changes. Read more