IOPS World Record Broken – LINBIT Tops 14.8 million IOPS

In a performance test, LINBIT measured 14.8 million IOPS on a 12-node cluster built from standard off-the-shelf Intel servers. This is the highest storage performance reported for a hyper-converged system on this class of hardware. Even a small LINBIT storage system can provide millions of IOPS at latencies of a fraction of a millisecond. For real-world applications, these figures translate into outstanding application performance.

Test setup

LINBIT chose this setup because our competitors have published test results from equivalent systems, which makes it easy to compare the strengths of each software offering under fair conditions and in the same environment. We worked hard to get the most out of the system, and it paid off: Microsoft managed to reach 13.7 million IOPS, and StorPool marginally topped that with 13.8 million IOPS. We reached 14.8 million remote read IOPS – a jump of 7.2%! “Those performance numbers mark a milestone in the development of our software. The results prove we speed up High Availability at a large scale,” says CEO Philipp Reisner. The numbers would scale up even further with a larger setup.


These exciting results are for 3-way synchronous replication using DRBD. The test cluster was provided through the Intel® Data Center Builders program. It consists of 12 servers, each running 8 instances of the benchmark, making a total of 96 instances. The setup is hyper-converged, meaning that the same servers are used to run the benchmark and to provide the underlying storage.

 


For some benchmarks, one of the storage replicas is a local replica on the same node as the benchmark workload itself. This is a particularly effective configuration for DRBD.

DRBD provides a standard Linux block device, so it can be used directly from the host, from a container, or from a virtual machine. For these benchmarks, the workload runs in a container, demonstrating the suitability of LINBIT’s SDS solution, which consists of DRBD and LINSTOR, for use with Kubernetes.

IOPS and bandwidth results are the totals from all 96 workload instances. Latency results are averaged.

Let’s look into the details!

Top performance with DRBD

5.0 million synchronously replicated write IOPS

This was achieved with a 4K random write benchmark with an IO depth of 64 for each workload. The setup uses Intel® Optane™ DC Persistent Memory to store the DRBD metadata. The writes are 3-way replicated with one local replica and two remote replicas. This means that the backing storage devices are writing at a total rate of 15 million IOPS.

85μs synchronously replicated write latency

This was achieved with a 4K random write benchmark with serial IO. That is, an IO depth of 1. This means that the writes were persisted to all 3 replicas within an average time of only 85μs. DRBD attained this level of performance both when one of the replicas was local and when all were remote. The setup also uses Intel® Optane™ DC Persistent Memory to store metadata.

14.8 million remote read IOPS

This was achieved with a 4K random read benchmark with an IO depth of 64. This corresponds to 80% of the total theoretical network bandwidth of 75GB/s. This result was reproduced without using persistent memory, so that the value can be compared with our competitors’ results.

10.6 million IOPS with 70/30 mixed read/write

Representing a more typical real-world scenario, this benchmark consists of 70% reads and 30% writes and used an IO depth of 64. One of the 3 replicas was local.


Benefits of persistent memory

DRBD is optimized for persistent memory. When the DRBD metadata is stored on an NVDIMM, write performance is improved.

When the metadata is stored on the backing storage SSD with the data, DRBD can process 4.5 million write IOPS. This increases to 5.0 million when the metadata is stored on Intel® Optane™ DC Persistent Memory instead, an improvement of 10%.

Moving the metadata onto persistent memory has a particularly pronounced effect on the write latency. This metric plummets from 113μs to 85μs with this configuration change. That is, the average write is 25% faster.

Detailed results

Below are the full results for DRBD running on the 12 servers with a total of 96 benchmark workloads.

Benchmark name | Without local replica | With local replica
Random read (higher is better) | 14,800,000 IOPS | 22,100,000 IOPS
Random read/write 70/30 (higher is better) | 8,610,000 IOPS | 10,600,000 IOPS
Random write (higher is better) | 4,370,000 IOPS | 5,000,000 IOPS
Sequential read (higher is better) | 64,300 MB/s | 111,000 MB/s
Sequential write (higher is better) | 20,700 MB/s | 23,200 MB/s
Read latency (lower is better) | 129 μs | 82 μs
Write latency (lower is better) | 85 μs | 84 μs

The IOPS and MB/s values have been rounded down to 3 significant figures.

All volumes are 500GiB in size, giving a total active set of 48,000GiB and consuming a total of 144,000GiB of the underlying storage. The workloads are generated using the fio tool with the following parameters:

Benchmark type | Block size | IO depth | Workload instances | Total active IOs
Random | 4K | 64 | 96 | 6144
Sequential | 128K | 16 | 96 | 1536
Latency | 4K | 1 | 96 | 96
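
For reference, a single instance of the random 70/30 benchmark might be started roughly as follows. This is only a sketch based on the parameters above; the device path and the exact option set used in the test are assumptions.

# One workload instance: 4K random IO, 70% reads, at an IO depth of 64 against a DRBD device.
# /dev/drbd1000 is an example device path; the 600 s runtime matches the 10-minute test duration.
fio --name=randrw7030 --filename=/dev/drbd1000 \
    --rw=randrw --rwmixread=70 --bs=4k --iodepth=64 \
    --ioengine=libaio --direct=1 --time_based=1 --runtime=600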

Quality controls

In order to ensure that the results are reliable, the following controls were applied:

  • The entire dataset was written after allocating the volumes, but before running the tests (see the prefill sketch after this list). This prevents artificially fast reads of unallocated blocks. When the backing device driver or firmware recognizes that an unallocated block is being read, it may simply return zeros without reading from the physical medium.
  • The benchmark uses direct IO to bypass the operating system cache and the working set was too large to be cached in memory in any case.
  • The tests were each run for 10 minutes. The metrics stabilized within a small proportion of this time.
  • The measurements were provided by the benchmarking tool itself, rather than being taken from a lower level such as the DRBD statistics. This ensures that the performance corresponds to that which a real application would experience.
  • The random pattern used for the benchmark used a random seed to avoid any bias due to the same blocks being chosen by subsequent test runs.
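
A prefill along the lines of the first control could be done with fio itself; the device path below is an example.

# Write the whole device once so that no unallocated-block shortcuts can inflate read results.
# /dev/drbd1000 is an example device path.
fio --name=prefill --filename=/dev/drbd1000 \
    --rw=write --bs=1M --iodepth=16 \
    --ioengine=libaio --direct=1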

Software stack

The following key software components were used for these benchmarks:

  • Distribution: CentOS 8.0.1905
  • Kernel: Linux 4.18.0-80.11.2.el8_0.x86_64
  • LVM from distribution kernel
  • DRBD 9.0.21-1
  • Docker 19.03.5
  • Fio 3.7

Terminology

In this text, and at LINBIT in general, we use the expression “2 replicas” to indicate that the data is stored on 2 storage devices. For these tests, there are 3 replicas, meaning that the data is stored on 3 storage devices.

In other contexts, the expression “2 replicas” might mean one original plus 2 replicas, in which case the data would be stored on 3 storage devices.

Test infrastructure

These results were obtained on a cluster of 12 servers made available as part of the Intel® Data Center Builders program. Each server was equipped with the following configuration:

  • Processor: 2x Intel® Xeon Platinum 8280L CPU
  • Memory: 384GiB DDR4 DRAM
  • Persistent memory: 4x 512GB Intel® Optane™ DC Persistent Memory
  • Storage: At least 4x Intel® SSD DC P4510 of at least 4TB
  • Network: Intel® Ethernet Network Adapter XXV710 with dual 25GbE ports

The servers were all connected in a simple star topology with a 25Gb switch.


Speed is of the essence

Storage has often been a bottleneck in modern IT environments. The two requirements, speed and high availability, have always been in competition: if you aim for maximum speed, the quality of the high availability tends to suffer, and vice versa. With this performance test, we demonstrate a best-of-breed open source software-defined storage solution. A replicated storage system that combines high availability with the performance of local NVMe drives is now possible.

This technology enables any public and private cloud builder to deliver high performance for their applications, VMs and containers. If you aim to build a powerful private or public cloud, our solution meets your storage performance needs.

If you want to learn more or have any questions, do contact us at [email protected]

 

Joel Colledge on Linkedin
Joel Colledge
Joel is a software developer at LINBIT with a background in mathematics. A polyglot programmer, Joel enjoys working with many different languages and technologies. At LINBIT, he has been involved in the development of LINSTOR and DRBD. Originally from England, Joel is now based in Vienna, Austria.

 


Optimizing DRBD for Persistent Memory

Persistent Memory (PMEM) is an exciting storage tier with much lower latency than SSDs. LINBIT has optimized DRBD for when its metadata is stored on PMEM/NVDIMM.

This article relates to both:

  • Traditional NVDIMM-N: Some DRAM is accompanied by NAND-flash. On power failure, a backup power source (supercap, battery) is used to save the contents of the DRAM to the flash storage. When the main power is restored, the contents of the DRAM are restored. These components have exactly the same timing characteristics as DRAM and are available in sizes of 8GB, 16GB or 32GB per DIMM.
  • Intel’s new Optane DC Persistent Memory: These DIMMs are built using a new technology called 3D XPoint. It is inherently non-volatile and has only slightly higher access times than pure DRAM. It comes in much higher capacities than traditional NVDIMMs: 128GB, 256GB and 512GB.

DRBD requires metadata to keep track of which blocks are in sync with its peers. This consists of 2 main parts. One part is the bitmap which keeps track of exactly which 4KiB blocks may be out of sync. It is used when a peer is disconnected to minimize the amount of data that must be synced when the nodes reconnect. The other part is the activity log. This keeps track of which data regions have ongoing I/O or had I/O activity recently. It is used after a crash to ensure that the nodes are fully in sync. It consists of 4MiB extents which, by default, cover about 5GiB of the volume.

Since the DRBD metadata is small and frequently accessed, it is a perfect candidate to be placed on PMEM. A single 8GiB NVDIMM can store enough metadata for 100 volumes of 1TiB each, allowing for replication between 3 nodes.
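
As a rough sanity check of that claim: the bitmap uses one bit per 4KiB block per peer, so a 1TiB volume replicated to two peers needs about 64MiB of bitmap, and 100 such volumes need roughly 6.25GiB, which indeed fits on an 8GiB NVDIMM. Placing the metadata on PMEM is a matter of pointing DRBD’s meta-disk at the PMEM device. The following is a minimal sketch of a two-node resource; the device paths, hostnames, addresses, and ports are examples, not a tested configuration.

# Sketch: a DRBD resource with its data on NVMe and its metadata on a PMEM device.
# All names, paths, and addresses below are examples.
cat > /etc/drbd.d/r0.res <<'EOF'
resource r0 {
    volume 0 {
        device      /dev/drbd500;
        disk        /dev/nvme0n1p1;
        meta-disk   /dev/pmem0;
    }
    on alpha {
        node-id 0;
        address 192.168.1.1:7789;
    }
    on bravo {
        node-id 1;
        address 192.168.1.2:7789;
    }
    connection-mesh {
        hosts alpha bravo;
    }
}
EOF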

PMEM outperforms

DRBD 9 has been optimized to access metadata on PMEM directly using memory operations. This approach is extremely efficient and leads to significant performance gains. The improvement is most dramatic when the metadata is most often updated. This occurs when writes are performed serially. That is, the I/O depth is 1. When this is the case, scattered I/O forces the activity log to be updated on every write. Here we compare the performance between metadata on a separate NVMe device, and metadata on PMEM with and without the optimizations.

As can be seen, placing the DRBD metadata on a PMEM device results in a massive performance boost for this kind of workload.

[Chart: serial write IOPS – metadata on a separate NVMe device vs. on PMEM, with and without the optimizations]

Impact with concurrent I/O

When I/O is submitted concurrently, DRBD does not have to access the metadata as often. Hence we do not expect the performance impact to be quite as dramatic. Nevertheless, there is still a significant performance boost, as can be seen.

If you have a workload with very high I/O depth, you may wish to trial DRBD 10, which performs especially well in such a situation. See https://www.linbit.com/en/drbd10-vs-drbd9-performance/.

[Chart: write IOPS with concurrent I/O – metadata on a separate NVMe device vs. on PMEM]

Technical details

The above tests were carried out on a pair of 16 core servers equipped with NVMe storage and a direct ethernet connection. Each server had an 8GiB DDR4 NVDIMM from Viking installed. DRBD 9.0.17 was used to perform the tests without the PMEM optimizations and DRBD 9.0.20 for the remainder. I/O was generated using the fio tool with the following parameters:

fio --name=test --rw=randwrite --direct=1 --numjobs=8 \
    --ioengine=libaio --iodepth=$IODEPTH --bs=4k --time_based=1 \
    --runtime=60 --size=48G --filename=/dev/drbd500

 

If you have technical questions, don’t hesitate to subscribe to our email list.

Joel Colledge on Linkedin
Joel Colledge
Joel is a software developer at LINBIT with a background in mathematics. A polyglot programmer, Joel enjoys working with many different languages and technologies. At LINBIT, he has been involved in the development of LINSTOR and DRBD. Originally from England, Joel is now based in Vienna, Austria.

Monitoring Linux HA Clusters with Prometheus

You’ve likely heard of Prometheus, the open-source monitoring and alerting solution originally developed at SoundCloud. Prometheus was the second project incubated by the Cloud Native Computing Foundation in 2016 (the first being Kubernetes), and is used by companies like Google, Red Hat, and Digital Ocean as a scalable and efficient way to monitor infrastructure in a cloud-native way.  If this is the first you’ve heard of Prometheus, I would strongly recommend heading over to Prometheus.io to read through the introduction in their documentation.

ClusterLabs, the organization unifying the collection of open-source projects pertaining to Linux High Availability (HA) Clustering, recently added the ha_cluster_exporter project to the organization’s GitHub (Fall 2019). The ha_cluster_exporter project is a Prometheus exporter that exposes Linux HA Cluster metrics to Prometheus!

Linux HA Cluster with Prometheus

Linux HA Clustering is all about service availability and uptime, so you won’t have users informing you when a node in your cluster has died since they shouldn’t notice. Due to this, LINBIT has heard the following story a few times, “No one knew the Primary cluster node had failed until the Secondary node failed as well.” Ouch. Besides offering our condolences in what likely was a resume generating event (RGE) for someone, we usually can only suggest that they set up better monitoring and alerting of their cluster nodes via the software of their choice. However, after less than a day of playing with Prometheus and the ha_cluster_exporter, these tools may have just jumped to the top of my list of recommended monitoring stacks.

After running through Prometheus’ installation documentation and compiling the ha_cluster_exporter as documented in the project’s README.md, I was quickly querying Prometheus for my Linux HA Cluster’s metrics (a sample scrape configuration is sketched after the list below):

  • How synchronized is DRBD (kicked off a full resync prior to capturing):
    DRBD synchronization graph
  • What’s the Pacemaker fail count on a given resource (p_vip_ip in this capture):
    Pacemaker resource fail count
  • Have the Corosync rings (communication networks) experienced any errors:
    Corosync ring errors
  • Adding Prometheus’ Node Exporter – an open-source exporter developed under the Prometheus GitHub org – to my Linux HA Cluster nodes enabled me to extract metrics from the Linux Kernel (writes to DRBD in MB/s):
    DRBD writes MB/s
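
To give an idea of the plumbing involved, a minimal Prometheus configuration scraping two cluster nodes could look like the sketch below. The hostnames and the exporter’s listen port are assumptions; check the ha_cluster_exporter documentation for the port it actually uses.

# Minimal Prometheus configuration scraping the HA cluster nodes (hostnames and port are examples).
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'ha_cluster'
    static_configs:
      - targets: ['node-a:9664', 'node-b:9664']
EOF
./prometheus --config.file=prometheus.yml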

Alertmanager in Prometheus

For a complete list of the metrics scraped by the ha_cluster_exporter, please see the ha_cluster_exporter’s documentation on GitHub. For a complete list of the metrics scraped by Prometheus’ Node Exporter, as well as how to build and install it, please see the Prometheus Node Exporter’s documentation on GitHub.

Using Prometheus to collect and query these metrics is a great first step, but alerting is probably the most important, and often the most difficult to configure, aspect of monitoring. Prometheus handles alerting in a separate app called Alertmanager. You configure your alerting rules inside of Prometheus, and Prometheus sends its alerts to an Alertmanager instance. The Alertmanager then, based on the rules you’ve configured, deduplicates, groups, and routes alerts to the correct receiver, which can send them out via email, Slack, PagerDuty, Opsgenie, HipChat, and more. The Prometheus Alertmanager configuration documentation has example configs for all of the aforementioned alert receivers.

I was able to quickly set up a Slack Webhook app in a Slack channel named #prometheus-slack, and add an alert rule to Prometheus instructing it to send an alert to Slack when the sum of all Pacemaker’s resources’ failcounts exceeds zero for longer than five minutes. I created a resource failure in my Pacemaker cluster by removing the cluster managed virtual IP address from the active cluster node, and five minutes later – as configured – I received a message in Slack with a link to the alert specifics in Prometheus.
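
A rule of that shape could look roughly like the following. The metric name exposed by ha_cluster_exporter is an assumption here, so verify it against the exporter’s metric list before using it.

# Alerting rule: fire when the summed Pacemaker failcount stays above zero for five minutes.
# The metric name ha_cluster_pacemaker_fail_count is an assumption; check the exporter's metric list.
cat > ha_cluster_alerts.yml <<'EOF'
groups:
  - name: ha-cluster
    rules:
      - alert: PacemakerResourceFailures
        expr: sum(ha_cluster_pacemaker_fail_count) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pacemaker reports failed resource operations"
EOF

The rules file is then referenced from the rule_files section of prometheus.yml, and the Alertmanager routing configuration decides where the notification ends up – Slack, in this case.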

Integrating Grafana

The last – but probably the most satisfying – part of my testing was integrating Grafana into the monitoring stack for data visualization. Grafana is an open source project that lets you query, collect, alert on, and visualize data gathered from any of its data source plugins (many of which come bundled with Grafana) in a single place, as easily understandable dashboards. Grafana, since version 2.5.0, natively includes Prometheus as one of its data source plugins, so integration is very easy.

In less than 30 minutes, I had set up what I thought was a nice little HA Cluster Dashboard using the same, or very similar, expressions as used when querying Prometheus directly. Grafana makes customizing the visualizations very easy through its web front end. For example, the “Pacemaker Resource Failures” counter value in the HA Cluster Dashboard depicted below will turn from green to red when the counter exceeds zero:

[Screenshot: Grafana HA Cluster Dashboard]

I’ve been very impressed by how easily all these different tools integrate with one another and encourage anyone wondering what they should do for monitoring to test it out. Even though this blog describes monitoring, alerting, and visualizing machine-centric metrics from Linux HA Cluster nodes, the ease of integration fits perfectly into the new world of microservices and cloud-native applications, so use cases and implementations are plentiful. For example, using Prometheus, and Prometheus’ Node Exporter on my Kubernetes worker nodes to expose metrics pertaining to dynamically provisioned LINSTOR volumes is a no-brainer for a future blog post.

If you’re already using Prometheus and the ha_cluster_exporter to monitor your Linux HA Clusters, let us know how in the comments or via email.

Matt Kereczman on Linkedin
Matt Kereczman
Matt is a Linux Cluster Engineer at LINBIT with a long history of Linux System Administration and Linux System Engineering. Matt is a cornerstone in LINBIT’s support team, and plays an important role in making LINBIT’s support great. Matt was President of the GNU/Linux Club at Northampton Area Community College prior to graduating with Honors from Pennsylvania College of Technology with a BS in Information Security. Open Source Software and Hardware are at the core of most of Matt’s hobbies.

How to make DRBD compatible to the Linux kernel

The Linux kernel’s interface towards device drivers is not set in stone. It evolves along with Linux itself, sometimes driven by hardware improvements, sometimes by general evolution of the code base. For in-tree drivers this is not a big issue: they are updated together with the interface modifications, in so-called “tree-wide changes”.

DRBD development happens out of tree before code gets sent upstream. We at LINBIT need to track these tree-wide changes and follow them in the DRBD device driver. But not all of our users run the same kernel.

Some are strictly sticking to the kernel that was shipped with the distribution, running a kernel that is years behind Linus’ version of Linux.

That creates a problem: DRBD should be compatible with both years-old, carefully maintained vendor kernels and the latest and greatest Linus kernel.

In order to do that we have a kernel compatibility layer in DRBD. It contains two main parts:

  1. Detecting the capabilities of the kernel we want to build against. The background is that a “Vendor kernel” is not just a random old Linus kernel. It starts as some release of the vanilla kernel, and then the vendors cherry-pick upstream changes and fixes they deem to be relevant to their vendor kernel.
  2. The compatibility layer itself. Up to DRBD 9.0.19 this was a huge file containing many #ifdefs. It became a maintenance nightmare: hard to extend, hard to understand and debug, and hard to remove old compat code from. Everything was ugly.

Coccinelle is French for Ladybug

The Coccinelle project from INRIA contains a tool to apply semantic patches to source code, or to expand the semantic patches into conventional patches. A few of us DRBD developers practically fell in love with that tool. It allows us to express how code that is compatible with the upstream kernel needs to be changed in order to be compatible with older versions of the kernel.

This allows our DRBD upstream code to be in a form that has clean Linux upstream code, containing no compatibility hacks.

This allows us to automatically transform DRBD to be compatible with random old kernels or vendor kernels. The result, after the transformation, is clean C code without confusing macros and #IFs. It is wonderful.

The new kernel compatibility mechanism:

  1. Detect kernel capabilities (as before)
  2. Create a compat patch using spatch (from Coccinelle)
  3. Apply the compat patch and compile DRBD

Where there is light there must be shadow

The spatch tool is not available on all Linux distributions. For slightly older kernels we even require a very recent version of spatch, which is even less widely available. The researchers at INRIA write the tool in OCaml, a programming language that is right for them and the challenge, but not familiar to many in the open source community.

This complex build dependency makes it harder for community members to build drbd-9.0.20 and higher compared to how it was before.

The shortcut through the maze

For a number of well-known vendor kernels (RHEL/CentOS, Ubuntu LTS, SLES, Oracle Linux, Amazon Linux) we include the complete compat patches in the distributed source tar.gz. During the DRBD build process, all the COMPAT tests are executed and a hash value is calculated from their results.

If the build finds a pre-generated compat.patch for that hash value, it can continue without a call to spatch. Complex build dependency avoided!
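
You can check whether such a pre-generated patch ships with your tarball by listing the cache directory inside the unpacked source tree; the version number in the path is an example.

# Pre-generated compat patches live in the cocci_cache directory of the source tree.
ls drbd-9.0.20/drbd/drbd-kernel-compat/cocci_cache/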

The hard route through the maze

When you are building from a git checkout, or for a kernel for which we did not include a pre-generated compat.patch, you need spatch.

If necessary, you can run step 2 (using spatch) on a different machine than step 1 (testing kernel capabilities) and step 3 (compiling the DRBD kernel module).

Use ‘make’ to start the compilation process. If it fails just after this output:

 [...]
 COMPAT  sock_create_kern_has_five_parameters
 COMPAT  sock_ops_returns_addr_len
 COMPAT  submit_bio_has_2_params
 CHK     /home/phil/src/drbd09/drbd/compat.4.15.18.h
 UPD     /home/phil/src/drbd09/drbd/compat.4.15.18.h
 CHK     /home/phil/src/drbd09/drbd/compat.h
 UPD     /home/phil/src/drbd09/drbd/compat.h
 GENPATCHNAMES   4.15.0-48-generic
 SPATCH   27e10079afbff16b2b82fae9f7dbe676

Please take note of the hash value after “SPATCH”. That is like a fingerprint containing all the results of the countless “COMPAT” tests that were executed just before.

Then you need to copy the results of the COMPAT tests to a machine/VM/container that has the same DRBD source directory and a recent spatch.

 

rsync -rv drbd/drbd-kernel-compat/cocci_cache/27e10079afbff16b2b82fae9f7dbe676 \
    [email protected]:src/drbd-9.0.20/drbd/drbd-kernel-compat/cocci_cache/

Then you run the spatch part of the build process there:

 

ssh [email protected] "make -C src/drbd-9.0.20/drbd compat"

 

After that you copy the resulting compat.patch back:

 

rsync -rv [email protected]:src/drbd-9.0.20/drbd/drbd-kernel-compat/cocci_cache/ \
    drbd/drbd-kernel-compat/cocci_cache/

 

Call ‘make’ to restart the build process. If you did it right, it will find the generated compat.patch and finish the compilation process.

Get a Ladybug

If you’d like to get a spatch that is recent enough for building the DRBD driver, use the Docker container we published on Docker Hub: https://hub.docker.com/r/linbit/coccinelle.

 

docker pull linbit/coccinelle

 

Then put the following shell script under the name ‘spatch’ into your $PATH.

 

#!/bin/bash
# Run spatch inside the linbit/coccinelle container, passing all arguments through.
docker run -it --rm -v "$PWD:$PWD" -w "$PWD" linbit/coccinelle:latest spatch "$@"
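
To put the wrapper into place and check that it works, something like the following will do, assuming ~/bin is in your $PATH (paths are examples):

mkdir -p ~/bin                        # ensure the target directory exists
install -m 0755 spatch ~/bin/spatch   # make the wrapper executable and place it in $PATH
spatch --version                      # should now report the version from inside the container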


All of this is great for making the code more readable, easier to understand, and less likely to contain bugs. And having the DRBD code free of backward-compatibility clutter is an important milestone on the path to getting DRBD 9 into Linus’ vanilla kernel and replacing drbd-8.4 with drbd-9 there.

 

Philipp Reisner on Linkedin
Philipp Reisner
Philipp Reisner is founder and CEO of LINBIT in Vienna/Austria. His professional career has been dominated by developing DRBD, a storage replication for Linux. Today he leads a company of about 30 employees with locations in Vienna/Austria and Portland/Oregon.

Performance Gains with DRBD 10

A key factor in evaluating storage systems is their performance. LINBIT has been working to further improve the performance of DRBD. The recent DRBD 10 alpha release demonstrates significant gains.

The performance gains particularly help with highly concurrent workloads. This is an area that has been steadily rising in importance, and looks set to continue to rise. Improvements in single core speed appear to be stagnating while the availability of ever increasing numbers of cores is growing. Hence software systems need to utilize concurrency effectively to make the most of the computing resources.

We tested DRBD 10 with 4K random writes and various concurrency levels. In this test, the data is being replicated synchronously (“protocol C”) between two nodes. These numbers are for a single volume, not an aggregate over many volumes. I/O was generated by 8 processes. The tests show improvements in raw random write performance of up to 68%.

[Chart: DRBD 10 vs. DRBD 9 – 4K random write IOPS at various I/O depths]

These improvements were achieved by using a finer-grained locking scheme. This allows, for instance, one core to be sending a request while a second core is submitting the next request. The result is better utilization of the available cores and overall higher throughput.

Technical details

The above tests were carried out on a pair of 16 core servers equipped with NVMe storage and a direct ethernet connection. The software versions used were DRBD 10.0.0a1 and its most recent ancestor from the DRBD 9 branch (8e93a5d93b62). I/O was generated using the fio tool with the following parameters:

fio --name=test --rw=randwrite --direct=1 --numjobs=8 --ioengine=libaio --iodepth=$IODEPTH --bs=4k --time_based=1 --runtime=60 --size=48G --filename=/dev/drbd500

Ongoing development on DRBD 10

LINBIT is working on a number of exciting major features for DRBD 10.

  • Request forwarding. DRBD will send data to geographically distant sites only once and it will be replicated there.
  • PMEM journaling. DRBD can already access its metadata in a PMEM optimized fashion. That will be extended to using a PMEM device as a write-back cache, resulting in improved performance in latency-sensitive scenarios.
  • Erasure coding. DRBD will be able to erasure code and distribute its data. This provides the same functionality as RAID5/6, but with an arbitrary number of parity nodes. The result is lower disk usage with similar redundancy characteristics.

Stable releases of DRBD 10 are planned for 2020 – until then stay tuned for upcoming updates!

 

Joel Colledge on Linkedin
Joel Colledge
Joel is a software developer at LINBIT with a background in mathematics. A polyglot programmer, Joel enjoys working with many different languages and technologies. At LINBIT, he has been involved in the development of LINSTOR and DRBD. Originally from England, Joel is now based in Vienna, Austria.


New Features of LINSTOR Release – July 2019

The newest LINSTOR release (July 2019) comes with a bunch of new features, and one is really worth highlighting:

The developers of LINSTOR, the storage management tool for all things Linux, announced that the latest release comes with LDAP authentication. Software-defined storage users have been demanding authentication and access control, so we set this as a priority in July.

With support for basic LDAP authentication, you can configure an LDAP server and a search_filter to allow only members of a certain group to access LINSTOR. To accomplish this, here’s a sample configuration entry:

 [ldap]
   enabled = true
   uri = "ldaps://ldap.example.com"
   dn = "uid={user},ou=users,o=ha,dc=example"
   search_base = "dc=example"
   search_filter = "(&(uid={user})(memberof=cn=linstor,ou=services,o=ha,dc=example))"

The `{user}` template variable will be replaced with the login user.

Please note that LINSTOR must be configured with HTTPS in order to configure LDAP authentication. 
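
On a typical package-based installation, the [ldap] section goes into the LINSTOR controller’s configuration file, and the controller is restarted afterwards. The path and service name below assume a standard systemd setup.

# Add the [ldap] section shown above to the controller configuration, then restart the controller.
# Path and service name assume a standard package installation.
vi /etc/linstor/linstor.toml
systemctl restart linstor-controller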

Now you can securely manage privileges of your storage clusters, so the antics of those pesky interns don’t keep you awake at night.

 

Greg Eckert on Linkedin
Greg Eckert
In his role as the Director of Business Development for LINBIT America and Australia, Greg is responsible for building international relations, both in terms of technology and business collaboration. Since 2013, Greg has connected potential technology partners, collaborated with businesses in new territories, and explored opportunities for new joint ventures.

LINBIT Announces DRBD Support for Amazon Linux

The Digital Transformation

The concept of “Digital Transformation” for executive teams at Fortune-sized companies is no longer a new and flashy phrase. An important part of this Digital Transformation is how companies think about cloud computing. Where once organizations seemed to have only two choices, enter the cloud or keep everything on premises, the options are now a bit more “cloudy” (pun intended).

In the digital transformation age, Fortune companies are looking at multi-cloud strategies. They understand that siloing data into one cloud provider decreases their flexibility and ability to negotiate discounts while increasing the risks of a provider outage affecting production workloads. When Fortune 1000 companies think about their multi-cloud strategies they basically have 3 options:

  1. Keep some data on-prem and put some in the cloud
  2. Put data in different regions or zones within a single cloud provider
  3. Place data in many separate cloud providers

What’s great about all three is that companies can be dynamic about how they solve business goals, allocate budget, and provision resources. With this multi-cloud shift, some of the traditional technologies used in businesses need to adapt and change.

One of our Fortune 500 clients, which develops and sells financial, accounting, and tax preparation software, came to us because they were switching an OS installation from RHEL to Amazon Linux. Clearly, they are deep into their Digital Transformation journey: this workload was already in the cloud, and changing both the OS and the automation toolchain of a cloud deployment this large is no easy feat.

As a small team, we pride ourselves on jumping high at client requests, and within two weeks the work was done. The answer is: yes, LINBIT now supports DRBD 9.0 on Amazon Linux. As client demand changes, as workloads migrate to the cloud, and as containers gain traction, we are doing our best to be dynamic by listening to community and client feedback.

With millions of downloads, we rely on clients and the open-source community users to tell us what they want. If you haven’t been following our progress, this means we are thinking about how to improve performance for Linux High Availability Clusters and Disaster Recovery clusters for traditional workloads on hardware like NVMe and Optane, while also looking into kernel technology’s role in Kubernetes environments in conjunction with public and private cloud environments. What challenges exist here that didn’t before? What do users want? These are the questions that drive our development.

So, DRBD users: we’re here and we’re listening. Feel free to chime in on the community IRC (#drbd on freenode) and the mailing list, respond in the comments here, or ask questions about our YouTube videos, and let’s ensure that open source continues to drive innovation as the commercial giants decide which technologies to choose for their 5-year technology goals.

Greg Eckert on Linkedin
Greg Eckert
In his role as the Director of Business Development for LINBIT America and Australia, Greg is responsible for building international relations, both in terms of technology and business collaboration. Since 2013, Greg has connected potential technology partners, collaborated with businesses in new territories, and explored opportunities for new joint ventures.

Kubernetes Operator: LINSTOR’s Little Helper

Before we describe what our LINSTOR Operator does, it is a good idea to discuss what a Kubernetes Operator actually is. If you are already familiar with Kubernetes Operators, feel free to skip the introduction.

Introduction

CoreOS describes Operators like this:

An Operator is a method of packaging, deploying and managing a Kubernetes application. A Kubernetes application is an application that is both deployed on Kubernetes and managed using the Kubernetes APIs and  kubectl tooling.

That is quite a lot to grasp if you are new to the concept of Operators. What users want to do is their work, without having to worry about setting up infrastructure. Still, there is more to a software lifecycle than just firing up the application in a cluster once. This is where an Operator comes into play. In my opinion, a good analogy is to think of a Kubernetes Operator as an actual human operator. So what would the responsibilities of such a human operator be?

A human operator would be an expert in the business logic of the software that she runs and the dependencies that need to be fulfilled to run the software. Additionally, the operator would be responsible for configuring the software: to scale it, to upgrade it, to make backups, and so on. This is also the responsibility of a Kubernetes Operator implemented in software. It is a software component that is built by experts for a particular containerized software. This Operator is executed by an administrator who isn’t necessarily an expert for that particular software.

A Kubernetes Operator is executed in the Kubernetes cluster itself. In contrast to shell scripts or Ansible playbooks, which are pretty generic, the Operator framework has one specific purpose, as well as access to Kubernetes cluster information. Additionally, Operators are managed by standard Kubernetes tools and not by external configuration management tools. An Operator has a managed lifecycle of its own and is handled by the lifecycle manager.

LINSTOR Operator

The CoreOS FAQ contains the following sentence:

Experience has shown that the creation of an Operator typically starts by automating an application’s installation and self-service provisioning capabilities, and then evolves to take on more complex automation.

This pretty much sums up the current state of the linstor-operator. The project is still very young, and the current focus has been on automating the setup of the LINSTOR cluster. If you are familiar with LINSTOR, you know that there is a central component named the LINSTOR controller, and workers — named LINSTOR satellites — that actually create LVM volumes and configure DRBD to provide data redundancy.

Currently, the Operator can add new nodes/kubelets to the LINSTOR cluster by registering them, with their name and network interface configuration, with the LINSTOR controller. A second important task is making sure that a LINSTOR satellite can actually provide storage to the cluster, so the linstor-operator can also register one or multiple storage pools. Metrics of the LINSTOR satellites’ storage pools are exposed by the Operator. This saves the system administrator a lot of time, because the LINSTOR Kubernetes Operator automates many standard tasks that were previously performed manually (the sketch below shows the equivalent manual steps).
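
For comparison, the manual equivalent of what the Operator does for each node would look roughly like this with the LINSTOR client; the node name, IP address, and pool/volume-group names are examples.

# Register a satellite node with the controller (name and address are examples).
linstor node create kube-worker-1 10.0.0.11

# Register an LVM storage pool on that node, backed by the volume group "vg_ssd" (names are examples).
linstor storage-pool create lvm kube-worker-1 pool_ssd vg_ssd

# Verify what the cluster now knows about.
linstor node list
linstor storage-pool list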

Future work

There is a lot of possible future work that can be done. Some tasks are obvious, some will be driven by actual input from our users. For example, we think that configuring storage is one of the pain points for our users, so having a sidecar container that can discover and prepare storage pools to be consumed by LINSTOR might be a good idea. In a dynamic environment such as Kubernetes, it might be worthwhile to handle node failures in clever ways. We also have container images that can inject the DRBD kernel module into the running host kernel, which could help users get started with DRBD. High Availability is always an important topic, and related to that, using etcd as the database backend. Further, we want to tackle one of the core tasks of Operators, which is managing upgrades from one LINSTOR controller version to the next.

Thanks to Hayley for reviewing this blog post while being busy doing the actual work. This is just a subset of the capabilities we plan for the linstor-operator. Stay tuned for more information!

 

Roland Kammerer
Software Engineer at Linbit
Roland Kammerer studied technical computer science at the Vienna University of Technology and graduated with distinction. Currently, he is a PhD candidate with a research focus on time-triggered realtime-systems and works for LINBIT in the DRBD development team.

NuoDB on LINSTOR-Provisioned Kubernetes Volumes

Introduction:

NuoDB and LINBIT put our technologies together to see just how well they performed, and we are both happy with the results. We decided to compare LINSTOR-provisioned volumes against Kubernetes hostPath (direct-attached storage) in a Kubernetes cluster hosted on Google Cloud Platform (GCP), to show that our on-prem testing results can also be reproduced in a popular cloud-computing environment.

Background:

NuoDB is an ANSI SQL standard and ACID transactional compliant container-native distributed OLTP database that provides responsive scalability and continuous availability. This makes it a great choice for your distributed applications running in cloud provider-managed and open source Kubernetes environments, such as GKE, EKS, AKS and Red Hat OpenShift. As you scale-out NuoDB Transaction Engines in your cluster, you’re scaling out the database’s capacity to process dramatically more SQL transactions per second, and at the same time, building in process redundancy to ensure the database — and applications — are always on.

As you scale your database, you also need to scale the storage that the database is using to persist its data. This is usually where things get sluggish. Highly-scalable storage isn’t always highly-performant, and it seems most of the time the opposite is true. Highly-scalable, highly-performant storage is the niche that LINSTOR aims to fill.

LINSTOR, LINBIT’s SDS software, can be used to deploy DRBD devices in large scale storage clusters. DRBD devices are expected to be about as fast as the backing disk they were carved from, or as fast as the network device DRBD is replicating over (if DRBD’s replication is enabled). At LINBIT we usually aim for a performance impact of less than 5% when using DRBD replication in synchronous mode.

The LINSTOR CSI (container storage interface) driver for Kubernetes allows you to dynamically provision LINSTOR-backed block devices as persistent volumes for your container workloads… you see where I’m going… 🙂
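
To make that concrete, a storage class using the LINSTOR CSI driver can be defined along these lines. The replica count and storage pool name are examples, and the parameter names may differ between driver versions, so consult the LINSTOR CSI documentation for your release.

# A StorageClass backed by the LINSTOR CSI driver; replica count and pool name are examples.
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: linstor-replicated
provisioner: linstor.csi.linbit.com
parameters:
  autoPlace: "3"
  storagePool: "pool_ssd"
EOF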

Testing:

I spun up a 3-node GKE (Google’s Kubernetes Engine) cluster in GCP, and customized the standard node type with 6-vCPU and 22GB of memory for each node:

[Screenshot: GKE node configuration]

When using GKE to spin up a Kubernetes cluster, you’re provided with a “standard” storage class by default. This “standard” storage class dynamically provisions and attaches GCE standard disks to your containers that need persistent volumes. Those GCE standard disks are the pseudo “hostpath” device we wanted to compare against, so we deployed NuoDB into the cluster, and ran a YCSB (Yahoo Cloud Serving Benchmark) SQL workload against it to generate our baseline:

[Screenshot: NuoDB Insights overview for the GKE baseline]

Using the NuoDB Insights visual monitoring tool (comes as standard equipment with NuoDB), we can see in the chart above we had 3 TEs (Transaction Engine) pods feeding into 1 SM (Storage Manager) pod. We can also see that our Aggregate Transaction Rate (TPS) is hovering just over 15K transactions per second. Also, as a side note, this deployment created 5-GCE Standard disks in my Google Cloud Engine account.

LINSTOR provisions its storage from an established LINSTOR cluster, so for our LINSTOR comparison I had to stand up Kubernetes on GCE nodes “the hard way” so that I could also stand up a LINSTOR cluster on the nodes (see LINBIT’s user’s guide or the LINSTOR quickstart for more on those steps). I created 4 nodes as GCE VM instances: 3 nodes were set up to mimic the GKE cluster, each with 6 vCPUs and 22GB of memory, plus 1 master node, with a master node taint in Kubernetes so that no pods would be scheduled on it, with 2 vCPUs and 16GB of memory. Google recommended I scale these nodes back to save money, so I did that, resulting in the following VM instances:

[Screenshot: GCE VM instances]

After setting up the LINSTOR and Kubernetes cluster in the GCE VM Instances, I attached a single “standard” GCE disk to each node for LINSTOR to provision persistent volumes from, and deployed the same NuoDB distributed database stack and YCSB workload into the cluster:

[Screenshot: NuoDB Insights overview for the LINSTOR-backed cluster]

After letting the benchmarks run for some time, I could see that we were hovering just under 15k, which is within the expected 5% of our ~15k baseline!

Conclusion:

You might be thinking, “That’s good and all, but why not just use GKE with the GCE-backed ‘standard’ storage class?” The answer is features. Using LINSTOR to provide storage to your container platform enables you to:

  • Add replicas of volumes for resiliency at the storage layer – including remote replicas
  • Use replicas of your volumes in DRBD’s read balancing policies which could increase your read speeds beyond what’s possible from a single volume
  • Provide granular control of snapshots at either the Kubernetes or LINSTOR-level
  • Provide the ability to clone volumes from snapshots
  • Enable transparently encrypted volumes
  • Provide data-locality or accessibility policies
  • Lower managerial overhead in terms of the number of physical disks (comparing one GCE disk for each PV with GKE vs. one GCE disk for each storage node with LINSTOR).

Ultimately, the combination of NuoDB and LINSTOR enables clients to run high-performance persistent databases in the cloud or on premise with ease-of scale and “always-on” resiliency. So far, after testing both proprietary and open-source software, NuoDB has found that LINSTOR’s open-source SDS is a production-ready, high-performance, and highly reliable storage solution to provision persistent volumes.

 

Matt Kereczman on Linkedin
Matt Kereczman
Matt is a Linux Cluster Engineer at LINBIT with a long history of Linux System Administration and Linux System Engineering. Matt is a cornerstone in LINBIT’s support team, and plays an important role in making LINBIT’s support great. Matt was President of the GNU/Linux Club at Northampton Area Community College prior to graduating with Honors from Pennsylvania College of Technology with a BS in Information Security. Open Source Software and Hardware are at the core of most of Matt’s hobbies.

Coming Soon, a New DRBD Proxy Release

One major part of LINBIT Disaster Recovery is DRBD Proxy, which helps DRBD with long-distance real-time replication. DRBD Proxy mitigates bandwidth, latency, and distance issues by buffering writes into memory, ensuring that your WAN latency doesn’t become your disk throughput.

The upcoming release of DRBD Proxy will come with a few new tools to improve data replication with compression. Its LZ4 plugin has been updated to the latest version, 1.9.0, and the Zstandard algorithm has been added as a brand-new plugin.

Both offer a great balance of compression ratio and speed while delivering higher replication performance on the DRBD end. In our test cases, both performed considerably better in overall read and write operations than runs without compression.

Here’s a short synopsis of some of the tests we ran. For this setup, we built a geographically separated two-node DRBD cluster. Both nodes ran the latest, yet-to-be-released version of DRBD Proxy for various IO tests. The compression level for Zstandard was 3, which is the default on a scale that goes up to 22. LZ4 was set to level 9, which is the maximum level.
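
For context, the compression plugin and its level are selected in the proxy section of the DRBD resource configuration. The snippet below is only a sketch to show where such settings live; the exact plugin keywords and options in the new release may differ, so refer to the DRBD Proxy documentation.

# Sketch of the proxy-related part of a resource definition; keywords and values are illustrative.
proxy {
    memlimit 512M;      # RAM buffer for writes in flight over the WAN link
    plugin {
        zstd level 3;   # or, for example: lz4 level 9;
    }
}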

MySQL Read Write Operations with Sysbench

In this scenario, we used sysbench to perform random reads and writes to a MySQL database replicated on both nodes with DRBD Proxy and DRBD. Sysbench created a random database mounted on a 200MB DRBD volume with Thin LVM backing. Then it performed random transactions for 100 seconds.

The improved number of writes and overall transactions with compression is pretty clear compared to the ‘Proxy Only’ numbers. Interestingly, LZ4 and Zstandard both performed quite similarly.

MySQL Average Latency on MySQL RW Tests

The average latency from the same MySQL tests showed another interesting fact. When using DRBD Proxy, DRBD uses protocol A, which is an asynchronous mode. This setup in the test performed quite nicely compared to replicating with protocol C, the default synchronous mode. All three proxy configurations, regardless of the compression, performed very well against synchronous mode. The different modes of DRBD transport are explained here.

Other random IO tests performed with sysbench on the file system as well as fio tests at the block level mirrored the results shown above, where compression with proxy helped greatly with reducing the network payload while increasing overall read/write performance.

This was a quick preview of the upcoming DRBD Proxy release highlighting its compression plugins. Please stay tuned for the release announcement, and for any questions or comments, feel free to reach me in the comments below.

P.S. The test nodes were configured relatively light. The local node was a 4-core VM with 1GB of RAM running Ubuntu 18.04 and DRBD v9.0.18. The remote node was a 4-core VM with 4GB of RAM also running the same OS and DRBD. The WAN link was restricted to 2MB/s. The relevant sysbench commands used were:

sysbench /usr/share/sysbench/oltp_read_write.lua --mysql-db=sbtest --db-driver=mysql --tables=10 --table-size=1000 prepare
sysbench /usr/share/sysbench/oltp_read_write.lua --mysql-db=sbtest --db-driver=mysql --tables=10 --table-size=1000 --report-interval=10 --threads=4 --time=100 run
Woojay Poynter
IO Plumber
Woojay is working on data replication and software-defined-storage with LINSTOR, built on DRBD @LINBIT. He has worked on web development, embedded firmwares, professional culinary education, power carving with ice and wood. He is a proud father and likes to play with legos.