
Optimizing DRBD for Persistent Memory

Persistent Memory (PMEM) is an exciting storage tier with much lower latency than SSDs. LINBIT has optimized DRBD for when its metadata is stored on PMEM/NVDIMM.

This article relates to both:

  • Traditional NVDIMM-N: Some DRAM is accompanied by NAND-flash. On power failure, a backup power source (supercap, battery) is used to save the contents of the DRAM to the flash storage. When the main power is restored, the contents of the DRAM are restored. These components have exactly the same timing characteristics as DRAM and are available in sizes of 8GB, 16GB or 32GB per DIMM.
  • Intel’s new Optane DC Persistent Memory: These DIMMs are built using a new technology called 3D XPoint. It is inherently non-volatile and has only slightly higher access times than pure DRAM. It comes in much higher capacities than traditional NVDIMMs: 128GB, 256GB and 512GB.

DRBD requires metadata to keep track of which blocks are in sync with its peers. This consists of 2 main parts. One part is the bitmap which keeps track of exactly which 4KiB blocks may be out of sync. It is used when a peer is disconnected to minimize the amount of data that must be synced when the nodes reconnect. The other part is the activity log. This keeps track of which data regions have ongoing I/O or had I/O activity recently. It is used after a crash to ensure that the nodes are fully in sync. It consists of 4MiB extents which, by default, cover about 5GiB of the volume.

Since the DRBD metadata is small and frequently accessed, it is a perfect candidate to be placed on PMEM. A single 8GiB NVDIMM can store enough metadata for 100 volumes of 1TiB each, allowing for replication between 3 nodes.
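
To make that sizing concrete: the bitmap needs roughly one bit per 4KiB block per peer, so a 1TiB volume replicated to two peers requires about 64MiB of bitmap plus a comparatively small activity log, and 100 such volumes need a little over 6GiB, which fits in a single 8GiB NVDIMM. Below is a minimal sketch of a resource configuration that places the metadata externally on a PMEM device; the resource name, host names, addresses, and device paths (/dev/pmem0, /dev/nvme0n1p1) are illustrative assumptions rather than details from the test setup.

# Sketch: external DRBD metadata on a PMEM namespace
# (resource name, hosts, addresses and device paths are examples only)
cat > /etc/drbd.d/r0.res <<'EOF'
resource r0 {
    device     /dev/drbd500;
    disk       /dev/nvme0n1p1;   # data on NVMe
    meta-disk  /dev/pmem0;       # metadata on PMEM/NVDIMM
    on alpha {
        node-id 0;
        address 192.168.1.1:7500;
    }
    on bravo {
        node-id 1;
        address 192.168.1.2:7500;
    }
}
EOF
drbdadm create-md r0   # writes the metadata structures to /dev/pmem0
drbdadm up r0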

PMEM outperforms

DRBD 9 has been optimized to access metadata on PMEM directly using memory operations. This approach is extremely efficient and leads to significant performance gains. The improvement is most dramatic when the metadata is updated most often, which occurs when writes are performed serially, that is, with an I/O depth of 1. In that case, scattered I/O forces the activity log to be updated on every write. Here we compare the performance between metadata on a separate NVMe device and metadata on PMEM, with and without the optimizations.

As can be seen, placing the DRBD metadata on a PMEM device results in a massive performance boost for this kind of workload.

[Figure: DRBD performance with metadata on NVMe vs. PMEM, serial writes (I/O depth 1)]

Impact with concurrent I/O

When I/O is submitted concurrently, DRBD does not have to access the metadata as often. Hence we do not expect the performance impact to be quite as dramatic. Nevertheless, there is still a significant performance boost, as can be seen.

If you have a workload with very high I/O depth, you may wish to trial DRBD 10, which performs especially well in such a situation. See https://www.linbit.com/en/drbd10-vs-drbd9-performance/.

[Figure: DRBD performance with metadata on NVMe vs. PMEM, concurrent I/O]

Technical details

The above tests were carried out on a pair of 16-core servers equipped with NVMe storage and a direct Ethernet connection. Each server had an 8GiB DDR4 NVDIMM from Viking installed. DRBD 9.0.17 was used to perform the tests without the PMEM optimizations and DRBD 9.0.20 for the remainder. I/O was generated using the fio tool with the following parameters:

fio --name=test --rw=randwrite --direct=1 --numjobs=8 \
    --ioengine=libaio --iodepth=$IODEPTH --bs=4k --time_based=1 \
    --runtime=60 --size=48G --filename=/dev/drbd500
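
For reference, the sweep over I/O depths can be scripted with a small loop; the specific iodepth values below are illustrative assumptions, not the exact set used in the tests.

# Run the same fio job at increasing I/O depths against the DRBD device
for IODEPTH in 1 2 4 8 16 32; do
    fio --name=test --rw=randwrite --direct=1 --numjobs=8 \
        --ioengine=libaio --iodepth=$IODEPTH --bs=4k --time_based=1 \
        --runtime=60 --size=48G --filename=/dev/drbd500
done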

 

If you have technical questions, don’t hesitate to subscribe to our email list.

Joel Colledge
Joel is a software developer at LINBIT with a background in mathematics. A polyglot programmer, Joel enjoys working with many different languages and technologies. At LINBIT, he has been involved in the development of LINSTOR and DRBD. Originally from England, Joel is now based in Vienna, Austria.

Monitoring Linux HA Clusters with Prometheus

You’ve likely heard of Prometheus, the open-source monitoring and alerting solution originally developed at SoundCloud. Prometheus was the second project incubated by the Cloud Native Computing Foundation in 2016 (the first being Kubernetes), and is used by companies like Google, Red Hat, and DigitalOcean as a scalable and efficient way to monitor infrastructure in a cloud-native way. If this is the first you’ve heard of Prometheus, I would strongly recommend heading over to Prometheus.io to read through the introduction in their documentation.

ClusterLabs, the organization unifying the collection of open-source projects pertaining to Linux High Availability (HA) Clustering, recently added the ha_cluster_exporter project to the organization’s GitHub (Fall 2019). The ha_cluster_exporter project is a Prometheus exporter that exposes Linux HA Cluster metrics to Prometheus!
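
Getting those metrics into Prometheus only takes a scrape job pointed at the exporter on each cluster node. The sketch below is assumption-laden: the job name and node names are placeholders, and port 9664 is assumed as the exporter’s listen port, so verify it against the project’s README.

# Sketch: scrape the ha_cluster_exporter on both cluster nodes
# (job name, node names and port 9664 are assumptions; check the exporter's README)
# The job belongs under the scrape_configs: section of prometheus.yml; appending
# like this only works if scrape_configs: is the last section, otherwise merge by hand.
cat >> /etc/prometheus/prometheus.yml <<'EOF'
  - job_name: 'ha_cluster'
    static_configs:
      - targets: ['node-a:9664', 'node-b:9664']
EOF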

Linux HA Cluster with Prometheus

Linux HA Clustering is all about service availability and uptime, so you won’t have users informing you when a node in your cluster has died, since they shouldn’t notice. Because of this, LINBIT has heard the following story a few times: “No one knew the Primary cluster node had failed until the Secondary node failed as well.” Ouch. Besides offering our condolences in what likely was a resume generating event (RGE) for someone, we usually can only suggest that they set up better monitoring and alerting of their cluster nodes via the software of their choice. However, after less than a day of playing with Prometheus and the ha_cluster_exporter, these tools may have just jumped to the top of my list of recommended monitoring stacks.

After running through Prometheus’ installation documentation and compiling the ha_cluster_exporter as documented in the project’s README.md, I was quickly querying Prometheus for my Linux HA Cluster’s metrics (example queries are sketched after this list):

  • How synchronized is DRBD (kicked off a full resync prior to capturing):
    DRBD synchronization graph
  • What’s the Pacemaker fail count on a given resource (p_vip_ip in this capture):
    Pacemaker resource fail count
  • Have the Corosync rings (communication networks) experienced any errors:
    Corosync ring errors
  • Adding Prometheus’ Node Exporter – an open-source exporter developed under the Prometheus GitHub org – to my Linux HA Cluster nodes enabled me to extract metrics from the Linux Kernel (writes to DRBD in MB/s):
    DRBD writes MB/s
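
For reference, the kinds of queries behind the graphs above can also be issued directly against Prometheus’ HTTP API. The metric and label names below are assumptions based on what the exporters document, so check the metric lists exposed by your installed versions before copying them.

# Sketch: ad-hoc queries against Prometheus' HTTP API
# (metric and label names are assumptions; verify them on each exporter's /metrics page)
PROM=http://localhost:9090

# Pacemaker fail count for a single resource
curl -s "$PROM/api/v1/query" \
     --data-urlencode 'query=ha_cluster_pacemaker_fail_count{resource="p_vip_ip"}'

# Corosync ring errors across the cluster
curl -s "$PROM/api/v1/query" \
     --data-urlencode 'query=sum(ha_cluster_corosync_ring_errors)'

# Write throughput to the DRBD device via the node_exporter (bytes per second)
curl -s "$PROM/api/v1/query" \
     --data-urlencode 'query=rate(node_disk_written_bytes_total{device="drbd500"}[1m])'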

Alertmanager in Prometheus

For a complete list of the metrics scraped by the ha_cluster_exporter, please see the ha_cluster_exporter’s documentation on GitHub. For a complete list of the metrics scraped by Prometheus’ Node Exporter, as well as how to build and install it, please see the Prometheus Node Exporter’s documentation on GitHub.

Using Prometheus to collect and query these metrics is a great first step, but alerting is probably the most important, and often the most difficult to configure, aspect of monitoring. Prometheus splits alerting out into a separate application called Alertmanager. You configure your alerting rules inside of Prometheus, and Prometheus sends its alerts to an Alertmanager instance. The Alertmanager then, based on the rules you’ve configured, deduplicates and groups the alerts and routes them to the correct receiver, which can send notifications via email, Slack, PagerDuty, Opsgenie, HipChat, and more. The Prometheus Alertmanager configuration documentation has example configs for all of the aforementioned alert receivers.
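
As a rough sketch of that last hop, an Alertmanager configuration that routes every alert to a Slack webhook could look like the following; the webhook URL, channel, and file path are placeholders.

# Sketch: minimal Alertmanager config routing all alerts to Slack
# (webhook URL, channel name and file path are placeholders)
cat > /etc/alertmanager/alertmanager.yml <<'EOF'
route:
  receiver: slack-notifications
receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#prometheus-slack'
        send_resolved: true
EOF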

I was able to quickly set up a Slack Webhook app in a Slack channel named #prometheus-slack, and add an alert rule to Prometheus instructing it to send an alert to Slack when the sum of all Pacemaker resources’ fail counts exceeds zero for longer than five minutes. I created a resource failure in my Pacemaker cluster by removing the cluster-managed virtual IP address from the active cluster node, and five minutes later, as configured, I received a message in Slack with a link to the alert specifics in Prometheus.
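
A Prometheus alerting rule matching that description might be sketched as below; the ha_cluster_pacemaker_fail_count metric name, the rule file path, and the group name are assumptions, and the file still has to be referenced from rule_files: in prometheus.yml.

# Sketch: alert when the summed Pacemaker fail count stays above zero for 5 minutes
# (metric name and file path are assumptions)
cat > /etc/prometheus/rules/ha_cluster.yml <<'EOF'
groups:
  - name: ha-cluster
    rules:
      - alert: PacemakerResourceFailures
        expr: sum(ha_cluster_pacemaker_fail_count) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pacemaker is reporting failed resource operations"
EOF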

Integrating Grafana

The last, but probably the most satisfying, part of my testing was integrating Grafana into the monitoring stack for data visualization. Grafana is an open-source project that lets you query, visualize, and alert on data gathered from any of its data source plugins (many of which come bundled with Grafana), bringing it all together as easily understandable dashboards. Grafana has natively included Prometheus as one of its data source plugins since version 2.5.0, so integration is very easy.
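
The data source can be added through Grafana’s web UI or provisioned from a file; a minimal provisioning sketch, assuming a package-style install and a local Prometheus on port 9090, might look like this.

# Sketch: provision Prometheus as a Grafana data source
# (file path and URL are assumptions; adding it via the web UI works just as well)
cat > /etc/grafana/provisioning/datasources/prometheus.yml <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF
systemctl restart grafana-server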

In less than 30 minutes, I had set up what I thought was a nice little HA Cluster Dashboard using the same, or very similar, expressions as when querying Prometheus directly. Grafana makes customizing the visualizations very easy through its web front end. For example, the “Pacemaker Resource Failures” counter in the HA Cluster Dashboard depicted below turns from green to red when the counter exceeds zero:

Grafana Dashboard

I’ve been very impressed by how easily all these different tools integrate with one another, and I encourage anyone wondering what they should do for monitoring to test it out. Even though this post describes monitoring, alerting, and visualizing machine-centric metrics from Linux HA Cluster nodes, the ease of integration fits perfectly into the new world of microservices and cloud-native applications, so use cases and implementations are plentiful. For example, using Prometheus and Prometheus’ Node Exporter on my Kubernetes worker nodes to expose metrics for dynamically provisioned LINSTOR volumes is a no-brainer for a future blog post.

If you’re already using Prometheus and the ha_cluster_exporter to monitor your Linux HA Clusters, let us know how in the comments or via email.

Matt Kereczman
Matt is a Linux Cluster Engineer at LINBIT with a long history of Linux System Administration and Linux System Engineering. Matt is a cornerstone in LINBIT’s support team, and plays an important role in making LINBIT’s support great. Matt was President of the GNU/Linux Club at Northampton Area Community College prior to graduating with Honors from Pennsylvania College of Technology with a BS in Information Security. Open Source Software and Hardware are at the core of most of Matt’s hobbies.

LINBIT announces Piraeus Datastore – Software-Defined Storage (SDS) for Kubernetes

Piraeus Datastore offers a high-performance, highly reliable SDS solution for Persistent Volumes in Kubernetes

 

San Diego, November 18, 2019 – LINBIT, the inventor of the open-source software DRBD™ and a leader in Linux storage software, has announced the project “Piraeus Datastore”. Piraeus Datastore offers a fast, stable way for users to provide Persistent Volumes for their Kubernetes applications. It is developed under the open-source development model and has been made publicly available via pre-built packages and containers in a joint effort between LINBIT, DaoCloud, and the open-source community.

 

Serving the rapidly growing market for containerized applications, Piraeus Datastore fills a significant gap by providing a stable Software-Defined Storage solution for Kubernetes applications that require high-performance block storage. “When it comes to storage, Piraeus helps clients achieve better system availability than competitors in the space can offer,” says Philipp Reisner, CEO of LINBIT. “Most Kubernetes storage newcomers combined the data and control plane to push out a minimum viable product. By separating these components out, we ensure that controller failure doesn’t impact storage system availability.”

 

DaoCloud has been a leading cloud-native computing vendor in China since 2014, with extensive experience running Kubernetes in production for Fortune Global 500 customers in the manufacturing and finance industries, such as SAIC, Haier, and SPDB bank. They believe that Piraeus fits their clients’ needs for cloud-native, container-attached storage that provides both reliability and performance. Roby Chen, CEO of DaoCloud, says: “It is very exciting that we have the Piraeus project to elevate DRBD technology into the cloud-native arena, where we believe it can play a key role.”

Kubernetes is generating more and more buzz

Piraeus Datastore provides data persistence for elastic applications, which dynamically create or remove containers depending on the load. A recent survey of 390 IT professionals showed that 51% of participants acknowledge an increase in Kubernetes adoption in the last six months, and 86% of respondents say they have now adopted Kubernetes, up from 57% half a year earlier. Storage remains a requirement for enterprises that are beginning to move their applications over to Kubernetes, and the Piraeus Datastore solution is perfect for clients who need the combination of reliability and performance.

 

With Piraeus Datastore, two key open-source technologies are packaged in a way that is easily accessible and consumable for Kubernetes users. They bring the highest performance by leveraging the proven DRBD technology, and real-world operability thanks to the way LINSTOR™ separates the control and data paths.

Piraeus Datastore makes proven technologies cloud-native. DRBD has seen continuous development, improvement, and optimization for 19 years. LINSTOR stands out in the field of SDS systems by having a control plane that is separate and independent from the data plane. The big advantage of this separation is that it makes upgrades in a running storage cluster feasible, which saves downtime and therefore a lot of money.

Piraeus Datastore is perfectly suited for databases, AI, and analytics workloads, which require the throughput and latency of primary storage.

 

LINBIT SDS™ is the industry’s fastest software-defined storage solution for enterprise, cloud, and container environments. LINBIT SDS leverages the DRBD™ and LINSTOR™ technologies to provision, replicate, and manage data storage, independent of the underlying hardware. The Piraeus containers are intended for adoption by the community, while the LINBIT SDS containers are intended for consumption by corporate users. The LINBIT SDS product comes with enterprise support options, while Piraeus is supported by the community.

 

For a closer comparison, see the following table:

 

                             Piraeus Datastore              LINBIT SDS
Container base image         Debian                         UBI
Pre-built images available   Publicly on Dockerhub, Quay    For LINBIT customers (drbd.io)
Support                      community only                 ✅ enterprise, incl. 24/7
Runs with OpenShift          without DRBD
OpenShift certified          n.a.
Kernel module                compile from source            compile & pre-compiled
Contains                     DRBD, piraeus-operator, linstor-csi and LINSTOR
Licensing                    open source software, GPL & Apache
Developed and verified for   Kubernetes container orchestration and Red Hat Primed OpenShift
Platforms on the roadmap     IBM Cloud containers, SUSE CaaS

 

Learn more:

 

About LINBIT

 

LINBIT is the force behind DRBD and a leader in open-source Linux block storage software for enterprise and cloud computing. LINBIT software has helped dozens of global companies, such as Volkswagen, Intel, Cisco, Siemens, and the BBC, provide High Availability (HA), Geo Clustering for Disaster Recovery (DR), and Software-Defined Storage (SDS) for public and private clouds. Based in Vienna, LINBIT partners with companies such as Red Hat, Intel, IBM, and DaoCloud to accelerate Linux storage software. For more information, visit linbit.com or follow @linbit.

 

Social Media Channels:

 

LINBIT on Twitter

LINBIT on Linkedin

LINBIT on Youtube

LINBIT on Facebook

 

About DaoCloud

 

DaoCloud has been a leading cloud-native computing vendor in China since 2014, with extensive experience running Kubernetes in production for Fortune Global 500 customers in the manufacturing and finance industries, such as SAIC, Haier, and SPDB bank.

 

PR contact:

 

Sebastian Schinhammer

Marketing Manager

sebastian.schinhammer(@)linbit.com

Phone: 0043 1 817 82 92 -64

LINBIT HA-Solutions GmbH

Vivenotgasse 48

1120 Wien