
Monitoring Linux HA Clusters with Prometheus

You’ve likely heard of Prometheus, the open-source monitoring and alerting solution originally developed at SoundCloud. Prometheus was the second project incubated by the Cloud Native Computing Foundation in 2016 (the first being Kubernetes), and is used by companies like Google, Red Hat, and DigitalOcean as a scalable and efficient way to monitor infrastructure in a cloud-native way. If this is the first you’ve heard of Prometheus, I would strongly recommend reading through the introduction in their documentation.

ClusterLabs, the organization unifying the collection of open-source projects pertaining to Linux High Availability (HA) Clustering, recently added the ha_cluster_exporter project to the organization’s GitHub (Fall 2019). The ha_cluster_exporter project is a Prometheus exporter that exposes Linux HA Cluster metrics to Prometheus!

Linux HA Cluster with Prometheus

Linux HA Clustering is all about service availability and uptime, so you won’t have users informing you when a node in your cluster has died, since they shouldn’t notice. Because of this, LINBIT has heard the following story a few times: “No one knew the Primary cluster node had failed until the Secondary node failed as well.” Ouch. Besides offering our condolences for what was likely a resume generating event (RGE) for someone, we can usually only suggest that they set up better monitoring and alerting of their cluster nodes using the software of their choice. However, after less than a day of playing with Prometheus and the ha_cluster_exporter, these tools may have just jumped to the top of my list of recommended monitoring stacks.

After running through Prometheus’ installation documentation, and compiling the ha_cluster_exporter as documented by the project, I was quickly querying Prometheus for my Linux HA Cluster’s metrics:

  • How synchronized is DRBD (kicked off a full resync prior to capturing):
    DRBD synchronization graph
  • What’s the Pacemaker fail count on a given resource (p_vip_ip in this capture):
    Pacemaker resource fail count
  • Have the Corosync rings (communication networks) experienced any errors:
    Corosync ring errors
  • Adding Prometheus’ Node Exporter – an open-source exporter developed under the Prometheus GitHub org – to my Linux HA Cluster nodes enabled me to extract metrics from the Linux Kernel (writes to DRBD in MB/s):
    DRBD writes MB/s
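Queries like the ones above can also be sent to Prometheus’ HTTP API instead of being typed into its web UI. The following is a minimal sketch, assuming a Prometheus server at http://localhost:9090; the metric names, the p_vip_ip and drbd0 label values, and the helper name build_query_url are illustrative assumptions, so check each exporter’s /metrics output for the names your versions actually expose.

```python
from urllib.parse import urlencode

# Assumed PromQL expressions for the four questions above. The metric names
# are assumptions; verify them against the exporters' /metrics output.
EXAMPLE_QUERIES = {
    "drbd_sync_percent": "ha_cluster_drbd_synced",
    "pacemaker_fail_count": 'ha_cluster_pacemaker_fail_count{resource="p_vip_ip"}',
    "corosync_ring_errors": "ha_cluster_corosync_ring_errors",
    "drbd_write_mb_per_s": 'rate(node_disk_written_bytes_total{device="drbd0"}[1m]) / 1e6',
}


def build_query_url(base_url: str, expr: str) -> str:
    """Build an instant-query URL for Prometheus' /api/v1/query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


if __name__ == "__main__":
    # Print one URL per example query; fetch them with any HTTP client.
    for name, expr in EXAMPLE_QUERIES.items():
        print(name, build_query_url("http://localhost:9090", expr))
```

Each URL returns a JSON document with the current value of the expression, which makes it easy to script ad hoc checks against the same data the graphs are drawn from.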

Alertmanager in Prometheus

For a complete list of the metrics exposed by the ha_cluster_exporter, please see the ha_cluster_exporter’s documentation on GitHub. For a complete list of the metrics exposed by Prometheus’ Node Exporter, as well as how to build and install it, please see the Prometheus Node Exporter’s documentation on GitHub.

Using Prometheus to collect and query these metrics is a great first step, but alerting is probably the most important aspect of monitoring, and often the most difficult to configure. Prometheus splits alerting out into a separate application called Alertmanager. You configure your alerting rules inside of Prometheus, and Prometheus sends its alerts to an Alertmanager instance. The Alertmanager then, based on the rules you’ve configured, deduplicates, groups, and routes the alerts to the correct receiver, which could be email, Slack, PagerDuty, Opsgenie, HipChat, and more. The Prometheus Alertmanager configuration documentation has example configs for all of the aforementioned alert receivers.

I was able to quickly set up a Slack Webhook app in a Slack channel named #prometheus-slack, and add an alert rule to Prometheus instructing it to send an alert to Slack when the sum of all Pacemaker resources’ failcounts exceeds zero for longer than five minutes. I created a resource failure in my Pacemaker cluster by removing the cluster managed virtual IP address from the active cluster node, and five minutes later, as configured, I received a message in Slack with a link to the alert specifics in Prometheus.
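The “exceeds zero for longer than five minutes” behavior is worth spelling out, because it explains the delay before the Slack message arrived. The sketch below is a small simulation of that hold period, not Prometheus code: an alert is “pending” as soon as the condition is true and only becomes “firing” once the condition has stayed true for the full hold time. The function name alert_states and the sample data are hypothetical, and the values stand in for the result of an expression such as a sum over the exporter’s Pacemaker fail count metric.

```python
from typing import List, Tuple

FOR_SECONDS = 5 * 60  # the five-minute hold period described above


def alert_states(
    samples: List[Tuple[int, float]],
    threshold: float = 0.0,
    for_seconds: int = FOR_SECONDS,
) -> List[Tuple[int, str]]:
    """Classify each (timestamp_seconds, value) sample, given in time order.

    Returns (timestamp, state) pairs, where state is "inactive", "pending",
    or "firing".
    """
    states = []
    active_since = None  # timestamp at which the condition first became true
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts
            held_long_enough = ts - active_since >= for_seconds
            state = "firing" if held_long_enough else "pending"
        else:
            # Condition cleared: reset the hold timer.
            active_since = None
            state = "inactive"
        states.append((ts, state))
    return states


if __name__ == "__main__":
    # One evaluation per minute; the fail count rises at t=60s, clears at t=420s.
    values = [0, 1, 1, 1, 1, 1, 1, 0]
    samples = [(i * 60, v) for i, v in enumerate(values)]
    for ts, state in alert_states(samples):
        print(f"t={ts:>3}s {state}")
```

A brief failure that clears before the hold period ends never leaves the pending state, which keeps short-lived blips out of the Slack channel.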

Integrating Grafana

The last, but probably the most satisfying, part of my testing was integrating Grafana into the monitoring stack for data visualization. Grafana is an open source project that enables you to query, visualize, and alert on data gathered from any of its data source plugins (many of which come bundled with Grafana), bringing it all into a single place as easily understandable dashboards. Grafana, since version 2.5.0, natively includes Prometheus as one of its data source plugins, so integration is very easy.

In less than 30 minutes, I had set up what I thought was a nice little HA Cluster Dashboard using the same, or very similar, expressions as used when querying Prometheus directly. Grafana makes customizing the visualizations very easy through its web front end. For example, the “Pacemaker Resource Failures” counter value in the HA Cluster Dashboard depicted below will turn from green to red when the counter exceeds zero:

Grafana Dashboard

I’ve been very impressed by how easily all these different tools integrate with one another, and I encourage anyone wondering what they should do for monitoring to test it out. Even though this blog post describes monitoring, alerting, and visualizing machine-centric metrics from Linux HA Cluster nodes, the ease of integration fits perfectly into the new world of microservices and cloud-native applications, so use cases and implementations are plentiful. For example, using Prometheus and Prometheus’ Node Exporter on my Kubernetes worker nodes to expose metrics pertaining to dynamically provisioned LINSTOR volumes is a no-brainer for a future blog post.

If you’re already using Prometheus and the ha_cluster_exporter to monitor your Linux HA Clusters, let us know how in the comments or via email.

Matt Kereczman
Matt is a Linux Cluster Engineer at LINBIT with a long history of Linux System Administration and Linux System Engineering. Matt is a cornerstone in LINBIT’s support team, and plays an important role in making LINBIT’s support great. Matt was President of the GNU/Linux Club at Northampton Area Community College prior to graduating with Honors from Pennsylvania College of Technology with a BS in Information Security. Open Source Software and Hardware are at the core of most of Matt’s hobbies.

Dreaded Day of Downtime

Some say that no one dreads a day of downtime like a storage admin.

I disagree. Sure, the storage admins might be responsible for recovering a whole organization if an outage occurs; and sure, they might be the ones who lose their jobs from an unexpected debacle, but I would speculate that others have more to lose.

First, the company’s reputation takes a big, possibly irreparable hit with both clients and employees. Damage control usually lasts far longer than the original outage. Take the United Airlines case from earlier in 2017 when a computer malfunction led to the grounding of all domestic flights. Airports across the country were forced to tweet out messages about the technical issues after receiving an overwhelming number of complaints. After an outage such as this one, it can take months or years to repair trust with your customers. Depending upon the criticality of the services, a company could go bankrupt. Despite all this, even the company isn’t the biggest loser; it is the end-user, and that is what the rest of this post will focus on.

Let’s say you’re a senior in college. It’s spring term, and graduation is just one week away. Your school has an online system to submit assignments, which are due at midnight the day before finals week. Like most students at the school, you log into the online assignment submission module, just like you have always done. Except this time, you get a spinning wheel. Nothing will load. It must be your internet connection. You call a friend to have her submit your papers, but she can’t log in either. The culprit: the system is down.

Now, it’s 10:00 PM and you need to submit your math assignment before midnight. At 11:00 PM you start to panic. You can’t log in and neither can your classmates. Everyone is scrambling. You send a hastily written email to your professor explaining the issue. She is unforgiving because you shouldn’t have procrastinated in the first place. At 1:00 AM, you refresh the system and everything is working (slowly), but the deadlines have passed. The system won’t let you submit anything. Your heart sinks as you realize that without that project, you will fail your math class and not be able to graduate.

This system outage caused heartache, stress, and uncertainty for the students and teachers, along with a whole lot of pain for the administrators. The kicker is that the downtime happened when traffic was anticipated to be the highest! Of course the servers are going to be overloaded during the last week of spring term. Yet, predictably, the University will send an email stating that it experienced higher than expected loads and that, ultimately, it wasn’t prepared for them.

During this time, traffic was 15 times its normal usage, and the hypervisor hosting the NFS server and the file sharing system was flooded with requests. It blew a fan and eventually overheated. Sure, the data was still safe inside the SAN on the backend. However, none of that mattered when the students couldn’t access the data until the admin rebuilt the hypervisor. By the time the server was back up and running, the damage was done.

High Availability isn’t a simple concept, but it is critical for your organization, your credibility, and even more importantly, for your end-users or customers. In today’s world, the bar for “uptime” is monstrously high; therefore, downtime is simply unacceptable.

If you’re a student, an admin, or a simple system user, I have a question for you (and don’t just think about yourself, think about your boss, colleagues, and clients):

What would your day look like if your services went unresponsive right… NOW?!
Learn more about the costs and drivers of data loss, and how to avoid it, by reading the paper from OrionX Research.


Greg Eckert
In his role as the Director of Business Development for LINBIT America and Australia, Greg is responsible for building international relations, both in terms of technology and business collaboration. Since 2013, Greg has connected potential technology partners, collaborated with businesses in new territories, and explored opportunities for new joint ventures.

Don’t Settle for Downtime

Innovative Data Storage Can Save Cash, Headaches, and Your Data

Storage Downtime is Unacceptable

When the network goes down, everyone is mildly annoyed, but when the storage goes down, “Everyone loses their mind,” as the Joker would say. And for good reason. No one likes losing payroll data, shipments, customer information, financial transactions, or CRM information… And they certainly don’t like waiting while you roll back to your latest backup. Internally and externally, data loss and downtime waste valuable resources and hurt company reputation. Downtime is becoming less acceptable every day, and data loss even more so. Stable, safe, and secure storage should be a priority for those responsible for protecting their business (just ask Equifax).

Traditional Solutions

Due to the increasing need for high availability (HA) and disaster recovery (DR), proprietary storage companies like NetApp and Dell EMC have provided SAN and NAS technologies to protect your organization’s most important data. These hardware appliances often have no single point of failure, synchronous data replication, and even a nice GUI so that users can point-and-click their way around. The downside? These storage appliances aren’t scalable and they are expensive. Really expensive.

The Obvious (or not so obvious) Alternative

Did you know that resiliency is built into your Linux OS? That’s right, built into the mainline Linux kernel is everything you need to replace your shared storage. For over 15 years, LINBIT has been creating the DRBD software, designed to synchronously replicate data between Linux servers seamlessly, just like your SAN. It can even trick the applications above it into believing they are writing to a SAN when, in reality, they are writing to standard x86, ARM, or Power boxes. The full LINBIT HA solution combines the DRBD software with open source fail-over software as well. This combination eliminates the need for proprietary shared storage solutions. So, why aren’t you using it? You probably didn’t know that it existed.

For the past 20 years, those with IT know-how and small budgets have found that HA clustering, using commodity off-the-shelf hardware, was an affordable alternative to traditional storage methods. This crowd consisted of the standard Linux hacker rolling out a home-brewed web server, and the hyperscale players who didn’t want to rely on outside vendors to build their cloud. Because these hyperscale companies use the software to create a competitive advantage, they aren’t all that eager to share their stories. They have kept the mid-market in the dark.

Almost all of the major players (including Google, Cisco, Deka Bank, HP, Porsche, and the BBC) have realized that using standard hardware instead of proprietary appliances creates a competitive advantage. Namely: inexpensive resilient storage that their competitors are paying an arm and a leg for. Now, the storage industry’s best kept secret is finally out.

It Doesn’t Stop There

LINBIT is pioneering open source SDS. In development for over 7 years, the new solution will create standard High Availability clusters like those described above, and also work perfectly for cloud storage. The LINBIT SDS software introduces performance advantages and scalability to the design. LINBIT has created a sort of “Operating System based,” Open Source, Software Defined Storage technology that is already built into your existing operating system and ready to use with any Linux system.

The Default Replication Option

LINBIT’s DRBD software receives about 10,000 confirmed downloads per month (people who opt in to show their statistics). LINBIT is far more engineering and development focused than sales focused, so if you aren’t solving a real-world problem you have probably never run into them. The popularity of LINBIT’s software is user driven, for three main reasons:

Flexibility: Since the DRBD software replicates data at the block level, it works with any filesystem, VM, or application that writes data to a hard drive. It can replicate multiple resources simultaneously so users don’t have to choose different replication technologies for every application/database running on the server.

Stability: Being accepted into the mainline Linux kernel is a very stringent process. DRBD has been in the kernel since 2009, version 2.6.33.

Synchronous: Prior to DRBD’s availability (no pun intended), the only option for synchronous replication was hardware (SAN, NAS devices). The DRBD software can run in synchronous or asynchronous mode, and be used for local replication or Geo Clustering across long distances.
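The difference between the synchronous and asynchronous modes comes down to when a write is acknowledged to the application. The toy model below is only an illustration of that idea and not how DRBD is implemented; the Node and ReplicatedDevice classes and their methods are hypothetical names made up for this sketch. In synchronous mode the write is acknowledged only after the peer has it, while in asynchronous mode the write is acknowledged once it is stored locally and queued for the peer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """A stand-in for one server's block storage."""
    name: str
    blocks: Dict[int, bytes] = field(default_factory=dict)

    def write(self, block_no: int, data: bytes) -> None:
        self.blocks[block_no] = data


class ReplicatedDevice:
    """Toy replicated block device with a synchronous and an asynchronous mode."""

    def __init__(self, primary: Node, peer: Node, synchronous: bool) -> None:
        self.primary = primary
        self.peer = peer
        self.synchronous = synchronous
        # Writes acknowledged to the application but not yet on the peer.
        self.in_flight: List[Tuple[int, bytes]] = []

    def write(self, block_no: int, data: bytes) -> str:
        self.primary.write(block_no, data)
        if self.synchronous:
            # Acknowledge only after the peer has stored the block too.
            self.peer.write(block_no, data)
        else:
            # Acknowledge right away; the peer catches up later.
            self.in_flight.append((block_no, data))
        return "ack"

    def flush(self) -> None:
        """Deliver queued writes to the peer (asynchronous mode only)."""
        while self.in_flight:
            block_no, data = self.in_flight.pop(0)
            self.peer.write(block_no, data)


if __name__ == "__main__":
    sync_dev = ReplicatedDevice(Node("alpha"), Node("bravo"), synchronous=True)
    sync_dev.write(0, b"payroll")
    print("sync peer has block:", 0 in sync_dev.peer.blocks)

    async_dev = ReplicatedDevice(Node("alpha"), Node("bravo"), synchronous=False)
    async_dev.write(0, b"payroll")
    print("async peer has block before flush:", 0 in async_dev.peer.blocks)
    async_dev.flush()
    print("async peer has block after flush:", 0 in async_dev.peer.blocks)
```

The trade-off the model shows is that if the primary fails before queued writes reach the peer, an asynchronous setup can lose acknowledged data, whereas a synchronous setup cannot, at the cost of waiting on the network for every write.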

Now that DRBD has tools to provision your storage, scaling out has never been easier. Interested in how this might apply to your projects? Check out some of LINBIT’s (free) innovative technical documents, which describe how to set up a cluster for your specific environment. Have an idea that isn’t covered in the documentation? Reach out to LINBIT by email and ask if your idea is sane. They’ll consult the LINBIT engineering team, and will point you in the right direction. Most importantly, NEVER settle for unplanned downtime.

Find out more about the costs of downtime in the podcast, The OrionX Download with LINBIT CEO, Brian Hellman.

DRBD and Randtronics DPM

Today we’re happy to announce a new document titled “Block Replication with Filesystem Encryption” which showcases another wonderful use case for DRBD.

Block Replication with Filesystem Encryption

At HostingCon, back in April of this year, some colleagues of mine ran into some representatives from Randtronics. Randtronics is the company responsible for the DPM (Data Privacy Management) software suite. This software suite provides file encryption, user management, ACLs, and more. I could imagine this software would prove useful to those in fields where data privacy is an absolute must. Fields such as medicine, law, human resources, and intellectual property quickly come to mind.

(Graphic is property of Randtronics)

After a brief discussion with us regarding just how versatile DRBD can be, it was decided to see if perhaps DRBD could work seamlessly with DPM. Randtronics’ DPM can help protect your data from prying eyes, or those who may wish to steal it, but can it protect your data from system failures? When teamed up with DRBD, you can be assured that your data is both secure and available.

I worked briefly with Gary Lansdown of Randtronics to introduce him to asciidoc, but I must give credit to Randtronics for this document.

Secure Linux: Atomicorp includes DRBD for replication

Every so often we get a chance to test new¹ software. Usually this opportunity is driven by the question: Does DRBD play nicely with it?

At HostingCon this year, we met a team from Atomicorp and decided that it would be interesting to see if we could get DRBD running on this hardened version of Linux. Overall, LINBIT’s broad client base loosely includes “security,” since “Availability” is one of the three pillars of the CIA security triad.


Image Source: Panmore Institute

Security certainly fits with Atomicorp since they focus on clients in the federal, financial, healthcare, and hosting space. Their HQ is based in the same business park as Raytheon, Boeing, and Booz Allen Hamilton, if that tells you anything about their market.

We frequently take on the challenge of seeing if we can get DRBD compiled and working correctly, like that time we installed it on two Raspberry Pis, and this case was no different. While we were confident that there wouldn’t be issues with installation (after all, it’s Linux), we needed to verify compatibility with the ASL (Atomic Secured Linux™) hardened kernel before announcing that it works.

After speaking with the Atomicorp team, they let us know that some of their clients were already running DRBD and Pacemaker for High Availability within their data centers. That’s great news! We anticipated that the testing would go quickly since we already had verified users.

Upon installing DRBD on a pair of RHEL 7 systems, we found something unexpected. DRBD is already included in the ASL kernel. This means Atomicorp is hardening and packaging a newer mainline kernel instead of hardening that which the distribution supplies. Nice work Atomicorp! The DRBD 8.4.5 version in the ASL kernel is pretty recent too.

It’s funny. Clients often ask us if we have seen DRBD used for their specific use case. DRBD is so versatile that we’re not always familiar with every situation. If we had been asked if anyone was using DRBD with Atomicorp’s ASL product, we would have said “I don’t know.” The irony here is that when you install the ASL hardened kernel, you may automatically get DRBD on a distribution where you otherwise may have not. It is available for everyone who runs Atomicorp’s ASL kernel whether the end user leverages the replication functionality or not².

This isn’t just a fun, internal office story; this is the essence of how Open Source Software works. We now know that there is a connection between ASL and DRBD, and are delighted to work with Atomicorp moving forward. It just makes sense since end-clients of both Atomicorp and LINBIT achieve feature-sets that they wouldn’t have otherwise. Altogether, our partners help advocate for our open source software and when our solutions are combined, everyone keeps inching toward bigger and better solutions, while maintaining focus on their core competencies.

So does the DRBD software work with Atomicorp and the Atomic Secured Linux™ kernel? Of course it does; and now, for the next few weeks, I get to be mocked by my coworkers for having our engineers test something which already had our software baked into it. 😉


1: New to us.
2: You’ll still need the userland utilities to manage and initialize DRBD, but that’s less of a security concern than compiling and inserting a kernel module.

Would you want to be your own car mechanic?

Data seems to be on everyone’s mind these days.  From employee to financial data, your company has to keep it available through seamless replication — without downtime. LINBIT DRBD is the open source software that ensures High Availability for your enterprise.

Read more

Persistent and Replicated Docker Volumes with DRBD9 and DRBD Manage

drbdmanage has been replaced by LINSTOR.

You can find more information about LINSTOR:


Thank you and have a fun read!



Nowadays, Docker has support for plugins; for LINBIT, volume plugins are certainly the most interesting feature. Volume plugins open the way for storing content residing in usual Docker volumes on DRBD backed storage.

In this blog post we show a simple example of using our new Docker volume plugin to create a WordPress powered blog with a MariaDB database, where both the content of the blog and the database are replicated between two cluster nodes. Read more

Testing SSD Drives with DRBD: Intel DC 3700 Series

Over the next few weeks we’ll be posting results from tests that we’ve run against various manufacturers’ SSD drives, including Intel, SanDisk, and Micron, to name a few.

The first post in this series goes over our findings on the Intel DC S3700 Series 800GB SATA SSD drives. Read more

Change the cluster distribution without downtime

Recently we upgraded one of our virtualization clusters (more RAM) and, in the course of this, migrated the virtualization hosts from Ubuntu Lucid to RHEL 6.3, all without any service interruption. Read more

Mirrored SAN vs. DRBD

Every now and then we get asked, “Why not simply use a mirrored SAN instead of DRBD?” This post shows some important differences. Read more