You’ve likely heard of Prometheus, the open-source monitoring and alerting solution originally developed at SoundCloud. In 2016, Prometheus became the second project incubated by the Cloud Native Computing Foundation (the first being Kubernetes), and it is used by companies like Google, Red Hat, and DigitalOcean as a scalable and efficient way to monitor infrastructure in a cloud-native way. If this is the first you’ve heard of Prometheus, I strongly recommend heading over to Prometheus.io to read through the introduction in their documentation.
ClusterLabs, the organization that unifies the open-source projects pertaining to Linux High Availability (HA) Clustering, recently (Fall 2019) added the ha_cluster_exporter project to its GitHub organization. The ha_cluster_exporter project is a Prometheus exporter that exposes Linux HA Cluster metrics to Prometheus!
Linux HA Cluster with Prometheus
Linux HA Clustering is all about service availability and uptime, so you won’t have users informing you when a node in your cluster has died, since they shouldn’t notice. Because of this, LINBIT has heard the following story a few times: “No one knew the Primary cluster node had failed until the Secondary node failed as well.” Ouch. Besides offering our condolences for what was likely a résumé-generating event (RGE) for someone, we can usually only suggest that they set up better monitoring and alerting of their cluster nodes with the software of their choice. However, after less than a day of playing with Prometheus and the ha_cluster_exporter, I think these tools may have just jumped to the top of my list of recommended monitoring stacks.
After running through Prometheus’ installation documentation, and compiling the ha_cluster_exporter as documented in the project’s README.md, I was quickly querying Prometheus for my Linux HA Cluster’s metrics:
- How synchronized is DRBD (kicked off a full resync prior to capturing):
- What’s the Pacemaker fail count on a given resource (p_vip_ip in this capture):
- Have the Corosync rings (communication networks) experienced any errors:
- Adding Prometheus’ Node Exporter (an open-source exporter developed under the Prometheus GitHub org) to my Linux HA Cluster nodes enabled me to extract metrics from the Linux kernel (writes to DRBD in MB/s):
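The same kinds of questions can also be asked programmatically through Prometheus’ HTTP API, which serves instant queries at `/api/v1/query`. What follows is only a minimal sketch: the server address assumes a default local install, and the metric and label names in `QUERIES` are my assumptions about what the two exporters expose, so check them against each exporter’s `/metrics` output before relying on them.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable on its default port on this host.
PROMETHEUS_URL = "http://localhost:9090"

# Assumption: these metric and label names are illustrative guesses at what
# ha_cluster_exporter and Node Exporter expose; verify them before use.
QUERIES = {
    "pacemaker_fail_count": 'ha_cluster_pacemaker_fail_count{resource="p_vip_ip"}',
    "corosync_ring_errors": "ha_cluster_corosync_ring_errors",
    "drbd_write_mb_per_s": 'rate(node_disk_written_bytes_total{device=~"drbd.*"}[1m]) / 1e6',
}


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_instant_vector(payload):
    """Turn an instant-query response into a list of (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    # Each sample looks like {"metric": {...labels...}, "value": [timestamp, "string"]}.
    return [(sample["metric"], float(sample["value"][1])) for sample in data["result"]]


def query(base_url, promql, timeout=5.0):
    """Run one instant query against a live Prometheus server."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))
```

Against a running server, looping over `QUERIES.items()` and calling `query(PROMETHEUS_URL, promql)` for each entry would give you the current values with their labels.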
For a complete list of the metrics exposed by the ha_cluster_exporter, please see the ha_cluster_exporter’s documentation on GitHub. For a complete list of the metrics exposed by Prometheus’ Node Exporter, as well as how to build and install it, please see the Prometheus Node Exporter’s documentation on GitHub.
Alertmanager in Prometheus
Using Prometheus to collect and query these metrics is a great first step, but alerting is probably the most important aspect of monitoring, and often the most difficult to configure. Prometheus splits alerting out into a separate application called Alertmanager. You configure your alerting rules inside of Prometheus, and Prometheus sends its alerts to an Alertmanager instance. The Alertmanager then, based on the configuration you’ve given it, deduplicates, groups, and routes the alerts to the correct receiver, such as email, Slack, PagerDuty, Opsgenie, HipChat, and more. The Prometheus Alertmanager configuration documentation has example configs for all of the aforementioned alert receivers.
I was able to quickly set up a Slack Webhook app in a Slack channel named #prometheus-slack, and add an alert rule to Prometheus instructing it to send an alert to Slack when the sum of all Pacemaker resources’ failcounts exceeds zero for longer than five minutes. I created a resource failure in my Pacemaker cluster by removing the cluster-managed virtual IP address from the active cluster node, and five minutes later, as configured, I received a message in Slack with a link to the alert specifics in Prometheus.
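For readers who want to try something similar, here is a rough sketch of what such a rule could look like, rendered from Python so the expression and hold time are easy to change. It is not the exact rule from my test setup: the group name, alert name, `severity` label, and summary text are hypothetical, and the metric name `ha_cluster_pacemaker_fail_count` is an assumption to verify against the exporter’s documentation. The Slack receiver itself is configured separately in Alertmanager and is not shown.

```python
import json

# Layout of a Prometheus alerting rule file; the group name, alert name, and
# severity label filled in below are hypothetical, not taken from my cluster.
RULE_TEMPLATE = """\
groups:
  - name: {group}
    rules:
      - alert: {alert}
        expr: {expr}
        for: {hold}
        labels:
          severity: critical
        annotations:
          summary: "Pacemaker resource failures detected"
"""


def render_rule_file(expr, hold="5m", group="ha_cluster", alert="PacemakerResourceFailures"):
    """Render a one-rule file that fires once `expr` has held true for `hold`."""
    # json.dumps emits a double-quoted string, which is also a valid YAML
    # scalar, so braces or colons in the PromQL expression stay safe.
    return RULE_TEMPLATE.format(group=group, alert=alert, expr=json.dumps(expr), hold=hold)


# Assumption: the fail count metric name is illustrative; verify before use.
RULE_TEXT = render_rule_file("sum(ha_cluster_pacemaker_fail_count) > 0")
```

Writing `RULE_TEXT` to a file listed under `rule_files` in the Prometheus configuration, and validating it with `promtool check rules`, would be the next step. The `for: 5m` clause is what produces the five-minute delay between the failure and the Slack message.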
The last, but probably the most satisfying, part of my testing was integrating Grafana into the monitoring stack for data visualization. Grafana is an open-source project that lets you query, visualize, and alert on data gathered through any of its data source plugins (many of which come bundled with Grafana), bringing it all together in a single place as easily understandable dashboards. Grafana, since version 2.5.0, natively includes Prometheus as one of its data source plugins, so integration is very easy.
In less than 30 minutes, I had set up what I thought was a nice little HA Cluster Dashboard using the same, or very similar, expressions as used when querying Prometheus directly. Grafana makes customizing the visualizations very easy through its web front end. For example, the “Pacemaker Resource Failures” counter value in the HA Cluster Dashboard depicted below will turn from green to red when the counter exceeds zero:
I’ve been very impressed by how easily all these different tools integrate with one another, and I encourage anyone wondering what they should do for monitoring to test them out. Even though this blog post describes monitoring, alerting, and visualizing machine-centric metrics from Linux HA Cluster nodes, the ease of integration fits perfectly into the new world of microservices and cloud-native applications, so use cases and implementations are plentiful. For example, using Prometheus and Prometheus’ Node Exporter on my Kubernetes worker nodes to expose metrics pertaining to dynamically provisioned LINSTOR volumes is a no-brainer for a future blog post.
If you’re already using Prometheus and the ha_cluster_exporter to monitor your Linux HA Clusters, let us know how in the comments or via email.