Consul for Cluster Health Monitoring

Posted by Owen Zanzal on May 22, 2015 8:09:00 AM

If you’re not familiar with Consul, it’s what I call a cluster management tool. It’s composed of a handful of features such as “Service Discovery”, “Key Value Store”, “DNS Server”, and “Health Checking”, and it’s “Data Center Aware”. Together these let you manage an infrastructure composed of many applications, configure them dynamically, route traffic to the healthy ones, and reroute traffic away from the unhealthy ones.

At VividCortex we don’t use most of Consul’s features (at least not yet). We only use Health Checking, and we use it instead of Nagios. This may seem a little strange, because Health Checking isn’t really a flagship feature of Consul the way Service Discovery or Key Value Store is.

What does Consul’s Health Checking do? It detects nodes in the cluster that are not healthy and then removes them from the list of available nodes. If you’re using the DNS Server, then dead nodes are no longer listed and will not receive traffic until they’re healthy again. If you’re using the REST API, then you can watch for that change and dynamically update your proxy configuration to remove unhealthy nodes.
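
As a rough sketch of the REST API approach, the snippet below filters a service health response down to the addresses of nodes whose checks are all passing. It assumes `jq` is installed, a local agent on the default HTTP port 8500, and a hypothetical service named `web`; the function name is made up for illustration and none of this comes from our own setup.

```shell
#!/bin/bash
# Sketch: print the address of every node whose checks are all passing.
# Reads the JSON array returned by /v1/health/service/<name> on stdin.
# Assumes jq is installed; passing_addresses is a hypothetical name.
passing_addresses() {
  jq -r '.[]
    | select(([.Checks[] | select(.Status != "passing")] | length) == 0)
    | .Node.Address'
}

# Example against a local agent (service name "web" is an assumption):
#   curl -s http://localhost:8500/v1/health/service/web | passing_addresses
```

The resulting list is the kind of input you could feed into a proxy configuration template.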

That’s pretty cool, so why aren’t we using it? The answer is that we don’t need it yet. Consul was basically brand new when we began using it, and we didn’t want to put it in charge of pulling/removing nodes until we had a better understanding of how to configure it. We simply wanted checks and alerts when various services on nodes weren’t behaving correctly or disk space thresholds were being exceeded.

Nagios is commonly used for such alerts, but it seems like whenever you say Nagios in the ops world, people look at you like you said a swear word. I don’t think I have ever met a person who likes Nagios. I can’t go into specifics myself, because I’ve been lucky enough never to have had to install and configure it. I decided it was best to avoid Nagios and find something else. Consul has a very nice REST API, and it comes with a clean and simple GUI that is good enough for finding the status of a check. Unlike Nagios, it’s distributed, so there is no single point of failure, which also lets it scale much more easily. It’s written in Go, and if you know VividCortex, you know we like Go. These are the attributes that set Consul apart from Nagios and similar systems.

Both Nagios and Consul perform health checks by running a script on the server. If the script exits with 0, the check passes; if it exits with 1, it’s a warning; any other exit code is a failure. Scripts can be written in anything (Bash, Perl, Python, etc.), so all the Nagios plugins you know and loathe can be used with Consul. Checks come in two forms: a service check, which is associated with a specific service, or a general check, which is not associated with any service. Check definitions live in a config directory on each node Consul is deployed to. Below is an example that uses the Nagios check_disk plugin, which can easily be installed with $ yum install nagios-plugins-disk.

{
    "check": {
        "id": "check-disk",
        "name": "Check Disk Utilization",
        "script": "/usr/lib64/nagios/plugins/check_disk -w 10% -c 5%",
        "interval": "30s"
    }
}
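
If you would rather write your own check than use a Nagios plugin, the exit-code convention is easy to follow. Here is a minimal sketch; the function name `check_usage` and the thresholds are illustrative assumptions, not something we ship.

```shell
#!/bin/bash
# Sketch: map a usage percentage to a Consul/Nagios-style exit status.
# 0 = passing, 1 = warning, anything else (here 2) = critical.
# check_usage and its arguments are hypothetical names.
check_usage() {
  used="$1"; warn="$2"; crit="$3"
  if [ "$used" -ge "$crit" ]; then
    echo "CRITICAL: ${used}% used"
    return 2
  elif [ "$used" -ge "$warn" ]; then
    echo "WARNING: ${used}% used"
    return 1
  fi
  echo "OK: ${used}% used"
  return 0
}

# Example use in a check script (thresholds are assumptions):
#   used=$(df -P / | awk 'NR==2 { sub("%", "", $5); print $5 }')
#   check_usage "$used" 90 95; exit $?
```

Whatever the script prints is what shows up as the check’s output, so a short human-readable message is worth including.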

In addition to custom health checks, Consul has a built-in health check that raises a critical error when a node leaves the cluster without saying goodbye. This is called the Serf health check. Serf is the implementation of the gossip protocol that Consul uses. At a high level, gossip allows existing nodes in a cluster to discover new nodes in a decentralized and scalable manner. If a node leaves the cluster without explicitly announcing it, we can take that as a strong signal that the node is no longer reachable inside the cluster, which makes the Serf health check a valuable alert about the accessibility of a given node.

Another difference between Consul and Nagios is that Consul does not provide any built-in way to route a failing health notification to someone. If you want to dispatch that critical alert to one or more email addresses, or to a third-party service like PagerDuty, it’s up to you to do so. Consul does come with a CLI tool that can make this integration easier.

Consul has a feature called Watch, which lets you watch for changes on a particular API resource. When a change happens it will fire whatever script you specify. In that script you can obtain the changed resource and do something with it. In this case we can watch for health checks that are Critical and forward on the details of that Critical alert to a 3rd party ticketing and escalation service via an API request. Here is an example of pushing those alerts to VictorOps.

{
  "type": "checks",
  "state": "critical",
  "handler": "./critical-handler.sh"
}
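
For experimenting, a comparable watch can be started in the foreground with the `consul watch` command instead of a config file. This is only a sketch: it assumes the `consul` binary is on the PATH and a local agent is running, and the wrapper function name is made up for illustration.

```shell
#!/bin/bash
# Sketch: run a handler whenever the set of critical checks changes.
# Assumes the consul binary is on PATH and a local agent is reachable.
# watch_critical is a hypothetical wrapper name.
watch_critical() {
  handler="$1"
  consul watch -type=checks -state=critical "$handler"
}

# Usage (runs until interrupted):
#   watch_critical ./critical-handler.sh
```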

The critical-handler.sh script looks something like this…

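#!/bin/bash

# Consul passes the watch result (a JSON array of checks) on stdin.
# A file path can be given as the first argument for local testing.
[ $# -ge 1 ] && [ -f "$1" ] && input="$1" || input="-"

json=$(cat "$input")

# Unix timestamp in seconds
TS=$(date +%s)
TOKEN="<your_token>"
API_URL="https://alert.victorops.com/integrations/generic/20131114/alert/$TOKEN/prod-critical"

# Walk the array indices and send one alert per critical check.
echo "$json" | jq -r 'keys[]' |
while read -r key
do
  # -c keeps the payload on a single line
  pl=$(echo "$json" | jq -c "{\"message_type\":\"CRITICAL\",\"timestamp\":\"$TS\",\"entity_id\":\"\(.[$key].Node)/\(.[$key].CheckID)\",\"state_message\":\"\(.[$key].Output)\"}")

  # Quote the payload so curl receives it as one argument
  curl -X POST "$API_URL" -d "$pl"
done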

With that script, we get VictorOps notifications when Consul runs Nagios health checks and finds problems. Cool!
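
One possible hardening of the payload step, sketched here as an alternative rather than what we run: build every payload in a single `jq` pass and hand the timestamp to `jq` with `--arg`, so no shell variable is spliced into the jq program. The function name `build_payloads` is hypothetical, and the field names mirror the script above.

```shell
#!/bin/bash
# Sketch: emit one compact alert payload per check read from stdin.
# $1 is the timestamp; build_payloads is a hypothetical helper name.
# Assumes jq is installed.
build_payloads() {
  jq -c --arg ts "$1" '.[] | {
    message_type: "CRITICAL",
    timestamp: $ts,
    entity_id: "\(.Node)/\(.CheckID)",
    state_message: .Output
  }'
}

# Usage inside a handler (input and API_URL as in the script above):
#   cat "$input" | build_payloads "$(date +%s)" | while read -r pl; do
#     curl -X POST "$API_URL" -d "$pl"
#   done
```

Because each payload arrives on its own line, the loop needs no index bookkeeping.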

In conclusion, it’s OK with us that Consul isn’t as sophisticated as Nagios when it comes to infrastructure monitoring. Although it wasn’t designed purely for health monitoring, its distributed nature and use of a gossip protocol make it a very robust and scalable alternative. We really only have a couple of alerts that we categorize as “Actionable”. Consul’s health checking, combined with watches that route health check changes to a third-party ticketing/escalation service, turns out to be all we need. As we grow and iterate on our infrastructure, we will continue to evaluate Consul’s other features as possible solutions. For now, we get value out of Consul, and it has saved us the pain of having to configure and maintain Nagios.
