Checking service health status with Consul

In the previous post we created a small Consul cluster that kept track of four services: two web and two db services. However, we didn't tell the Consul agents how to monitor those services, so they completely missed the fact that none of the services actually existed. So today we're going to take a close look at Consul's health checks and see what effect they have on service discoverability.

What is a Consul health check

Consul can assign zero or more health check procedures to an individual service or to the whole host. A check can be as simple as an HTTP call to the service or a ping request, or something more sophisticated like launching an external program and analyzing its exit code. Then, depending on the check's result, Consul puts the service (or all services on the host) into one of three states: healthy, warning or critical.

'Healthy' and 'warning' services behave as usual, but Consul gives special treatment to 'critical' ones: they won't be discoverable via DNS. What's more, if a service doesn't get back to healthy within a given amount of time, it can be deregistered completely. However, that behavior is disabled by default.
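If you do want it, it can be enabled per check via deregister_critical_service_after. A minimal sketch (the check details and the 30-minute value are arbitrary examples):

```json
{
  "check": {
    "id": "web-http",
    "http": "http://localhost:80/",
    "interval": "15s",
    "deregister_critical_service_after": "30m"
  }
}
```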

Types of health checks

There are five types of health checks: HTTP, TCP, TTL, Script and Docker.

HTTP

An HTTP check is a good old HTTP call to an arbitrary URL. If the URL responds with a 2xx status code, then naturally the service is considered healthy. A 429 (Too Many Requests) results in a warning, and everything else in critical status. Conveniently, Consul stores the HTTP response as a note on the health check, so the service can return some additional details about its health.

Configuring an HTTP health check is quite trivial:
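For illustration, a definition along these lines (the id, name, interval and timeout values here are arbitrary):

```json
{
  "check": {
    "id": "web-http",
    "name": "HTTP check on port 80",
    "http": "http://localhost:80/",
    "interval": "15s",
    "timeout": "1s"
  }
}
```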

TCP

A TCP health check, on the other hand, is even simpler: it just tests whether a connection can be established to the given host and port, resulting in either healthy or critical status.
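A sketch of its definition (the host, port and interval are made up):

```json
{
  "check": {
    "id": "db-tcp",
    "name": "TCP check on port 5432",
    "tcp": "localhost:5432",
    "interval": "30s"
  }
}
```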

TTL

A TTL (Time to Live) check uses a completely different approach: the service itself has to ping the Consul agent once in a while. We define the maximum time between pings, and as soon as the service misses that window it's considered to be in critical state.

There are three URLs the service can send a GET request to:
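With the agent's default HTTP port 8500, they are:

```
http://localhost:8500/v1/agent/check/pass/<check id>
http://localhost:8500/v1/agent/check/warn/<check id>
http://localhost:8500/v1/agent/check/fail/<check id>
```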

Accessing any of those URLs will reset the TTL timer and additionally put the service into the corresponding health state. The JSON configuration for the check is still trivial:
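Something along these lines (the id, name and TTL value are illustrative):

```json
{
  "check": {
    "id": "web-ttl",
    "name": "Web TTL check",
    "ttl": "30s"
  }
}
```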

Script

As the name suggests, a 'script' check starts an external application or script and interprets its exit code as the new health status: 0 for healthy, 1 for warning and anything else for critical. It also remembers the output and adds it to the health check notes.
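A minimal definition could look like this (the script path and interval are assumptions):

```json
{
  "check": {
    "id": "disk-usage",
    "name": "Disk usage",
    "script": "/scripts/check_disk.sh",
    "interval": "60s"
  }
}
```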

It's probably not immediately obvious, but checking a service's health status doesn't necessarily mean checking whether it exists. We can also check whether we still have enough disk space and memory, or whether CPU usage is reasonable, and the script check is the way to do it.

Docker

A Docker check is very similar to a script check. The difference is that the script is executed inside a container via docker exec.
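A sketch of such a check (the container id, shell and script path are made up):

```json
{
  "check": {
    "id": "web-in-container",
    "name": "Health script inside container",
    "docker_container_id": "f972c95ebf0e",
    "shell": "/bin/sh",
    "script": "/scripts/health.sh",
    "interval": "30s"
  }
}
```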

Improving our cluster with health checks

The biggest problem with the cluster we created last time was its lack of meaning and of health checks. Even though there's not much we can do about the former, adding health checks is very doable.

I'm going to add two of them: one for testing the web service, and one for testing the whole host.

HTTP check for web service

If you remember, our config file with the services definition looked like this:
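Roughly like this, anyway; I'm reconstructing it from memory, and the db port in particular is an assumption:

```json
{
  "services": [
    {
      "name": "web",
      "port": 80
    },
    {
      "name": "db",
      "port": 5432
    }
  ]
}
```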

In order to add a check to the web service we can either add a check definition at the configuration root and point it to the service via a service_id key, or add the check directly to the service definition. I'll use the second approach.
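For reference, the first approach would look something like this (assuming the service's id is simply web):

```json
{
  "check": {
    "id": "web-http",
    "service_id": "web",
    "http": "http://localhost:80/",
    "interval": "15s"
  }
}
```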

The check itself will make requests to localhost, port 80, once every 15 seconds. This is how it looks in our configuration file:
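Something like this (the db entry stays unchanged):

```json
{
  "services": [
    {
      "name": "web",
      "port": 80,
      "checks": [
        {
          "http": "http://localhost:80/",
          "interval": "15s"
        }
      ]
    },
    {
      "name": "db",
      "port": 5432
    }
  ]
}
```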

Now let's save it as services.json and feed it to the Consul agents at host-1 and host-2:
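One way to do it, assuming each agent reads its configuration from a directory like /etc/consul.d (the path is an assumption):

```sh
# Copy the definition into the agent's config directory and tell
# the agent to re-read it (run on both host-1 and host-2)
cp services.json /etc/consul.d/
consul reload
```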

As a side note, since the last post I've restarted the whole cluster and it got new IPs: 192.168.99.100 for consul-server, and 192.168.99.102 and .101 for host-1 and host-2.

OK, services restarted; let's head to the 'services' page of the Consul server UI:

[Image: Consul health check failing for web]

Right on the first page it says that both web services are in critical state. Why wouldn't they be? We haven't created them yet. But before we do, let's try to run some DNS discovery queries.

DNS query for failed services

We know that db is still considered healthy, so let’s check it first:
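Consul serves DNS on port 8600 by default, so the query against the server looks like this:

```sh
dig @192.168.99.100 -p 8600 db.service.consul
```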

As expected, the request returned two entries, for host-1 and host-2. Now let's try to find out anything about web:
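```sh
# same DNS endpoint, different service name
dig @192.168.99.100 -p 8600 web.service.consul
```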

Nothing.

OK, let’s fix that. They need web – we have plenty of that:
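For instance, on each of host-1 and host-2:

```sh
docker run -d -p 80:80 nginx
```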

After starting two nginx containers at port 80, the health checks should calm down:

[Image: Healthy web check]

They even have the HTTP output in their notes:

[Image: Web check output]

A DNS query for the web service will also be successful now.

Host-wide health check

We can also run a health check for the whole host. Let's try to put the host into critical state when it thinks it's running out of memory. For that I've created a small shell script, so obviously we'll use a 'script' check to run it.

The script itself is relatively easy. It reads the memory usage, prints the statistics to the output (so the check has something to store as a note for the current health status), calculates the percentage of used memory, and exits with code 2, 1 or 0 for the critical, warning or healthy state depending on whether memory usage is above 98%, above 80%, or anything lower.
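Here's a minimal sketch of such a script, assuming a Linux host where free is available:

```sh
#!/bin/sh
# Print memory statistics so the check has something to store as a note
free -m
# Calculate the percentage of used memory
used=$(free | awk '/Mem:/ { printf "%d", $3 / $2 * 100 }')
# Exit codes: 2 - critical (>98%), 1 - warning (>80%), 0 - healthy
if [ "$used" -gt 98 ]; then
  exit 2
elif [ "$used" -gt 80 ]; then
  exit 1
else
  exit 0
fi
```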

The Consul agent config file changes just a little bit:
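Roughly like this (the script path and interval are assumptions; the services stay as before):

```json
{
  "services": [
    {
      "name": "web",
      "port": 80,
      "checks": [
        {
          "http": "http://localhost:80/",
          "interval": "15s"
        }
      ]
    },
    {
      "name": "db",
      "port": 5432
    }
  ],
  "checks": [
    {
      "id": "mem-usage",
      "name": "Memory usage",
      "script": "/scripts/memory.sh",
      "interval": "60s"
    }
  ]
}
```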

Update it, restart the agents and head back to the server UI:

[Image: Host health check: healthy]

Both hosts received the "Memory usage" check, which currently indicates only 6% usage. I wonder, however, what would happen if I lowered the memory threshold at, say, host-1 from 98% down to 5%. There's only one way to find out: edit, restart and refresh the UI.

No surprise that host-1 is now in critical state:

[Image: Consul host memory usage critical]

What's cool, both the web and db services it hosts are also in critical state and therefore excluded from DNS queries:

[Image: All host services are failing]

Conclusion

Today we took a look at the ways to check whether the services in our cluster are healthy and therefore should be discoverable by other peers. For that, Consul has a variety of checks, starting from simple pings and HTTP requests and ending with something more powerful like scripts. All of those checks can be applied either to a single service or to the whole host with little to no change in their definitions.

However, these Consul health checks aren't meant to be a replacement for proper host and application monitoring. They're merely a way for the Consul agent to know whether the service it's supposed to keep an eye on is still alive and operational.
