In the previous post we created a small Consul cluster that kept track of four services: two web services and two db services. However, we didn’t tell the Consul agents how to monitor those services, so they completely missed the fact that none of the services actually exists. So today we’re going to take a closer look at Consul’s health checks and see what effect they have on service discoverability.
What’s a Consul health check?
Consul can assign zero or more health check procedures to individual services or to the whole host. A check can be as simple as an HTTP call to the service or a ping request, or something more sophisticated like launching an external program and analyzing its exit code. Then, depending on the check result, Consul puts the service (or all services at the host) into one of three states: healthy, warning or critical.
‘Healthy’ and ‘warning’ services behave as usual, but Consul gives special treatment to ‘critical’ ones: they won’t be discoverable via DNS. What’s more, if a service doesn’t get back to healthy within a given amount of time, it can be deregistered completely. However, such behavior is disabled by default.
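For the record, opting into automatic deregistration is just one extra key in the check definition. Here’s a hedged sketch, assuming a Consul version that supports the deregister_critical_service_after setting; the 30m value is an arbitrary illustration:

...
"checks": [{
  "http": "http://localhost",
  "interval": "10s",
  "deregister_critical_service_after": "30m"
}]
...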
Types of health checks
There’re five types of health checks: HTTP, TCP, TTL, Script and Docker.
HTTP
An HTTP check is a good old HTTP call to an arbitrary URL. If the URL responds with a 2xx status code, then naturally the service is considered healthy; 429 (Too Many Requests) results in a warning, and everything else in critical status. What’s convenient is that Consul stores the HTTP response as a note to the health check, so the service can return some additional details about its health.
Configuring an HTTP health check is quite trivial:
{
  "id": "web",
  "name": "NGINX responds at port 80",
  "http": "http://localhost",
  "interval": "10s"
}
TCP
A TCP health check, on the other hand, is even simpler: it just tests whether or not the given host and port are reachable, resulting in either a healthy or a critical status.
...
"tcp": "127.0.0.1:8500",
...
TTL
TTL (Time to Live) checks use a completely different approach: they expect the service itself to ping the Consul agent once in a while. We define the maximum time between the pings, and as soon as the service misses that time window it’s considered to be in critical state.
There are three URLs that the service can send a GET request to:
/v1/agent/check/pass/%checkId%
/v1/agent/check/warn/%checkId%
/v1/agent/check/fail/%checkId%
Accessing any of those URLs resets the TTL timer and additionally puts the service into one of the three health states. The JSON configuration for the check is still trivial:
{
  "id": "myserv",
  "name": "My Service Status",
  "notes": "It should ping the agent at least once per 30s",
  "ttl": "30s"
}
Script
As the name suggests, a ‘script’ check starts an external application or script and interprets its exit code as the new health status: 0 for healthy, 1 for warning and anything else for critical. It also remembers the script’s output and adds it to the health check notes.
...
"name": "Memory usage",
"script": "/home/docker/memusage.sh",
"interval": "10s"
...
It’s probably not immediately obvious, but checking a service’s health status doesn’t necessarily mean checking whether it exists. We can also check whether we still have enough disk space and memory, or whether CPU usage is reasonable, and the script check is the way to do it.
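For instance, a similar check for disk space could look like this. It’s a minimal sketch, assuming a POSIX shell and arbitrarily chosen 80%/90% thresholds:

#!/bin/sh
# report how full the root filesystem is and map that to Consul exit codes
usage=$(df -P / | awk 'NR==2 {gsub("%", "", $5); print $5}')
echo "Root filesystem is ${usage}% full"
if [ "$usage" -gt 90 ]; then
  exit 2   # critical
elif [ "$usage" -gt 80 ]; then
  exit 1   # warning
else
  exit 0   # healthy
fi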
Docker
A Docker check is very similar to a script check. The difference is that the script is run inside a container via the docker exec command.
...
"docker_container_id": "bceff99",
"shell": "/bin/bash",
"script": "/root/memusage.sh"
...
Improving our cluster with health checks
The biggest problem with the cluster we created last time was the lack of meaning and health checks. Even though there’s not much we can do about the first one, adding health checks is something very doable.
I’m going to add two of them: one for testing the web service, and one for testing the whole host.
HTTP check for web service
If you remember, our config file with the service definitions looked like this:
{
  "services": [{
    "name": "web"
  }, {
    "name": "db"
  }]
}
In order to add a check to the web service we can either add a check definition at the configuration root and point it to the service via the service_id: "id" key, or add the check directly to the service definition. I’ll use the second approach.
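For completeness, the first approach would look roughly like this (a sketch; the check lives at the root and references the service by its id):

{
  "services": [{
    "id": "web",
    "name": "web"
  }],
  "check": {
    "id": "web-ping",
    "service_id": "web",
    "http": "http://127.0.0.1",
    "interval": "15s"
  }
}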
The check itself will be making requests to localhost, port 80, once every 15 seconds. This is how it looks in our configuration file:
{
  "services": [{
    "name": "web",
    "checks": [{
      "id": "web-ping",
      "http": "http://127.0.0.1",
      "interval": "15s"
    }]
  }, {
    "name": "db"
  }]
}
Now, let’s save it as services.json and feed it to the Consul agents at host-1 and host-2:
docker@host-1:~$ ./consul agent -advertise 192.168.99.102 -retry-join 192.168.99.100 -data-dir /tmp/consul \
    -config-file services.json

docker@host-2:~$ ./consul agent -advertise 192.168.99.101 -retry-join 192.168.99.100 -data-dir /tmp/consul \
    -config-file services.json
As a side note, since the last post I’ve restarted the whole cluster and it got new IPs: 192.168.99.100 for consul-server, and 192.168.99.102 and .101 for host-1 and host-2.
OK, services restarted; let’s head to the ‘services’ page of the Consul server UI:
Right on the first page it says that both web services are in critical state. Why wouldn’t they be? We haven’t created them yet. But before we do so, let’s try to run some DNS discovery queries.
DNS query for failed services
We know that db is still considered healthy, so let’s check it first:
dig @192.168.99.100 -p 8600 db.service.consul SRV
#...
#;; ANSWER SECTION:
#db.service.consul. 0 IN SRV 1 1 0 host-2.node.dc1.consul.
#db.service.consul. 0 IN SRV 1 1 0 host-1.node.dc1.consul.
#
#;; ADDITIONAL SECTION:
#host-2.node.dc1.consul. 0 IN A 192.168.99.101
#host-1.node.dc1.consul. 0 IN A 192.168.99.102
#
#;; Query time: 43 msec
#...
As expected, the request returned two entries, one for host-1 and one for host-2. Now, let’s try to find out anything about web:
dig @192.168.99.100 -p 8600 web.service.consul SRV
#...
#;; QUESTION SECTION:
#;web.service.consul. IN SRV
#
#;; AUTHORITY SECTION:
#consul. 0 IN SOA
#
#;; Query time: 38 msec
#...
Nothing.
OK, let’s fix that. They need web – we have plenty of that:
docker@host-1:~$ docker run -d --name nginx -p 80:80 nginx
#cc4d9cc7284b083b981900a2e7e8737f6bb1647e605bab067fd9b257d76e7d40

docker@host-2:~$ docker run -d --name nginx -p 80:80 nginx
#80b516609f5730054893501bcc385f1fd4d20b7eaef36311a4ba25e9ccc08954
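Before heading back to the UI, we can double-check by hand that NGINX responds the way the HTTP check expects (a quick sanity check on one of the hosts; output abbreviated):

docker@host-1:~$ curl -I http://localhost
#HTTP/1.1 200 OK
#...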
After starting two nginx containers at port 80, health checks should calm down:
They even have HTTP output in their notes:
A DNS query for the web service will also be successful now.
Host-wide health check
We can also run a health check for the whole host. Let’s try to put the host into critical state when it thinks it’s running out of memory. For that I’ve created a small shell script, so obviously we’ll use a ‘script’ check to run it.
The script itself is relatively simple. It reads memory usage and prints the statistics to the output, so the check has something to store as a note to the current health status. It then calculates the percentage of used memory and exits with code 2, 1 or 0 for the critical, warning or healthy state, depending on whether memory usage is above 98%, above 80%, or anything lower.
#!/bin/sh
free
memusage=$(free | awk 'NR==2{printf "%d", $3*100/$2}')

echo
echo "Memory usage is roughly $memusage%"

if [ $memusage -gt "98" ]; then
  echo "Critical state"
  exit 2
elif [ $memusage -gt "80" ]; then
  echo "Warning state"
  exit 1
else
  exit 0
fi
The Consul agent config file changes just a little bit:
{
  "services": [{
    "name": "web",
    //...
  }, {
    "name": "db"
  }],
  "check": {
    "id": "Memory usage",
    "script": "/home/docker/memusage.sh",
    "interval": "15s"
  }
}
Update it, restart the agents and head back to the server UI:
Both hosts received the “Memory usage” check, which currently indicates only 6% usage. I wonder, however, what would happen if I lowered the memory threshold at e.g. host-1 from 98% down to 5%. There’s only one way to find out: edit, restart and refresh the UI.
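For reference, the edit on host-1 boils down to the first condition in memusage.sh, roughly like this:

...
if [ $memusage -gt "5" ]; then
  echo "Critical state"
  exit 2
...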
No surprise that host-1 is now in critical state:
What’s cool is that both the web and db services it hosts are also in critical state and therefore excluded from DNS queries:
dig @192.168.99.100 -p 8600 web.service.consul SRV
#...
#web.service.consul. 0 IN SRV 1 1 0 host-2.node.dc1.consul.
#...
#host-2.node.dc1.consul. 0 IN A 192.168.99.101
#...

dig @192.168.99.100 -p 8600 db.service.consul SRV
#...
#db.service.consul. 0 IN SRV 1 1 0 host-2.node.dc1.consul.
#...
#host-2.node.dc1.consul. 0 IN A 192.168.99.101
#...
Conclusion
Today we took a look at ways to check whether the services in our cluster are healthy and therefore should be discoverable by other peers. For that Consul has a variety of checks, starting from simple pings and HTTP requests and ending with something more powerful like script checks. All of those checks can be applied either to a single service or to the whole host with little to no change in their definitions.
However, these Consul health checks aren’t meant to be a replacement for proper host and application monitoring. They’re merely a way for a Consul agent to know whether the service it’s supposed to keep an eye on is still alive and operational.