In the previous post we created a small Consul cluster that kept track of four services: two web services and two db services. However, we didn’t tell the Consul agents how to monitor those services, so they completely missed the fact that none of the services actually exists. So today we’re going to take a closer look at Consul’s health checks and see what effect they have on service discoverability.
What’s a Consul health check?
Consul can assign zero or more health check procedures to individual services or to the whole host. A check can be as simple as an HTTP call to the service or a ping request, or something more sophisticated like launching an external program and analyzing its exit code. Then, depending on the check result, Consul puts the service (or all services at the host) into one of three states: healthy, warning or critical.
‘Healthy’ and ‘warning’ services behave as usual, but Consul gives special treatment to ‘critical’ ones: they won’t be discoverable via DNS. What’s more, if a service doesn’t get back to healthy within a given amount of time, it can be deregistered completely. However, such behavior is disabled by default.
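For the record, opting into automatic deregistration is just one extra key in the check definition. Here’s a hedged sketch, assuming a Consul version that supports the deregister_critical_service_after setting; the 30m value is an arbitrary illustration:

...
"checks": [{
  "http": "http://localhost",
  "interval": "10s",
  "deregister_critical_service_after": "30m"
}]
...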
Types of health checks
There’re five types of health checks: HTTP, TCP, TTL, Script and Docker.
HTTP
An HTTP check is a good old HTTP call to an arbitrary URL. If the URL responds with a 2xx status code, then naturally the service is considered healthy; 429 (Too Many Requests) results in a warning, and everything else in critical status. What’s convenient is that Consul stores the HTTP response as a note to the health check, so the service can return some additional details about its health.
Configuring an HTTP health check is quite trivial:
{
  "id": "web",
  "name": "NGINX responds at port 80",
  "http": "http://localhost",
  "interval": "10s"
}
TCP
A TCP health check, on the other hand, is even simpler: it just tests whether or not the given host and port are reachable, resulting in either a healthy or a critical status.
...
"tcp": "127.0.0.1:8500",
...
TTL
TTL (Time to Live) checks use a completely different approach: they expect the service itself to ping the Consul agent once in a while. We define the maximum time between the pings, and as soon as the service misses that time window it’s considered to be in critical state.
There are three URLs that the service can send a GET request to:
/v1/agent/check/pass/%checkId%
/v1/agent/check/warn/%checkId%
/v1/agent/check/fail/%checkId%
Accessing any of those URLs resets the TTL timer and additionally puts the service into one of the three health states. The JSON configuration for the check is still trivial:
{
  "id": "myserv",
  "name": "My Service Status",
  "notes": "It should ping the agent at least once per 30s",
  "ttl": "30s"
}
Script
As the name suggests, a ‘script’ check starts an external application or script and interprets its exit code as the new health status: 0 for healthy, 1 for warning and anything else for critical. It also remembers the script’s output and adds it to the health check notes.
...
"name": "Memory usage",
"script": "/home/docker/memusage.sh",
"interval": "10s"
...
It’s probably not immediately obvious, but checking a service’s health status doesn’t necessarily mean checking whether it exists. We can also check whether we still have enough disk space and memory, or whether CPU usage is reasonable, and the script check is the way to do it.
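For instance, a similar check for disk space could look like this. It’s a minimal sketch, assuming a POSIX shell and arbitrarily chosen 80%/90% thresholds:

#!/bin/sh
# report how full the root filesystem is and map that to Consul exit codes
usage=$(df -P / | awk 'NR==2 {gsub("%", "", $5); print $5}')
echo "Root filesystem is ${usage}% full"
if [ "$usage" -gt 90 ]; then
  exit 2   # critical
elif [ "$usage" -gt 80 ]; then
  exit 1   # warning
else
  exit 0   # healthy
fi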
Docker
A Docker check is very similar to a script check. The difference is that the script is run inside a container via the docker exec command.
...
"docker_container_id": "bceff99",
"shell": "/bin/bash",
"script": "/root/memusage.sh"
...
Improving our cluster with health checks
The biggest problem with the cluster we created last time was the lack of meaning and health checks. Even though there’s not much we can do about the first one, adding health checks is something very doable.
I’m going to add two of them: one for testing the web service, and one for testing the whole host.
HTTP check for web service
If you remember, our config file with the service definitions looked like this:
{
  "services": [{
    "name": "web"
  }, {
    "name": "db"
  }]
}
In order to add a check to the web service we can either add a check definition at the configuration root and point it to the service via the service_id: "id" key, or add the check directly to the service definition. I’ll use the second approach.
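For completeness, the first approach would look roughly like this (a sketch; the check lives at the root and references the service by its id):

{
  "services": [{
    "id": "web",
    "name": "web"
  }],
  "check": {
    "id": "web-ping",
    "service_id": "web",
    "http": "http://127.0.0.1",
    "interval": "15s"
  }
}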
The check itself will be making requests to localhost, port 80, once every 15 seconds. This is how it looks in our configuration file:
{
  "services": [{
    "name": "web",
    "checks": [{
      "id": "web-ping",
      "http": "http://127.0.0.1",
      "interval": "15s"
    }]
  }, {
    "name": "db"
  }]
}
Now, let’s save it as services.json and feed it to the Consul agents at host-1 and host-2:
docker@host-1:~$ ./consul agent -advertise 192.168.99.102 -retry-join 192.168.99.100 -data-dir /tmp/consul \
    -config-file services.json

docker@host-2:~$ ./consul agent -advertise 192.168.99.101 -retry-join 192.168.99.100 -data-dir /tmp/consul \
    -config-file services.json
As a side note, since the last post I’ve restarted the whole cluster and it got new IPs: 192.168.99.100 for consul-server, and 192.168.99.102 and .101 for host-1 and host-2.
OK, services restarted; let’s head to the ‘services’ page of the Consul server UI:
Right on the first page it says that both web services are in critical state. Why wouldn’t they be? We haven’t created them yet. But before we do so, let’s try to run some DNS discovery queries.
DNS query for failed services
We know that db is still considered healthy, so let’s check it first:
dig @192.168.99.100 -p 8600 db.service.consul SRV
#...
#;; ANSWER SECTION:
#db.service.consul. 0 IN SRV 1 1 0 host-2.node.dc1.consul.
#db.service.consul. 0 IN SRV 1 1 0 host-1.node.dc1.consul.
#
#;; ADDITIONAL SECTION:
#host-2.node.dc1.consul. 0 IN A 192.168.99.101
#host-1.node.dc1.consul. 0 IN A 192.168.99.102
#
#;; Query time: 43 msec
#...
As expected, the request returned two entries, one for host-1 and one for host-2. Now, let’s try to find out anything about web:
dig @192.168.99.100 -p 8600 web.service.consul SRV
#...
#;; QUESTION SECTION:
#;web.service.consul. IN SRV
#
#;; AUTHORITY SECTION:
#consul. 0 IN SOA
#
#;; Query time: 38 msec
#...
Nothing.
OK, let’s fix that. They need web – we have plenty of that:
docker@host-1:~$ docker run -d --name nginx -p 80:80 nginx
#cc4d9cc7284b083b981900a2e7e8737f6bb1647e605bab067fd9b257d76e7d40

docker@host-2:~$ docker run -d --name nginx -p 80:80 nginx
#80b516609f5730054893501bcc385f1fd4d20b7eaef36311a4ba25e9ccc08954
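Before heading back to the UI, we can double-check by hand that NGINX responds the way the HTTP check expects (a quick sanity check on one of the hosts; output abbreviated):

docker@host-1:~$ curl -I http://localhost
#HTTP/1.1 200 OK
#...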
After starting two nginx containers at port 80, health checks should calm down:
They even have HTTP output in their notes:
A DNS query for the web service will also be successful now.
Host-wide health check
We can also run a health check for the whole host. Let’s try to put the host into critical state when it thinks it’s running out of memory. For that I’ve created a small shell script, so obviously we’ll use a ‘script’ check to run it.
The script itself is relatively simple. It reads memory usage and prints the statistics to the output, so the check has something to store as a note to the current health status. It then calculates the percentage of used memory and exits with code 2, 1 or 0 for the critical, warning or healthy state, depending on whether memory usage is above 98%, above 80%, or anything lower.
#!/bin/sh
free
memusage=$(free | awk 'NR==2{printf "%d", $3*100/$2}')

echo
echo "Memory usage is roughly $memusage%"

if [ $memusage -gt "98" ]; then
  echo "Critical state"
  exit 2
elif [ $memusage -gt "80" ]; then
  echo "Warning state"
  exit 1
else
  exit 0
fi
The Consul agent config file changes just a little bit:
{
  "services": [{
    "name": "web",
    //...
  }, {
    "name": "db"
  }],
  "check": {
    "id": "Memory usage",
    "script": "/home/docker/memusage.sh",
    "interval": "15s"
  }
}
Update it, restart the agents and head back to the server UI:
Both hosts received the “Memory usage” check, which currently indicates only 6% usage. I wonder, however, what would happen if I lowered the memory threshold at e.g. host-1 from 98% down to 5%. There’s only one way to find out: edit, restart and refresh the UI.
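For reference, the edit on host-1 boils down to the first condition in memusage.sh, roughly like this:

...
if [ $memusage -gt "5" ]; then
  echo "Critical state"
  exit 2
...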
No surprise that host-1 is now in critical state:
What’s cool is that both the web and db services it hosts are also in critical state and therefore excluded from DNS queries:
dig @192.168.99.100 -p 8600 web.service.consul SRV
#...
#web.service.consul. 0 IN SRV 1 1 0 host-2.node.dc1.consul.
#...
#host-2.node.dc1.consul. 0 IN A 192.168.99.101
#...

dig @192.168.99.100 -p 8600 db.service.consul SRV
#...
#db.service.consul. 0 IN SRV 1 1 0 host-2.node.dc1.consul.
#...
#host-2.node.dc1.consul. 0 IN A 192.168.99.101
#...
Conclusion
Today we took a look at ways to check whether the services in our cluster are healthy and therefore should be discoverable by other peers. For that Consul has a variety of checks, starting from simple pings and HTTP requests and ending with something more powerful like script checks. All of those checks can be applied either to a single service or to the whole host with little to no change in their definitions.
However, these Consul health checks aren’t meant to be a replacement for proper host and application monitoring. They’re merely a way for a Consul agent to know whether the service it’s supposed to keep an eye on is still alive and operational.