Somehow I missed the news that, starting from version 1.12, Docker containers support health checks. Such checks don’t just test whether the container itself is running, but whether it is actually doing its job right. For instance, a check can ping a containerized web server to see if it responds to incoming requests, or measure memory consumption and see if it’s reasonable. Since a Docker health check is a shell command, it can test virtually anything.
When the check fails a few times in a row, the problematic container gets into an “unhealthy” state, which makes no difference in standalone mode (except for a triggered health_status event), but causes the container to be restarted in Swarm mode.
How to enable Docker health check
There are at least four places where a health check can be enabled:
- in a Dockerfile,
- in a docker run command,
- in a docker-compose or docker stack YAML file,
- and in a docker service create command.
At the bare minimum, we have to provide a shell command to execute as the health check, and it should exit with code 0 for a healthy state and 1 for an unhealthy one. Additionally, we can specify how often the check should run (--interval), how long a single check may take (--timeout), and how many failed results in a row we should get (--retries) before the container is put into the unhealthy state. All three of these options are optional.
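Since the check is just a command that has to exit with 0 or 1, it isn’t limited to HTTP pings. As a hypothetical illustration (the script name and the 90% threshold are made up for the example), a check could flag the container as unhealthy when its filesystem is running out of space:

#!/bin/sh
# healthcheck.sh - hypothetical example: report "unhealthy" when the root
# filesystem inside the container is more than 90% full.
USED=$(df / | awk 'NR==2 {gsub(/%/, "", $5); print $5}')

if [ "$USED" -ge 90 ]; then
  echo "Disk usage at ${USED}%"
  exit 1   # unhealthy
fi

exit 0     # healthy

In a Dockerfile such a script would be wired up with HEALTHCHECK CMD /healthcheck.sh, assuming it was copied into the image.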
Health check instruction in Dockerfile
Imagine we want to check whether a web server inside a container still responds to incoming requests. The Dockerfile HEALTHCHECK instruction has the following format – HEALTHCHECK [OPTIONS] CMD command. Assuming our check should happen every 5 seconds, take no longer than 10 seconds, and has to fail at least three times in a row for the container to become unhealthy, here’s how it would look:
HEALTHCHECK --interval=5s --timeout=10s --retries=3 CMD curl -sS 127.0.0.1 || exit 1
Because the health check command is going to run from inside of the container, using the 127.0.0.1 address for pinging the server is totally fine.
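Note that plain curl only exits with an error when it can’t connect at all; a 500 response would still count as healthy. If HTTP errors should also fail the check, curl’s -f flag can be added as a small variation of the command above:

# Variation of the check above: -f makes curl exit non-zero on HTTP errors
# (4xx/5xx), not just on connection failures.
curl -fsS 127.0.0.1 || exit 1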
Health check in docker-compose YAML
It actually looks pretty much the same:
...
healthcheck:
  test: curl -sS http://127.0.0.1 || exit 1
  interval: 5s
  timeout: 10s
  retries: 3
...
Health check in docker run and docker service create
Both docker run and docker service create commands share the same arguments for health checks, and they are still very similar to the ones you’d put into a Dockerfile:
docker run --health-cmd='curl -sS http://127.0.0.1 || exit 1' \
    --health-timeout=10s \
    --health-retries=3 \
    --health-interval=5s \
    ....
What’s more, both of them (as well as docker-compose YAML) can override or even disable a check previously declared in the Dockerfile, e.g. with --no-healthcheck=true.
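For instance, with docker run that could look roughly like this (a sketch; server is the image built later in this post):

# Ignore the HEALTHCHECK baked into the image entirely
docker run -d --no-healthcheck server

# ...or replace it with a different command and timings
docker run -d \
    --health-cmd='curl -sS 127.0.0.1:8080 || exit 1' \
    --health-interval=10s \
    server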
Docker health check example
The victim
I created a small node.js web server which simply responds to any request with ‘OK’. However, the server also has a switch that toggles it ON and OFF without actually shutting the server’s process down. Here’s how it looks:
"use strict";

const http = require('http');

// The actual web server: responds with 'OK' to anything on port 8080
function createServer () {
  return http.createServer(function (req, res) {
    res.writeHead(200, {'Content-Type': 'text/plain'});
    res.end('OK\n');
  }).listen(8080);
}

let server = createServer();

// The "switch" on port 8081: each request toggles the main server off/on
http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'});
  if (server) {
    server.close();
    server = null;
    res.end('Shutting down...\n');
  } else {
    server = createServer();
    res.end('Starting up...\n');
  }
}).listen(8081);
So when the server is ON, it listens on port 8080 and returns OK to any request coming to that port. Making a call to port 8081 shuts the server down, and another call enables it back again:
$ node server.js

# switch to another terminal
curl 127.0.0.1:8080
# OK
curl 127.0.0.1:8081
# Shutting down...
curl 127.0.0.1:8080
# curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused
curl 127.0.0.1:8081
# Starting up...
curl 127.0.0.1:8080
# OK
Now let’s put that server.js into a Dockerfile with a health check, build an image and start it as a container:
FROM node

COPY server.js /

EXPOSE 8080 8081

HEALTHCHECK --interval=5s --timeout=10s --retries=3 CMD curl -sS 127.0.0.1:8080 || exit 1

CMD [ "node", "/server.js" ]
$ docker build . -t server:latest
# Lots, lots of output
$ docker run -d --rm -p 8080:8080 -p 8081:8081 server
# ec36579aa452bf683cb17ee44cbab663d148f327be369821ec1df81b7a0e104b
$ curl 127.0.0.1:8080
# OK
The created container’s ID starts with ec3, which should be enough to identify it later, so now we can jump to the health checks.
Monitoring container health status
Docker’s main command for checking a container’s health is docker inspect. It produces a huge JSON response, but the only part we’re interested in is its State.Health property:
$ docker inspect ec3 | jq '.[].State.Health'
#{
#  "Status": "healthy",
#  "FailingStreak": 0,
#  "Log": [
#    {
#      "Start": "2017-06-27T04:07:03.975506353Z",
#      "End": "2017-06-27T04:07:04.070844091Z",
#      "ExitCode": 0,
#      "Output": "OK\n"
#    },
#...
#}
Not surprisingly, the current status is ‘healthy’, and we can even see the health check logs in the Log collection. However, after making a call to port 8081 and waiting for 3×5 seconds (to allow three checks to fail), the picture changes:
$ curl 127.0.0.1:8081
# Shutting down...

# 15 seconds later
$ docker inspect ec3 | jq '.[].State.Health'
#{
#  "Status": "unhealthy",
#  "FailingStreak": 4,
#  "Log": [
#    ...
#    {
#      "Start": "2017-06-27T04:16:27.668441692Z",
#      "End": "2017-06-27T04:16:27.740937964Z",
#      "ExitCode": 1,
#      "Output": "curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused\n"
#    }
#  ]
#}
I waited a little bit longer than 15 seconds, so the health check managed to fail 4 times in a row (FailingStreak). And as expected, the container’s status did change to ‘unhealthy’.
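The same status is also visible at a glance in docker ps, where the STATUS column gets a (healthy) or (unhealthy) suffix for containers that define a health check. Roughly like this (output approximate, the container name is the auto-generated one seen later in the events output):

$ docker ps --filter id=ec3 --format '{{.Names}}: {{.Status}}'
# eager_swartz: Up 10 minutes (unhealthy)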
But as soon as at least one health check succeeds, Docker puts the container back into the ‘healthy’ state:
$ curl 127.0.0.1:8081
# Starting up...
$ docker inspect ec3 | jq '.[].State.Health.Status'
# "healthy"
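And if jq isn’t around, docker inspect’s own --format templating is enough to pull out just the status field:

$ docker inspect --format '{{.State.Health.Status}}' ec3
# healthy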
Checking health status with Docker events
Along with inspecting the container state directly, we could also listen to docker events:
$ docker events --filter event=health_status
# 2017-06-27T00:23:03.691677875-04:00 container health_status: healthy ec36579aa452bf683cb17ee44cbab663d148f327be369821ec1df81b7a0e104b (image=server, name=eager_swartz)
# 2017-06-27T00:23:23.998693118-04:00 container health_status: unhealthy ec36579aa452bf683cb17ee44cbab663d148f327be369821ec1df81b7a0e104b (image=server, name=eager_swartz)
Docker events can be a little bit chatty, which is why I had to use --filter. The command itself won’t exit right away; it stays running, printing out events as they come.
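The output can be trimmed further: docker events also accepts container filters and a Go template for formatting, so something along these lines should print one compact line per status change (a sketch; eager_swartz is the auto-generated container name from the output above):

$ docker events \
    --filter event=health_status \
    --filter container=eager_swartz \
    --format '{{.Time}} {{.Actor.Attributes.name}} {{.Action}}'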
Health status and Swarm services
In order to see how health checks affect Swarm services, I temporarily switched my local Docker instance to Swarm mode with docker swarm init, and now I can do the following:
$ docker service create -p 8080:8080 -p 8081:8081 \
    --name server \
    --health-cmd='curl -sS 127.0.0.1:8080' \
    --health-retries=3 \
    --health-interval=5s \
    server
# unable to pin image server to digest: errors:
# denied: requested access to the resource is denied
# unauthorized: authentication required
# ohkvwbsk06vkjyx69434ndqij
This puts a new service into the Swarm using the locally built server image. Docker wasn’t really happy with the fact that the image is local and returned a bunch of errors, but eventually it did return the ID of the newly created service:
$ docker service ls
# ID            NAME    MODE        REPLICAS  IMAGE
# ohkvwbsk06vk  server  replicated  1/1       server
curl 127.0.0.1:8080 will work again, and sending a request to port 8081 will, as usual, shut the server down. However, this time, after a short while, port 8080 will start working again without explicitly re-enabling the server. The thing is, as soon as the Swarm manager noticed that the container had become unhealthy, and therefore the whole service was no longer meeting its desired state (‘running’), it shut the container down completely and started a new one. We can actually see the traces of that by examining the tasks collection for our server service:
$ docker service ps server
# ID            NAME         IMAGE   NODE  DESIRED STATE  CURRENT STATE              ERROR                             PORTS
# mt67hkhp7ycr  server.1     server  moby  Running        Running 50 seconds ago
# pj77brhfhsjm  \_ server.1  server  moby  Shutdown       Failed about a minute ago  "task: non-zero exit (137): do…"
As a little backstory, every single Swarm container has a task assigned to it. When a container dies, the corresponding task gets shut down as well, and Swarm creates a new task/container pair. docker service ps displays the whole chain of task deaths and resurrections for a given service and its containers. In our particular case, server‘s initial task with id pj77brhfhsjm is marked as failed, and docker inspect even says why:
$ docker inspect pj77 | jq '.[].Status.Err'
# "task: non-zero exit (137): dockerexec: unhealthy container"
“Unhealthy container”, that’s why. But the bottom line is that the service as a whole automatically recovered from the unhealthy state with barely noticeable downtime.
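As a side note, the check itself isn’t frozen once the service exists: docker service update accepts the same family of health flags, so tuning it later should look roughly like this (a sketch):

$ docker service update \
    --health-interval=10s \
    --health-retries=5 \
    server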
Summary
Docker health checks are a cute little feature that allows attaching a shell command to a container and using it to check whether the container’s content is alive enough. For containers in the Docker engine’s standalone mode it just adds a healthy/unhealthy attribute (plus a health_status Docker event when the value changes), but in Swarm mode it will actually shut down the faulty container and create a new one with very little downtime.