Somehow I missed the news that, starting from version 1.12, Docker containers support health checks. Such checks don’t just test whether the container itself is running, but whether it is actually doing its job right. For instance, a check can ping a containerized web server to see if it responds to incoming requests, or measure memory consumption and see if it’s reasonable. Since a Docker health check is a shell command, it can test virtually anything.
When the check fails a few times in a row, the problematic container gets into an “unhealthy” state, which makes no difference in standalone mode (except for a triggered health_status event), but causes the container to be restarted in Swarm mode.
How to enable Docker health check
There are at least four places where a health check can be enabled:
- in a Dockerfile,
- in a docker run command,
- in a docker-compose or docker stack YAML file,
- and in a docker service create command.
At the bare minimum, we have to provide a shell command to execute as the health check, and it should exit with code 0 for a healthy state and 1 for an unhealthy one. Additionally, we can specify how often the check should run (--interval), how long a single check may take (--timeout), and how many failed results in a row we should get (--retries) before the container is put into the unhealthy state. All three of these options are optional.
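Since the check is just a command that has to exit with 0 or 1, it isn’t limited to HTTP pings. As a hypothetical illustration (the script name and the 90% threshold are made up for the example), a check could flag the container as unhealthy when its filesystem is running out of space:

#!/bin/sh
# healthcheck.sh - hypothetical example: report "unhealthy" when the root
# filesystem inside the container is more than 90% full.
USED=$(df / | awk 'NR==2 {gsub(/%/, "", $5); print $5}')

if [ "$USED" -ge 90 ]; then
  echo "Disk usage at ${USED}%"
  exit 1   # unhealthy
fi

exit 0     # healthy

In a Dockerfile such a script would be wired up with HEALTHCHECK CMD /healthcheck.sh, assuming it was copied into the image.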
Health check instruction in Dockerfile
Imagine we want to check whether a web server inside a container still responds to incoming requests. The Dockerfile HEALTHCHECK instruction has the following format – HEALTHCHECK [OPTIONS] CMD command. Assuming our check should happen every 5 seconds, take no longer than 10 seconds, and has to fail at least three times in a row for the container to become unhealthy, here’s how it would look:
HEALTHCHECK --interval=5s --timeout=10s --retries=3 CMD curl -sS 127.0.0.1 || exit 1
Because the health check command is going to run from inside of the container, using the 127.0.0.1 address for pinging the server is totally fine.
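Note that plain curl only exits with an error when it can’t connect at all; a 500 response would still count as healthy. If HTTP errors should also fail the check, curl’s -f flag can be added as a small variation of the command above:

# Variation of the check above: -f makes curl exit non-zero on HTTP errors
# (4xx/5xx), not just on connection failures.
curl -fsS 127.0.0.1 || exit 1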
Health check in docker-compose YAML
It actually looks pretty much the same:
...
healthcheck:
  test: curl -sS http://127.0.0.1 || exit 1
  interval: 5s
  timeout: 10s
  retries: 3
...
Health check in docker run and docker service create
Both docker run and docker service create commands share the same arguments for health checks, and they are still very similar to the ones you’d put into a Dockerfile:
docker run --health-cmd='curl -sS http://127.0.0.1 || exit 1' \
    --health-timeout=10s \
    --health-retries=3 \
    --health-interval=5s \
    ....
What’s more, both of them (as well as docker-compose YAML) can override or even disable a check previously declared in the Dockerfile, e.g. with --no-healthcheck=true.
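For instance, with docker run that could look roughly like this (a sketch; server is the image built later in this post):

# Ignore the HEALTHCHECK baked into the image entirely
docker run -d --no-healthcheck server

# ...or replace it with a different command and timings
docker run -d \
    --health-cmd='curl -sS 127.0.0.1:8080 || exit 1' \
    --health-interval=10s \
    server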
Docker health check example
The victim
I created a small node.js web server which simply responds to any request with ‘OK’. However, the server also has a switch that toggles it ON and OFF without actually shutting the server’s process down. Here’s how it looks:
"use strict";

const http = require('http');

// The actual web server: responds with 'OK' to anything on port 8080
function createServer () {
  return http.createServer(function (req, res) {
    res.writeHead(200, {'Content-Type': 'text/plain'});
    res.end('OK\n');
  }).listen(8080);
}

let server = createServer();

// The "switch" on port 8081: each request toggles the main server off/on
http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'});
  if (server) {
    server.close();
    server = null;
    res.end('Shutting down...\n');
  } else {
    server = createServer();
    res.end('Starting up...\n');
  }
}).listen(8081);
So when the server is ON, it listens on port 8080 and returns OK to any request coming to that port. Making a call to port 8081 shuts the server down, and another call enables it back again:
$ node server.js

# switch to another terminal
curl 127.0.0.1:8080
# OK
curl 127.0.0.1:8081
# Shutting down...
curl 127.0.0.1:8080
# curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused
curl 127.0.0.1:8081
# Starting up...
curl 127.0.0.1:8080
# OK
Now let’s put that server.js into a Dockerfile with a health check, build an image and start it as a container:
FROM node

COPY server.js /

EXPOSE 8080 8081

HEALTHCHECK --interval=5s --timeout=10s --retries=3 CMD curl -sS 127.0.0.1:8080 || exit 1

CMD [ "node", "/server.js" ]
$ docker build . -t server:latest
# Lots, lots of output
$ docker run -d --rm -p 8080:8080 -p 8081:8081 server
# ec36579aa452bf683cb17ee44cbab663d148f327be369821ec1df81b7a0e104b
$ curl 127.0.0.1:8080
# OK
The created container’s ID starts with ec3, which should be enough to identify it later, so now we can jump to the health checks.
Monitoring container health status
Docker’s main command for checking a container’s health is docker inspect. It produces a huge JSON response, but the only part we’re interested in is its State.Health property:
$ docker inspect ec3 | jq '.[].State.Health'
#{
#  "Status": "healthy",
#  "FailingStreak": 0,
#  "Log": [
#    {
#      "Start": "2017-06-27T04:07:03.975506353Z",
#      "End": "2017-06-27T04:07:04.070844091Z",
#      "ExitCode": 0,
#      "Output": "OK\n"
#    },
#...
#}
Not surprisingly, the current status is ‘healthy’, and we can even see the health check logs in the Log collection. However, after making a call to port 8081 and waiting for 3×5 seconds (to allow three checks to fail), the picture changes:
$ curl 127.0.0.1:8081
# Shutting down...

# 15 seconds later
$ docker inspect ec3 | jq '.[].State.Health'
#{
#  "Status": "unhealthy",
#  "FailingStreak": 4,
#  "Log": [
#    ...
#    {
#      "Start": "2017-06-27T04:16:27.668441692Z",
#      "End": "2017-06-27T04:16:27.740937964Z",
#      "ExitCode": 1,
#      "Output": "curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused\n"
#    }
#  ]
#}
I waited a little bit longer than 15 seconds, so the health check managed to fail 4 times in a row (FailingStreak). And as expected, the container’s status did change to ‘unhealthy’.
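The same status is also visible at a glance in docker ps, where the STATUS column gets a (healthy) or (unhealthy) suffix for containers that define a health check. Roughly like this (output approximate, the container name is the auto-generated one seen later in the events output):

$ docker ps --filter id=ec3 --format '{{.Names}}: {{.Status}}'
# eager_swartz: Up 10 minutes (unhealthy)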
But as soon as at least one health check succeeds, Docker puts the container back into the ‘healthy’ state:
$ curl 127.0.0.1:8081
# Starting up...
$ docker inspect ec3 | jq '.[].State.Health.Status'
# "healthy"
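And if jq isn’t around, docker inspect’s own --format templating is enough to pull out just the status field:

$ docker inspect --format '{{.State.Health.Status}}' ec3
# healthy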
Checking health status with Docker events
Along with inspecting the container state directly, we could also listen to docker events:
$ docker events --filter event=health_status
# 2017-06-27T00:23:03.691677875-04:00 container health_status: healthy ec36579aa452bf683cb17ee44cbab663d148f327be369821ec1df81b7a0e104b (image=server, name=eager_swartz)
# 2017-06-27T00:23:23.998693118-04:00 container health_status: unhealthy ec36579aa452bf683cb17ee44cbab663d148f327be369821ec1df81b7a0e104b (image=server, name=eager_swartz)
Docker events can be a little bit chatty, which is why I had to use --filter. The command itself won’t exit right away; it stays running, printing out events as they come.
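The output can be trimmed further: docker events also accepts container filters and a Go template for formatting, so something along these lines should print one compact line per status change (a sketch; eager_swartz is the auto-generated container name from the output above):

$ docker events \
    --filter event=health_status \
    --filter container=eager_swartz \
    --format '{{.Time}} {{.Actor.Attributes.name}} {{.Action}}'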
Health status and Swarm services
In order to see how health checks affect Swarm services, I temporarily switched my local Docker instance to Swarm mode with docker swarm init, and now I can do the following:
$ docker service create -p 8080:8080 -p 8081:8081 \
    --name server \
    --health-cmd='curl -sS 127.0.0.1:8080' \
    --health-retries=3 \
    --health-interval=5s \
    server
# unable to pin image server to digest: errors:
# denied: requested access to the resource is denied
# unauthorized: authentication required
# ohkvwbsk06vkjyx69434ndqij
This puts a new service into the Swarm using the locally built server image. Docker wasn’t really happy with the fact that the image is local and returned a bunch of errors, but eventually it did return the ID of the newly created service:
$ docker service ls
# ID            NAME    MODE        REPLICAS  IMAGE
# ohkvwbsk06vk  server  replicated  1/1       server
curl 127.0.0.1:8080 will work again, and sending a request to port 8081 will, as usual, shut the server down. However, this time, after a short while, port 8080 will start working again without explicitly re-enabling the server. The thing is, as soon as the Swarm manager noticed that the container had become unhealthy, and therefore the whole service was no longer meeting its desired state (‘running’), it shut the container down completely and started a new one. We can actually see the traces of that by examining the tasks collection for our server service:
$ docker service ps server
# ID            NAME         IMAGE   NODE  DESIRED STATE  CURRENT STATE              ERROR                             PORTS
# mt67hkhp7ycr  server.1     server  moby  Running        Running 50 seconds ago
# pj77brhfhsjm  \_ server.1  server  moby  Shutdown       Failed about a minute ago  "task: non-zero exit (137): do…"
As a little backstory, every single Swarm container has a task assigned to it. When a container dies, the corresponding task gets shut down as well, and Swarm creates a new task/container pair. docker service ps displays the whole chain of task deaths and resurrections for a given service and its containers. In our particular case, server‘s initial task with id pj77brhfhsjm is marked as failed, and docker inspect even says why:
$ docker inspect pj77 | jq '.[].Status.Err'
# "task: non-zero exit (137): dockerexec: unhealthy container"
“Unhealthy container”, that’s why. But the bottom line is that the service as a whole automatically recovered from the unhealthy state with barely noticeable downtime.
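As a side note, the check itself isn’t frozen once the service exists: docker service update accepts the same family of health flags, so tuning it later should look roughly like this (a sketch):

$ docker service update \
    --health-interval=10s \
    --health-retries=5 \
    server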
Summary
Docker health checks are a cute little feature that allows attaching a shell command to a container and using it to check whether the container’s content is alive enough. For containers in the Docker engine’s standalone mode it just adds a healthy/unhealthy attribute (plus a health_status Docker event when the value changes), but in Swarm mode it will actually shut down the faulty container and create a new one with very little downtime.