Distributed apps introduce a challenge that we could usually avoid in monolithic ones: how do we tell that the app is performing well? I’m not talking about it being user-friendly or providing business value. How do you tell that the components of your distributed app are actually running? Which services are overutilized? Underutilized? Running out of disk space?
There are tools to answer those questions, and collectd is one of them.
What is collectd
collectd is a lightweight daemon that collects time series of monitoring data from wherever it can (CPU, disks, memory, various sensors, OS counters) and writes it wherever it is told to. It works on *nix systems only, but it has an evil (commercial) twin brother for Windows: ssc-serv. Unlike some other tools that try to be as smart as possible, collectd relies on its plugins to do all the work.
collectd plugins
There’re three main kinds of plugins:
There are also a few other types of plugins that are kind of different from the three above, but don’t really deserve a category of their own. For instance, there’s the network plugin, which allows connecting several collectd services together. Or there are the exec and python plugins, which make it possible to write shell or Python scripts that provide your own metrics (e.g. collect the number of Google Compute Engine instances in a certain zone over time via the gcloud utility and bash). Finally, there are a few notification plugins, like notify_email.
But the most important plugins are the ones that read the metrics and the ones that store them.
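To make the exec plugin idea a bit more concrete, here is a minimal sketch of a script it could launch. It assumes collectd's plain-text protocol (one PUTVAL line per sample on stdout) and the COLLECTD_HOSTNAME and COLLECTD_INTERVAL variables that the exec plugin exports to the script. The plugin name exec_demo, the instance name root_free and the probe itself are made up for illustration; a real script would read whatever you actually care about.

```python
#!/usr/bin/env python3
"""Sketch of a script for collectd's exec plugin. Names and the probe are hypothetical."""
import os
import shutil
import time


def format_putval(host, plugin, type_name, type_instance, value, interval):
    """Build one PUTVAL line for collectd's plain-text protocol."""
    identifier = f"{host}/{plugin}/{type_name}-{type_instance}"
    # "N" asks collectd to stamp the sample with the current time.
    return f'PUTVAL "{identifier}" interval={interval} N:{value}'


def read_metric():
    """Hypothetical probe: free bytes on the root filesystem."""
    return shutil.disk_usage("/").free


def main(run_forever):
    # Fall back to defaults when the script is started by hand.
    host = os.environ.get("COLLECTD_HOSTNAME", "localhost")
    interval = float(os.environ.get("COLLECTD_INTERVAL", "10"))
    while True:
        line = format_putval(host, "exec_demo", "gauge", "root_free", read_metric(), interval)
        print(line, flush=True)  # collectd reads stdout, so flush every line
        if not run_forever:
            break
        time.sleep(interval)


if __name__ == "__main__":
    # Loop only when launched by collectd; otherwise print one sample and exit.
    main(run_forever="COLLECTD_INTERVAL" in os.environ)
```

Run by hand, it prints a single PUTVAL line, which is a cheap way to check the output format before pointing the exec plugin at the script.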
Plugins for getting metrics
Whatever you can imagine as a metrics source probably has a plugin for it. There are CPU, memory, disk, Apache/nginx, temperature, network interface, ping, processes, sensors, serial, SMART: a legion of them. Though many of them are disabled by default, it’s unbelievably easy to turn them on or off.
Open the collectd.conf file (/etc/collectd/collectd.conf), and uncommenting a plugin and restarting the collectd service is usually enough.
    #LoadPlugin dns
    #<Plugin dns>
    #	Interface "eth0"
    #	IgnoreSource "192.168.0.1"
    #	SelectNumericQueryTypes false
    #</Plugin>

    LoadPlugin syslog

    <Plugin syslog>
    	LogLevel info
    </Plugin>
Plugins for writing metrics
Those plugins are destinations for the data that collectd has harvested, and some of them make me extremely happy. You can write your data to CSV files, RRDtool, AMQP-compatible message queues (e.g. RabbitMQ), Kafka, Carbon and Graphite, HTTP, MongoDB, Redis and a few others.
Collecting data from multiple hosts
If we’re talking about a distributed app, there will be multiple hosts with multiple collectd instances on them. Making those hosts store their own metrics makes little to no sense. What does make sense is collecting that data in some common location (and putting a large dashboard on top of it). There are a few options to get that done:
- Use the network plugin, which allows one collectd daemon to be a metrics source for others.
- Publish data to an AMQP message broker, like RabbitMQ or ActiveMQ. Not only can collectd write to a message queue, it can also read from one, so multiple daemons can publish their data to a dedicated subscriber.
- Publish data to Carbon (more on that later).
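For the Carbon option, it may help to see how little goes over the wire. Carbon's plaintext protocol is one line per sample: metric path, value, Unix timestamp. The sketch below only illustrates that format; it is not how collectd's own plugin is implemented. The metric path and the host are assumptions, and 2003 is Carbon's default plaintext port.

```python
import socket
import time


def carbon_line(path, value, timestamp=None):
    """Format one sample as 'path value timestamp' followed by a newline."""
    if timestamp is None:
        timestamp = time.time()
    # Carbon expects whole seconds since the epoch.
    return f"{path} {value} {int(timestamp)}\n"


def send_to_carbon(lines, host="127.0.0.1", port=2003):
    """Send pre-formatted lines over TCP (host is an assumption, adjust to your setup)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Illustrative metric path; real paths depend on how the writer is configured.
print(carbon_line("collectd.myhost.memory.memory-free", 398675968, 1482105608.337), end="")
```

Because every host writes to the same listener and the host name is part of the metric path, the data ends up grouped per host without any extra work on the collecting side.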
Rendering graphs
Having text and binary files with tabular data is nice, but not much use on its own. I can’t tell by looking at a column of CPU readings whether a host is feeling good or bad. Looking at a graph would help, but collectd doesn’t render. After all, it doesn’t have the word ‘render’ in it; it’s strictly ‘collect’. But there are ways to get the charts. Many ways.
- There are the collection3 and collectd-web front-ends for rendering the RRD files that collectd creates. Personally, I look suspiciously at everything that has a cgi-bin folder in it, but if it gets the job done, why not.
- Any self-respecting spreadsheet can import CSV and add some graphs to it.
    $ head ~/collectd/csv/myhost/memory/memory-free-2016-12-19
    epoch,value
    1482105608.337,398675968.000000
    1482105618.337,398028800.000000
    1482105628.338,398028800.000000
- Use RRD (Round-Robin Database) files and rrdtool. RRDtool deserves its own blog post, but the main thing is that collectd’s default output is an RRD file, and rrdtool (which collectd comes with, if installed via a package manager) is capable of making nice charts. It’s a command-line utility with somewhat cryptic arguments, but those arguments control every aspect of a graph.
- Carbon and Graphite-web are a time-series data listener and its web UI for rendering. collectd has a plugin for writing metrics to the Carbon service, which results in nicely organized data sources and graphs for them.
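If a spreadsheet feels like too much ceremony, the CSV output is also easy to chew through in a few lines of code. Here is a minimal sketch that parses the memory-free sample from the CSV option above, assuming the two-column epoch,value layout shown there. The function name is mine, and the numbers are the ones from that sample.

```python
import csv
import io
from datetime import datetime, timezone


def read_collectd_csv(text):
    """Parse collectd CSV text with an 'epoch,value' header into (datetime, float) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        (datetime.fromtimestamp(float(row["epoch"]), tz=timezone.utc), float(row["value"]))
        for row in reader
    ]


# The three readings from the sample file above.
sample = """epoch,value
1482105608.337,398675968.000000
1482105618.337,398028800.000000
1482105628.338,398028800.000000
"""

readings = read_collectd_csv(sample)
free_mib = [value / 2**20 for _, value in readings]
print(f"{len(readings)} samples, lowest free memory: {min(free_mib):.1f} MiB")
```

From here the pairs can go to any plotting library you like; to read a real file, pass the contents of something like the memory-free file above to the same function.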
Installation (instead of conclusion)
Yes, I forgot about installation! Well, that’s easy: if you want to build collectd from sources, here’s the guide. On Ubuntu or Debian, apt-get install collectd will do, and there’s probably something for CentOS and others. It starts automatically and has a reasonable number of enabled plugins, so nothing stops you from trying it right away, even in a Docker container. Getting the graphs can be a little tricky in the beginning, as rrdtool has a moderately steep learning curve and even the web UIs are sometimes tricky to install, but it’s like a bicycle: once you get it, it goes straight to muscle memory and becomes impossible to forget.