Processing logs with Logstash

Last time we talked about Elasticsearch – a hybrid of a NoSQL database and a search engine. Today we’ll continue with Elastic’s ELK stack and take a look at a tool called Logstash.

What’s Logstash

Logstash is a data processing pipeline that takes raw data (e.g. logs) from one or more inputs, processes and enriches it with filters, and then writes the results to one or more outputs. Elastic recommends writing the output to Elasticsearch, but in fact it can write to almost anything: STDOUT, WebSockets, a message queue... you name it.

Logstash diagram
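A config file mirrors that pipeline, with one section per stage, roughly:

input {
  # where the raw events come from: files, stdin, syslog, beats, ...
}

filter {
  # how each event gets parsed and enriched on the way through
}

output {
  # where the results end up: Elasticsearch, stdout, a message queue, ...
}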

Installation

With Java installed, you can simply download the Logstash archive, unpack it and launch bin/logstash -f logstash.conf. You will, however, need to provide a configuration file. But for “hello world” apps something like input { stdin {} } output { stdout {} } will do. Using a Docker container is also an option, and it’s the one I like the most:
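One way to do that with the official image (the image tag and the pipeline directory below are the ones used by the 5.x images, so adjust them for your version):

docker run -it --rm \
  -v "$PWD/logstash.conf":/usr/share/logstash/pipeline/logstash.conf \
  docker.elastic.co/logstash/logstash:5.2.0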

Configuration, codecs and “Hello World!”

Ok, we’ve installed it, now what? As I mentioned earlier, Logstash requires some configuration before it can start, but there’s a simple config we can try right away:
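Using just the stdin and stdout plugins mentioned above:

input {
  stdin {}
}

output {
  stdout {}
}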

The config basically says: take whatever the user types, do your default magic and write the result to the console.

Hello world

Fair enough. I’ve got some Apache2 logs locally, so let’s start Logstash with the simplest configuration possible and feed some of those logs to it:

Well, that wasn’t particularly impressive. Logstash started, I fed it a line (4) of the Apache2 access log, and it reported the same line back (5) with the timestamp 2017-02-13T05:39:12.684Z and the host name 269a27a16415 in front of it.

Codecs

At least we can make the output a little more readable. Both inputs and outputs can accept codecs – data stream formatters. There are all sorts of them for compression, JSON parsing, etc. In our case rubydebug will add indentation to the output and highlight how exactly Logstash ‘sees’ the data it deals with.

The configuration requires a tiny change, but it’ll make the output much more readable:
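All it takes is pointing the stdout output at the rubydebug codec:

output {
  stdout {
    codec => rubydebug
  }
}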

Writing output to Elasticsearch

Before we move any further, let’s try one more thing. I mentioned before that there can be more than one output, and that we can write data directly to Elasticsearch. It’s actually very easy to prove both points in one go.

I have an Elasticsearch node running at 172.19.0.2, port 9200, and connecting Logstash to it requires changing the config just a little bit:
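A sketch of that change, keeping the console output in place and adding an elasticsearch output next to it:

output {
  stdout {
    codec => rubydebug
  }

  elasticsearch {
    hosts => ["172.19.0.2:9200"]
  }
}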

Now let’s restart the Logstash instance and feed the Apache2 log line to it one more time, but this time let’s check what’s happening on the Elasticsearch node:
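One quick way to peek at the node is to list its indices (assuming curl and the node address from above):

curl '172.19.0.2:9200/_cat/indices?v'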

Well, someone has created logstash-2017.02.12. If we run a search query against this index, we get pretty much the same data as we saw in the console output before.
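For example, something as simple as this would do:

curl '172.19.0.2:9200/logstash-2017.02.12/_search?pretty'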

Processing data with filters

So far Logstash has been able to route log data from one point to another, but it has done absolutely nothing to make that data more usable. It’s time to change that. A Logstash config can have one more section – filters – and this is where all of its magic happens.

It looks like there are filters for everything: removing sensitive data from logs, aggregating numeric metrics, performing DNS lookups, adding and removing fields, parsing user agents, and so forth. In our case we need to parse an unstructured log string into a set of separate fields: IP address, URL, user agent, etc. The filter that can do this kind of job is called grok.

Grok filter

Grok is the main way Logstash gives structure to arbitrary text. It comes with a set of patterns, which you combine into a string that matches the lines you need to parse. For instance, suppose my logs are going to look like this:
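Say, a sequence number, a client IP and a target path (the exact line here is just an illustration):

42 172.17.0.1 /index.html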

I’d come up with the following pattern string to parse them:
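A sketch of such a string, built from grok’s stock NUMBER, IP and URIPATHPARAM patterns, with our field names after the colons:

%{NUMBER:sequence} %{IP:client} %{URIPATHPARAM:target}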

As log lines go through grok, they are converted to objects with three properties: sequence, client and target – exactly as we defined them in the pattern string.

There’s also a set of predefined patterns that match well-known log formats. For example, the COMBINEDAPACHELOG pattern matches, you guessed it, Apache2 logs. Let’s give it a try.

Config:
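The input and output sections stay the same; the new bit is the filter:

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}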

And the output:

Result!

grok settings

But we can do better than this. Grok has a number of other settings, including remove_field. As you probably noticed, the message property of the grokked output still contains the complete log line, which at this point is redundant. We can safely remove it with the following change to the config:
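Something along these lines:

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
    remove_field => ["message"]
  }
}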

Now that’s much better. But we can still do better.

geoip filter

As the name suggests, the geoip filter maps an IP address to world map coordinates and location details such as the city and country. I’ve changed one of the log lines to contain my external IP address instead of 172.17.0.1, changed the config one more time…
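The geoip filter goes right after grok and picks up the field to look at via its source setting:

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
    remove_field => ["message"]
  }

  geoip {
    source => "clientip"
  }
}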

…and just take a look at that beauty:

Isn’t that amazing? geoip picked up the clientip property produced by grok and added a few more properties of its own. And I didn’t even have to download any external data for that to work!

And now that Logstash has enriched the data, we can go to Elasticsearch and learn from it. For instance, did we have visitors from Oakville?
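A sketch of such a query, assuming the default field names the geoip filter produces (geoip.city_name in particular):

curl '172.19.0.2:9200/logstash-*/_search?pretty' -d '
{
  "query": {
    "match": { "geoip.city_name": "Oakville" }
  }
}'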

Yes, we did. How many HTTP errors did we have yesterday? Were there any attempts to access forbidden URLs? How many? From where? Now we’re free to ask anything.

Summary

Logstash is the kind of tool that looks much more interesting after you’ve dealt with it than after reading its description. “Logstash is a server-side data processing pipeline…” Yeah, whatever. But once you try putting those small bricks of inputs, outputs and filters together, suddenly it all makes sense, and now I want to try different combinations of settings to see what else I can get from my data.

Logstash works well with Elasticsearch, but both inputs and outputs can read from and write to a huge variety of sources, from message queues to raw TCP sockets.

But there’s still one piece missing from the picture – the UI. After all, analyzing logs by running command line queries isn’t that productive. We’ll solve that next time by taking a look at Kibana.
