Last time we talked about Elasticsearch – a hybrid of a NoSQL database and a search engine. Today we’ll continue with Elastic’s ELK stack and take a look at a tool called Logstash.
What’s Logstash
Logstash is a data processing pipeline that takes raw data (e.g. logs) from one or more inputs, processes and enriches it with filters, and then writes the results to one or more outputs. Elastic recommends writing the output to Elasticsearch, but in fact it can write to anything: STDOUT, a WebSocket, a message queue… you name it.
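To make that concrete, here’s a minimal sketch of what a pipeline config looks like (the file path and the Elasticsearch host below are made-up placeholders, and the filter section is optional):

input {
  file { path => "/var/log/apache2/access.log" }
}
filter {
  # parse and enrich events here
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
  stdout { }
}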
Installation
With Java installed, you can simply download the Logstash archive, unpack it, and launch bin/logstash -f logstash.conf. You’ll need, however, to provide a configuration file. But for “hello world” apps something like input { stdin {} } output { stdout {} } will do. Using a Docker container is also an option, and it’s the one I like the most:
docker run -it logstash -e 'input { stdin { } } output { stdout { } }'
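If you’d rather keep the configuration in a file, one possible invocation (the paths here are arbitrary, adjust them to your setup) is to mount it into the container and point Logstash at it with -f:

docker run -it -v "$PWD/logstash.conf":/config/logstash.conf logstash -f /config/logstash.conf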
Configuration, codecs and “Hello World!”
OK, we’ve installed it, now what? As I mentioned earlier, Logstash needs at least some configuration before it can start, but there’s a simple config we can try right away:
input {
  stdin { }
}

output {
  stdout { }
}
The config basically says: take whatever the user types, do your default magic, and write the result to the console.
Hello world
Fair enough. I’ve got some Apache2 logs locally, so let’s start Logstash with the simplest configuration possible and feed some of those logs to it:
$ bin/logstash -e 'input { stdin { } } output { stdout {} }'
#....
#05:38:59.948 [Api Webserver] INFO logstash.agent - Successfully started Logstash API endpoint {:port=>9600}
$ 172.17.0.1 - - [11/Feb/2017:04:41:22 +0000] "GET / HTTP/1.1" 200 3525 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
#2017-02-13T05:39:12.684Z 269a27a16415 172.17.0.1 - - [11/Feb/2017:04:41:22 +0000] "GET / HTTP/1.1" 200 3525 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
Well, that wasn’t quite impressive. Logstash started, I fed it a line of an Apache2 access log, and it reported the same line back, with the timestamp 2017-02-13T05:39:12.684Z and the host name 269a27a16415 in front of it.
Codecs
We can at least make the output a little bit more readable. Both inputs and outputs can accept codecs – data stream formatters. There are all sorts of them: for zipping, JSON parsing, etc. In our case the rubydebug codec adds indentation to the output and highlights how exactly Logstash ‘sees’ the data it deals with.
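For instance, a json codec on the stdin input (a hypothetical snippet, not part of our setup) would make Logstash parse each incoming line as a JSON document instead of treating it as plain text:

input {
  stdin { codec => json }
}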
The configuration requires a tiny change, but it’ll make the output much more readable:
...
output {
  stdout {
    codec => rubydebug
  }
}
$ bin/logstash -e 'input { stdin { } } output { stdout {codec => rubydebug} }'
#....
$ 172.17.0.1 - - [11/Feb/2017:04:41:22 +0000] "GET / HTTP/1.1" 200 3525 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
# {
#     "@timestamp" => 2017-02-12T05:24:21.270Z,
#       "@version" => "1",
#           "host" => "31190306c1eb",
#        "message" => "172.17.0.1 - - [11/Feb/2017:04:41:22 +0000] \"GET / HTTP/1.1\" 200 3525 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36\""
#}
Writing output to Elasticsearch
Before we move any further, let’s try one more thing. I mentioned before that there can be more than one output, and that we can write data directly to Elasticsearch. It’s actually very easy to prove both points in one go.
I have an Elasticsearch node running at 172.19.0.2, port 9200, and connecting Logstash to it requires changing the config just a little bit:
...
output {
  stdout {
    codec => rubydebug
  }
  elasticsearch {
    hosts => ["172.19.0.2:9200"]
  }
}
Now let’s restart the Logstash instance and feed the Apache2 log line to it one more time, but this time let’s check what’s happening on the Elasticsearch node:
$ curl 127.0.0.1:9200/_cat/indices
#yellow open logstash-2017.02.12 rgQub7hsS0qq-FBj3HA2Rg 5 1 5 0 20.1kb 20.1kb
Well, someone created a logstash-2017.02.12 index. If we run a search query against it, we’ll get pretty much the same data as we saw in the console output before.
$ curl 127.0.0.1:9200/logstash-2017.02.12/_search?pretty
#{
#...
#  "hits" : [
#    {
#      "_index" : "logstash-2017.02.12",
#      "_type" : "logs",
#...
#      "_source" : {
#        "@timestamp" : "2017-02-12T05:24:21.272Z",
#        "@version" : "1",
#        "host" : "31190306c1eb",
#        "message" : "172.17.0.1 - - [11/Feb/2017:04:41:22 +0000] \"GET /icons/ubuntu-logo.png HTTP/1.1\" 200 3623 \"http://localhost/\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36\""
#...
#}
Processing data with filters
So far Logstash has been able to route log data from one point to another, but it did absolutely nothing to make that data more usable. It’s time to change that. A Logstash config can have one more section – filters – and this is where all of its magic happens.
It looks like there are filters for everything: removing sensitive data from logs, aggregating numeric metrics, performing DNS lookups, adding and removing fields, parsing user agents, and so forth. In our case we need to parse an unstructured log string into a set of separate fields: IP address, URL, user agent, etc. The filter that can do this kind of job is called grok.
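As a taste of how simple the other filters can be, here’s a hypothetical mutate snippet (the field names are made up) that adds one field and strips another:

filter {
  mutate {
    add_field => { "environment" => "staging" }
    remove_field => [ "password" ]
  }
}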
Grok filter
Grok is the main way Logstash gives structure to arbitrary text. It comes with a set of patterns, which you combine into a string that matches the lines you need to parse. For instance, if my logs are going to look like this:
0 127.0.0.1 /default.html
1 172.0.0.9 /
I’d come up with the following pattern string to parse them:
%{NUMBER:sequence} %{IP:client} %{URIPATHPARAM:target}
As log lines go through grok, they get converted to objects with three properties: sequence, client and target – exactly as we defined in the pattern string.
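Wired into a config, that pattern string would go into a grok filter’s match setting, something like this sketch (assuming the lines arrive in the default message field):

filter {
  grok {
    match => { "message" => "%{NUMBER:sequence} %{IP:client} %{URIPATHPARAM:target}" }
  }
}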
There’s also a set of predefined patterns that match well-known log formats. For example, the COMBINEDAPACHELOG pattern matches, you guessed it, Apache2 logs. Let’s give it a try.
Config:
input {
  stdin {}
}

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  stdout {
    codec => rubydebug
  }
  elasticsearch {
    hosts => ["172.19.0.2:9200"]
  }
}
And the output:
{
        "request" => "/",
          "agent" => "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36\"",
           "auth" => "-",
          "ident" => "-",
           "verb" => "GET",
        "message" => "172.17.0.1 - - [11/Feb/2017:04:41:22 +0000] \"GET / HTTP/1.1\" 200 3525 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36\"",
       "referrer" => "\"-\"",
     "@timestamp" => 2017-02-12T05:41:18.687Z,
       "response" => "200",
          "bytes" => "3525",
       "clientip" => "172.17.0.1",
       "@version" => "1",
           "host" => "31190306c1eb",
    "httpversion" => "1.1",
      "timestamp" => "11/Feb/2017:04:41:22 +0000"
}
Result!
grok settings
But we can do better than this. Grok has a number of other settings, including remove_field. As you probably noticed, the message property of the grokked output still contains the complete log line, which at this point is redundant. We can safely remove it with the following change to the config:
...
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
    remove_field => [ "message" ]
  }
}
...
{
        "request" => "/icons/ubuntu-logo.png",
          "agent" => "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36\"",
           "auth" => "-",
          "ident" => "-",
           "verb" => "GET",
       "referrer" => "\"http://localhost/\"",
     "@timestamp" => 2017-02-12T05:42:55.999Z,
       "response" => "200",
          "bytes" => "3623",
       "clientip" => "172.17.0.1",
       "@version" => "1",
           "host" => "31190306c1eb",
    "httpversion" => "1.1",
      "timestamp" => "11/Feb/2017:04:41:22 +0000"
}
Now it’s much better. But we still can do better than this.
geoip filter
As the name suggests, the geoip filter maps an IP address to world map coordinates and an approximate physical location (city, region, postal code). I’ve changed one of the log lines to use my external IP address instead of 172.17.0.1, changed the config one more time…
...
filter {
  grok {
    ...
  }
  geoip {
    source => "clientip"
  }
}
...
…and just take a look at that beauty:
{
        "request" => "/",
          "agent" => "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8\"",
          "geoip" => {
               "timezone" => "America/Toronto",
               "latitude" => 43.4464,
         "continent_code" => "NA",
              "city_name" => "Oakville",
          "country_code2" => "CA",
           "country_name" => "Canada",
          "country_code3" => "CA",
            "region_name" => "Ontario",
               "location" => [
            [0] -79.7593,
            [1] 43.4464
        ],
            "postal_code" => "L6M",
              "longitude" => -79.7593,
            "region_code" => "ON"
    },
           "auth" => "-",
          "ident" => "-",
    ...
Isn’t that amazing? geoip picked up the clientip property produced by grok and added a few more properties of its own. And I didn’t even have to download any external data for that to work!
And now that Logstash has enriched the data, we can go to Elasticsearch and learn from it. For instance, did we have any visitors from Oakville?
$ curl -s 127.0.0.1:9200/logstash-2017.02.12/_search?q=oakville | json_pp
#{
# ...
#  "hits" : [
#    {
#      "_id" : "AVow3zOW6fU5oFCNC7kH",
#      "_score" : 1.5404451,
#      "_type" : "logs",
#      "_index" : "logstash-2017.02.12",
#      "_source" : {
#        "geoip" : {
#          ...
#          "city_name" : "Oakville",
#        },
#        "response" : "200",
#        "@timestamp" : "2017-02-12T05:50:18.333Z",
#        "agent" : "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8\"",
#        "request" : "/",
#...
#      }
#    }
#  ],
#...
Yes, we did. How many HTTP errors did we have yesterday? Were there any attempts to access forbidden URLs? How many? From where? Now we’re free to ask anything.
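Those questions are each a query away; for instance (hypothetical queries I haven’t run here, using the response field produced by grok), counting 404s and 403s could look like this:

$ curl '127.0.0.1:9200/logstash-2017.02.12/_search?q=response:404&pretty'
$ curl '127.0.0.1:9200/logstash-2017.02.12/_search?q=response:403&pretty'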
Summary
Logstash is the kind of tool that looks much more interesting after dealing with it than after reading its description. “Logstash is a server-side data processing pipeline…” Yeah, whatever. But after you try putting those small bricks of inputs, outputs and filters together, suddenly it all makes sense, and now I want to try different combinations of settings to see what else I can get from my data.
Logstash works well with Elasticsearch, but both its inputs and outputs can read from and write to a huge variety of sources, from message queues to raw TCP sockets.
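As a quick illustration (just a sketch, not something from this post’s setup), listening on a raw TCP socket is one more input block away:

input {
  tcp { port => 5000 }
}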
But there’s still one piece missing from the picture – a UI. After all, analyzing logs by running command-line queries isn’t that productive. We’ll solve that next time by looking at Kibana.