So far we’ve been dealing with name-value monitoring data. However, what works well for numeric readings isn’t necessarily useful for textual data. In fact, Grafana, Graphite and Prometheus are useless for the other kinds of monitoring records – logs and traces.
There’re many, many tools for dealing with those, but I decided to take a look at Elastic’s ELK stack: Elasticsearch, Logstash and Kibana – storage, data processor and visualization tool. And today we’ll naturally start with the first letter of the stack: “E”.
What’s Elasticsearch
Elasticsearch is a fast, horizontally scalable, open source search engine. It provides an HTTP API for storing and indexing JSON documents, and with the default configuration it behaves a little bit like a searchable NoSQL database.
Installation
Elasticsearch is written in Java, so installation is very easy: download the archive and launch bin/elasticsearch in it. However, running it through the official Docker container is even simpler: docker run -d -p 9200:9200 elasticsearch. Port 9200 is the front door, so let’s look at what’s inside.
Looking around
Official guides usually use Kibana for running demo queries, but c’mon, it’s just HTTP and JSON – we don’t need a separate tool for that when there’s a terminal and curl! Elasticsearch is supposed to be listening on port 9200, so let’s send a blank request to it and see what happens:
```
$ curl 127.0.0.1:9200
#{
#  "name" : "e-wGdWV",
#  "cluster_name" : "elasticsearch",
#  "cluster_uuid" : "ZxPcxDlFTSu68zpY9foYiw",
#  "version" : {
#    "number" : "5.2.0",
#    "build_hash" : "24e05b9",
#    "build_date" : "2017-01-24T19:52:35.800Z",
#    "build_snapshot" : false,
#    "lucene_version" : "6.4.0"
#  },
#  "tagline" : "You Know, for Search"
#}
```
Version 5.2.0 – seems to be the latest one.
There’re other queries we can run without adding any data. For instance, we can check the node’s health status:
```
$ curl 127.0.0.1:9200/_cat/health?v
#epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
#1486360312 05:51:52  elasticsearch yellow          1         1      5   5    0    0        5             0                  -                 50.0%
```
Or get the list of current indices:
```
$ curl 127.0.0.1:9200/_cat/indices?v
#health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
```
Obviously, a brand new installation has none. But as the number of unfamiliar words starts to climb, let’s take a look at the elasticsearch glossary.
Elasticsearch glossary
So we’ve already mentioned the node, which is a single running instance of elasticsearch with its own storage and settings. Even one node counts as a cluster, but having several of them, in conjunction with index sharding (similar to Kafka topic partitioning) and replication, would both decrease response time and increase the index’s chances of survival.
The term index itself describes a collection of documents. Your cluster can have as many of them as you want. Within an index you can categorize your documents by types – arbitrary names describing documents of similar structure, e.g. customers or paychecks. Finally, a document is good old JSON.
Create, Read, Update, Delete
But enough of theory, it’s time to do something with the data.
Create
Adding a new document to elasticsearch is as easy as an HTTP POST request:
```
$ curl -X POST 127.0.0.1:9200/monitor/logs?pretty -d '{
  "kind": "info",
  "message": "The server is up and running"
}'
#{
#  "_index" : "monitor",
#  "_type" : "logs",
#  "_id" : "AVoWblBE6fU5oFCNC7jY",
#  "_version" : 1,
#  "result" : "created",
#  "_shards" : {
#    "total" : 2,
#    "successful" : 1,
#    "failed" : 0
#  },
#  "created" : true
#}
```
We posted a new { "kind": "info", "message": "..." } document to an index called monitor and a type named logs. Neither of them existed before, but elasticsearch created both while indexing the document. It also responded with JSON containing the newly inserted document’s ID (_id) and some other details. It’s also possible to provide your own ID by using a PUT request instead of POST and appending the ID to the URL, e.g. -X PUT 127.0.0.1:9200/monitor/logs/42. The ?pretty query string parameter is only there to format the response JSON.
As not many people would actually enjoy inserting documents one by one, there’s also a bulk insert option.
```
$ curl -X POST 127.0.0.1:9200/monitor/logs/_bulk -d '
{ "index": {}}
{ "kind" : "warn", "message": "Using 90% of memory" }
{ "index": {}}
{ "kind" : "err", "message": "OutOfMemoryException: Epic fail has just happened" }
'
```
A bulk request requires two JSON lines per document: the first one describes the bulk operation kind (in our case – “index”) and the second one is the document itself.
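That two-lines-per-document format is just newline-delimited JSON, so it’s easy to generate programmatically. Here’s a minimal sketch in Python (the bulk_body helper is illustrative, not part of any elasticsearch client library):

```python
import json

def bulk_body(docs):
    """Build an NDJSON _bulk payload: an action line, then the document, per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))  # action/metadata line
        lines.append(json.dumps(doc))            # the document source itself
    return "\n".join(lines) + "\n"  # the _bulk API expects a trailing newline

payload = bulk_body([
    {"kind": "warn", "message": "Using 90% of memory"},
    {"kind": "err", "message": "OutOfMemoryException: Epic fail has just happened"},
])
```

The resulting string is exactly what goes into the -d body of the _bulk request above.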
Read
Now that we have something in the index, we can perform a simple search to read the documents back. With default settings elasticsearch stores the full copy of a document along with its index, so a search with empty criteria behaves like a SELECT * statement:
```
$ curl 127.0.0.1:9200/monitor/_search?pretty
#{
#  .........
#  "hits" : {
#    "total" : 3,
#    "max_score" : 1.0,
#    "hits" : [
#      {
#        "_index" : "monitor",
#        "_type" : "logs",
#        "_id" : "AVoWe_7d6fU5oFCNC7jb",
#        "_score" : 1.0,
#        "_source" : {
#          "kind" : "err",
#          "message" : "OutOfMemoryException: Epic fail has just happened"
#        }
#      },
#      {
#        "_index" : "monitor",
#        "_type" : "logs",
#        "_id" : "AVoWe_7d6fU5oFCNC7ja",
#        "_score" : 1.0,
#        "_source" : {
#          "kind" : "warn",
#          "message" : "Using 90% of memory"
#        }
#      },
#      {
#        "_index" : "monitor",
#        "_type" : "logs",
#        "_id" : "AVoWblBE6fU5oFCNC7jY",
#        "_score" : 1.0,
#        "_source" : {
#          "kind" : "info",
#          "message" : "The server is up and running"
#        }
#      }
#    ]
#  }
#}
```
It’s also possible to get a single document by its ID:
```
$ curl 127.0.0.1:9200/monitor/logs/AVoWblBE6fU5oFCNC7jY?pretty
#{
#  ...
#  "_source" : {
#    "kind" : "info",
#    "message" : "The server is up and running"
#  }
#}
```
Update
Similarly, knowing a document’s ID, we can update it. The “Epic fail has just happened” message for an OutOfMemoryException is probably saying less than it should, so let’s update it:
```
$ curl -X POST 127.0.0.1:9200/monitor/logs/AVoWe_7d6fU5oFCNC7jb -d '
{ "kind": "err", "message": "OutOfMemoryException: The server process used all available memory" }'
```
However, under the hood elasticsearch doesn’t update the document in place, but rather replaces it with a new one, keeping the same ID.
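The replace-and-bump-version behaviour can be pictured with a tiny in-memory model. This is only a sketch of the observable semantics, not how elasticsearch is actually implemented:

```python
class TinyIndex:
    """Toy document store mimicking elasticsearch's update-by-replace semantics."""

    def __init__(self):
        self.docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, source):
        # An "update" stores a whole new document under the same ID
        # and increments its version; the old source is gone.
        version = self.docs.get(doc_id, (0, None))[0] + 1
        self.docs[doc_id] = (version, source)
        return {"_id": doc_id, "_version": version,
                "result": "created" if version == 1 else "updated"}

idx = TinyIndex()
idx.index("AVoWe_7d6fU5oFCNC7jb",
          {"kind": "err", "message": "OutOfMemoryException: Epic fail has just happened"})
resp = idx.index("AVoWe_7d6fU5oFCNC7jb",
                 {"kind": "err", "message": "OutOfMemoryException: The server process used all available memory"})
```

After the second call the old source is unreachable and the version counter is 2, which matches what the real API reports in its _version field.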
Delete
When you need to get rid of something, HTTP DELETE will do the trick. E.g. curl -X DELETE 127.0.0.1:9200/monitor/logs/AVoWe_7d6fU5oFCNC7jb
Search
But many NoSQL databases are capable of storing and retrieving JSON documents. The real power of elasticsearch is in search (duh). There’re two approaches to searching the data: the REST Request API for simple queries, and the more sophisticated Query DSL.
The REST Request API simply means there’s an additional argument to an HTTP GET request:
```
$ curl -s 127.0.0.1:9200/monitor/_search?q=memory | json_pp
#{ ....
#  "hits" : {
#    "hits" : [
#      {
#        "_id" : "AVoWe_7d6fU5oFCNC7ja",
#        "_source" : {
#          "kind" : "warn",
#          "message" : "Using 90% of memory"
#        },
#        ....
#        "_score" : 0.2824934
#      },
#      {
#        "_id" : "AVoWe_7d6fU5oFCNC7jb",
#        "_source" : {
#          "kind" : "err",
#          "message" : "OutOfMemoryException: The server process used all available memory"
#        },
#        ...
#        "_score" : 0.27233246
#      }
#    ],
#    "total" : 2,
#    "max_score" : 0.2824934
#  ...
#}
```
There’s not much you can put into a query string – a search term, maybe a sort= instruction, and that’s it. The Query DSL, on the other hand, is a full-blown domain-specific language with numerous search arguments, boolean expressions, result filters – all sorts of things to help find what we need.
A Query DSL search is also an HTTP GET request, but with a slightly trickier syntax. If we wanted to find non-critical log messages that mention memory status, we could use something like this:
```
$ curl -s 127.0.0.1:9200/monitor/_search -d '
{
  "query": {
    "bool": {
      "must": [
        { "match": { "kind": "info warn" }},
        { "match": { "message": "memory" }}
      ]
    }
  }
}' | json_pp
#{
#  ...
#  "hits" : {
#    "total" : 1,
#    "hits" : [
#      {
#        "_type" : "logs",
#        "_index" : "monitor",
#        "_source" : {
#          "message" : "Using 90% of memory",
#          "kind" : "warn"
#        },
#        "_score" : 0.5753642,
#        "_id" : "AVoWe_7d6fU5oFCNC7ja"
#      }
#    ],
#    "max_score" : 0.5753642
#  }
#}
```
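Roughly speaking, a match clause matches a document if any of its analyzed terms occurs in the field, and bool/must requires every clause to match. To see why only the warning ends up in the hits, here’s a rough local emulation in Python (scoring omitted; the tokenizer is a crude stand-in for elasticsearch’s analyzer):

```python
import re

def tokenize(text):
    # Crude stand-in for elasticsearch's standard analyzer:
    # lowercase and split on non-alphanumeric characters.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def match(field_value, query):
    # A match clause succeeds if ANY query term occurs in the field.
    return bool(tokenize(field_value) & tokenize(query))

docs = [
    {"kind": "info", "message": "The server is up and running"},
    {"kind": "warn", "message": "Using 90% of memory"},
    {"kind": "err",  "message": "OutOfMemoryException: The server process used all available memory"},
]

# bool/must: every clause has to match.
hits = [d for d in docs
        if match(d["kind"], "info warn") and match(d["message"], "memory")]
```

The info record passes the kind clause but not the message one, the err record fails the kind clause, so only the warning survives – same total of 1 as the real query returns.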
Aggregate
In addition to its searching capabilities, elasticsearch can aggregate stuff. Aggregation is a huge topic on its own, but to get a feeling for how it looks, here’s how we’d get statistics of logs grouped by their kind:
```
$ curl -s 127.0.0.1:9200/monitor/_search -d '
{
  "size": 0,
  "aggs": {
    "group_by_kind": {
      "terms": { "field" : "kind.keyword" }
    }
  }
}' | json_pp
#{
#  "aggregations" : {
#    "group_by_kind" : {
#      "sum_other_doc_count" : 0,
#      "buckets" : [
#        {
#          "key" : "err",
#          "doc_count" : 1
#        },
#        {
#          "doc_count" : 1,
#          "key" : "info"
#        },
#        {
#          "key" : "warn",
#          "doc_count" : 1
#        }
#      ],
#  ...
#}
```
Because the _search URL does both searching and aggregation, and we didn’t provide any search criteria (so the query would match everything), we added the "size": 0 parameter to keep the search results out of the response. The rest is quite self-explanatory.
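What the terms aggregation computes is essentially a group-by count over the exact (keyword) field values. A plain-Python equivalent of the buckets above (ordering by descending doc_count, then key, is an assumption based on elasticsearch’s defaults):

```python
from collections import Counter

docs = [
    {"kind": "info", "message": "The server is up and running"},
    {"kind": "warn", "message": "Using 90% of memory"},
    {"kind": "err",  "message": "OutOfMemoryException: The server process used all available memory"},
]

# Group documents by the exact value of "kind" and count them,
# mirroring the group_by_kind terms aggregation.
counts = Counter(d["kind"] for d in docs)
buckets = [{"key": k, "doc_count": n}
           for k, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
```

With our three sample documents every bucket holds one hit, which is why the real response shows err, info and warn with doc_count 1 each.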
Conclusion
I would say we just scratched the surface of elasticsearch, but we did much less than that. We accidentally dropped a feather on it, sneezed, and the air flow blew the feather away along with a few surface molecules that stuck to it. Updating documents by submitting a script, document schemas, filters, complex search and aggregation queries, clusters, document analysis – we covered none of that.
But we did cover enough to get a feeling for what the tool is: an easy-to-use search engine with a convenient API and a bazillion useful data exploration features to google. Next time we’ll take a look at how to feed it textual monitoring data with the help of Logstash.