Elasticsearch (ES) is a distributed search engine designed for scalability and redundancy. ES has become increasingly popular in recent years because of its robustness and scalability in machine learning, storage and handling of large volumes of data for analytics, and many other applications. In this blog post, we’ll go over four things that data practitioners need to master to get started with ES: provisioning an ES cluster, designing indices, writing queries, and optimising indices.
1. Setting up an ES cluster
First, you need to know how to set up an ES cluster.
Elasticsearch is a distributed search and analytics engine that is part of the Elastic Stack. Queries that are run against an ES index, as well as the data in these indices, are distributed across nodes. You add these different nodes (i.e. servers) when you set up an Elasticsearch cluster. ES handles the distribution of the queries and data across the nodes in your cluster automatically. This ensures scalability and high availability.
You can provision an ES cluster through one of many managed service providers, including Amazon Web Services’ (AWS) Amazon Elasticsearch Service, Bonsai, or Elastic.co. You can also run ES locally on your machine. Any decision on how and where to provision, deploy, and manage your cluster depends on what kind of solution you’re building and what you intend to use ES for (more on this below).
As a side note: AWS has recently (April 2021) released its own "OpenSearch" project, an open-source fork of Elasticsearch and Kibana. This project includes OpenSearch (based on Elasticsearch 7.10.2) and OpenSearch Dashboards (based on Kibana 7.10.2). AWS plans to combine its offering of Elasticsearch and OpenSearch under a new name: Amazon OpenSearch Service.
2. Designing indices
Second, designing ES indices (that is, defining mappings) is key to mastering ES. Here, several concepts are key: documents, fields, indices, mappings, nodes, and shards.
Documents, fields, and indices
Data in Elasticsearch is stored in indices in the form of JSON objects called "documents". A document contains fields. These fields are key-value pairs that can contain a value (e.g. a string, integer, or a boolean) or a nested structure. Indices, in turn, denote two things: they are logical groupings of data that follow a schema, but they are also the physical organisation of the data through shards, which brings us to the next key concepts in ES: nodes and shards.
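To make the idea of documents and fields concrete, here is a minimal sketch of what a single document could look like. The field names and values are made up for illustration, as is the idea that it lives in an index called "books".

```python
import json

# A hypothetical book document: each key is a field, and each value is
# either a plain value or a nested structure.
book_document = {
    "title": "An Example Book Title",  # string
    "pages": 250,                       # integer
    "in_stock": True,                   # boolean
    "author": {                         # nested structure (a JSON object)
        "first_name": "Jane",
        "last_name": "Doe",
    },
}

# ES receives documents as JSON, so this is the body that would be sent
# when indexing the document into the (assumed) "books" index.
document_json = json.dumps(book_document, indent=2)
print(document_json)
```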
Nodes and shards
When we say that ES is built for redundancy, what do we actually mean? In concrete terms, the ES framework works through nodes and shards, and with primary shards and replicas. Each index consists of one or more physical shards. These physical shards form a logical group, where each shard is a "self-contained index".
The data that is stored in an index (i.e. the documents) is partitioned across shards, and shards are distributed across nodes. Each document belongs to one "primary shard", and each primary shard can be copied to one or more "replica shards" that live on other nodes. This design means that data resides in multiple places: a search can be served by any copy of a shard, and the data remains available if a node fails. This duplication of critical components is what we mean when we say that ES is designed for redundancy.
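The partitioning described above can be sketched in a few lines of Python. This is a toy model and not how ES is implemented: ES applies its own hash function to a routing value (the document ID by default), and CRC32 is used here only as a stand-in. The node names, the shard count, and the round-robin placement are all assumptions made for illustration.

```python
import zlib

NUM_PRIMARY_SHARDS = 3                   # assumed number of primary shards
NODES = ["node-a", "node-b", "node-c"]   # hypothetical node names


def primary_shard_for(doc_id, num_shards=NUM_PRIMARY_SHARDS):
    """Pick a primary shard from a hash of the document ID.

    CRC32 is only a stand-in for the hash function ES uses internally.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def shard_placement(num_shards, nodes):
    """Place each primary shard and one replica on two different nodes."""
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to separate primary and replica")
    placement = {}
    for shard in range(num_shards):
        placement[shard] = {
            "primary": nodes[shard % len(nodes)],
            "replica": nodes[(shard + 1) % len(nodes)],
        }
    return placement


placement = shard_placement(NUM_PRIMARY_SHARDS, NODES)
for doc_id in ["book-1", "book-2", "book-3"]:
    shard = primary_shard_for(doc_id)
    print(doc_id, "-> shard", shard, placement[shard])
```

Because the same ID always hashes to the same shard, ES can find a document again without searching every shard, and because each replica sits on a different node than its primary, losing one node does not lose the data.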
Selecting the right number and size of shards
The design of nodes and shards also has the advantage that data in different shards can be processed in parallel. Generally, to the extent that your CPU and memory allow, you can increase the speed of search by adding more shards. Adding shards does however come with some overhead, so it’s important that you think carefully about the appropriate number of shards (and their size) in your cluster.
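As a rough starting point for the number of shards, you can work backwards from the expected size of your index and a target shard size. The helper below is a back-of-the-envelope sketch: the function name is hypothetical, and the default target of 30 GB per shard is an assumption (a commonly cited rule of thumb puts shards in the tens of gigabytes), so treat the result as a first guess to validate against your own workload.

```python
import math


def estimate_primary_shards(index_size_gb, target_shard_size_gb=30.0):
    """Estimate how many primary shards keep each shard near the target size.

    The 30 GB default is an assumed rule of thumb, not a hard limit.
    """
    if index_size_gb <= 0 or target_shard_size_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(index_size_gb / target_shard_size_gb))


print(estimate_primary_shards(200))  # 200 / 30 = 6.67, rounded up to 7
print(estimate_primary_shards(10))   # a small index fits in a single shard: 1
```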
Defining your own ES mappings
Writing good ES mappings requires some practice (although those familiar with designing schemas in other database frameworks will likely find the transition fairly easy). Here, it’s important to learn to define your own mappings instead of using dynamic field mapping, and to be aware of the effect of your choice of field type on the size of your index and query flexibility.
Elasticsearch has a feature called "dynamic field mapping". When enabled, dynamic field mapping automatically infers the mapping of your documents when you write them to your index. However, this does not necessarily give you the result you want because it may not select the most suitable field types. You’ll usually want to define your own mapping since the choice of field types determines the size of your index and the flexibility you’ll have in querying data.
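An explicit mapping could look like the sketch below, written as the request body you would send when creating an index (in the typeless format used by ES 7 and later). The index contents, field names, and settings values are assumptions for illustration. Setting "dynamic" to "strict" tells ES to reject documents that contain fields you have not mapped, rather than guessing a type for them.

```python
import json

# Hypothetical request body for creating a "books" index.
books_index_body = {
    "settings": {
        "number_of_shards": 1,     # assumed values; tune for your cluster
        "number_of_replicas": 1,
    },
    "mappings": {
        "dynamic": "strict",       # reject unmapped fields instead of inferring them
        "properties": {
            "title": {"type": "text"},         # analysed, for full-text search
            "isbn": {"type": "keyword"},       # exact matching only
            "pages": {"type": "short"},        # smaller than integer, enough for page counts
            "published": {"type": "date"},
            "in_stock": {"type": "boolean"},
        },
    },
}

print(json.dumps(books_index_body, indent=2))
```

Choosing "short" over "integer" for the page count is an example of picking the smallest field type that still fits the data.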
Text versus keyword field types
For example, you could use the text field type for a string value. With this field type, the string is broken down into individual terms when your document is indexed, giving you the flexibility of partial matching when querying. Another option for a string value would be the keyword field type. This type, however, is not analysed (or rather: "tokenised") upon indexing, which limits your querying options to exact matching.
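The difference between the two field types can be imitated in plain Python. The sketch below is a deliberate simplification: the analyse function only lowercases and splits on non-alphanumeric characters, which is a rough stand-in for what ES does to a text field by default, and the matching functions are hypothetical helpers rather than ES behaviour in full.

```python
import re


def analyse(value):
    """Rough stand-in for analysis: lowercase, then split into terms."""
    return [term for term in re.split(r"[^0-9a-z]+", value.lower()) if term]


def text_match(field_value, query):
    """Mimic a text field: match if any query term is among the indexed terms."""
    indexed_terms = analyse(field_value)
    return any(term in indexed_terms for term in analyse(query))


def keyword_match(field_value, query):
    """Mimic a keyword field: the whole value must match exactly."""
    return field_value == query


title = "Introduction to Elasticsearch"

print(analyse(title))                          # ['introduction', 'to', 'elasticsearch']
print(text_match(title, "elasticsearch"))      # True: one term matches
print(keyword_match(title, "elasticsearch"))   # False: not the exact value
print(keyword_match(title, title))             # True: exact match
```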
Where’s my array field?
One additional thing to be aware of when writing mappings is that ES does not have an array field type. If your documents do include arrays, you have to use the type of the values in that array. For example, for an array of integers the correct ES field type is integer (see this page for a complete overview of ES field types).
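In practice, this means the mapping looks the same whether a field holds one value or many. In the sketch below, the "ratings" field and both documents are made up for illustration.

```python
# The mapping declares the type of the values, not the fact that there
# may be several of them.
ratings_mapping = {
    "mappings": {
        "properties": {
            "ratings": {"type": "integer"},  # no "array" type exists
        }
    }
}

# Both hypothetical documents fit this mapping.
doc_with_one_rating = {"ratings": 4}
doc_with_many_ratings = {"ratings": [5, 3, 4]}

for doc in (doc_with_one_rating, doc_with_many_ratings):
    values = doc["ratings"] if isinstance(doc["ratings"], list) else [doc["ratings"]]
    print(values, all(isinstance(v, int) for v in values))
```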
For an in-depth overview of how to define an ES mapping, and populate your indices, check out the first section of my earlier blog post on Creating and Managing ES indices with Python.
3. Writing queries
A third thing you need to learn is writing effective queries. There are several concepts to master here: DSL, relevance scores, and filter versus query.
Searching with Domain Specific Language (DSL)
Elasticsearch is all about search (well, mostly). A query for Elasticsearch is written in the Elasticsearch Query DSL (Domain Specific Language). The DSL is based on JSON and can be thought of as an "Abstract Syntax Tree (AST)" of queries. When a query is executed, ES computes a relevance score that tells you how well each document matches your query. The results (i.e. documents) returned by ES are sorted by this relevance score, which is found in the _score field in the results.
DSL includes two types of clauses. The first type, leaf query clauses, is used to search on a precise value in a specific field (e.g. when you use a term query to find documents that correspond with a precise value). The second type consists of compound query clauses: queries that are used to combine multiple queries in a logical way (potentially of different types, including leaf and compound queries). Compound query clauses can also be used to change the behaviour of these queries.
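The two kinds of clauses could look like this when written as Python dictionaries, which is also how you would typically pass them to a client library. The field names and values are assumptions for illustration; the sketch only builds the JSON body and does not talk to a cluster.

```python
import json

# Leaf query clauses: each one looks at a single field.
isbn_term_query = {"term": {"isbn": "0000000000"}}        # hypothetical ISBN value
title_match_query = {"match": {"title": "elasticsearch"}}

# A compound query clause: "bool" combines other clauses logically.
compound_search_body = {
    "query": {
        "bool": {
            "must": [title_match_query],                     # has to match
            "should": [{"match": {"title": "tutorial"}}],    # nice to have
            "must_not": [isbn_term_query],                   # has to be absent
        }
    }
}

print(json.dumps(compound_search_body, indent=2))
```

Because the DSL is nested JSON, any clause inside "must", "should", or "must_not" could itself be another "bool" query.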
Query and filter contexts
An ES query may include a query and a filter context. The query context asks how well a document matches your conditions, whereas the filter context serves to exclude documents that do not match certain conditions that you’ve set in the syntax. The query and filter contexts therefore also have a different "relationship" with the relevance score: the filter context does not affect it, while the query context does contribute to its value.
Deciding whether to use a filter or query context is an important part of writing effective queries. In general, filters are cheaper (computationally) as frequently used filters are automatically cached to enhance performance. However, filters are only suited to conditions that are unambiguous, that is, that have a clear "yes" or "no" answer. For example, filters are useful if you want to identify documents where a specific field has a precise numeric value.
In other cases, the question of what "matches" your search conditions is not as clear-cut. For these "ambiguous" cases, the query context should be used. An example of such a search case is when you’re looking for a partial book title in an index containing book metadata using a match query (we’re assuming you’ve forgotten the exact title of the book you’re looking for). Here, different documents (with different title field values) are potentially more relevant to your search than others. In this specific example, we’re dealing with full-text search, that is, queries to search analysed text fields (the "match" query is the standard query for full-text search).
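A single search can combine both contexts. In the sketch below, the build_book_search helper, the field names, and the filter values are hypothetical. The "must" clause runs in query context and contributes to _score, while everything under "filter" runs in filter context and only decides whether a document is included.

```python
import json


def build_book_search(partial_title, max_pages):
    """Build a search body that scores on title and filters on exact conditions."""
    return {
        "query": {
            "bool": {
                # Query context: "how well does this match?" (affects _score)
                "must": [{"match": {"title": partial_title}}],
                # Filter context: "does this match, yes or no?" (no effect on _score)
                "filter": [
                    {"term": {"in_stock": True}},
                    {"range": {"pages": {"lte": max_pages}}},
                ],
            }
        }
    }


search_body = build_book_search("search engine", max_pages=300)
print(json.dumps(search_body, indent=2))
```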
For an in-depth overview of writing ES queries and querying ES with the Python Elasticsearch Client, check out my earlier blog post on Creating and Managing ES indices with Python.
4. Optimising your indices
Finally, you need to master strategies for optimising your ES indices for their designated purpose.
Different indices may serve different purposes and therefore require specific optimisation strategies. For example, if you are using ES to process and store large volumes of data, you may have to optimise for disk usage. Here, you’ll need to learn how to define mappings effectively (for example, avoid using dynamic mapping, and use the smallest possible field types), and how to optimise for costs using a hot-warm-cold architecture with index lifecycle management (ILM) and automatic index rollover.
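To give an idea of what index lifecycle management involves, here is a sketch of an ILM policy body with automatic rollover. Every threshold in it is an assumption chosen for illustration, and it leaves out the settings that move indices onto warm or cold nodes, since those depend on how your cluster is laid out.

```python
import json

# Hypothetical ILM policy body; all sizes and ages are illustrative.
logs_ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    # Roll over to a new index once either limit is reached.
                    "rollover": {"max_size": "50gb", "max_age": "30d"},
                }
            },
            "warm": {
                "min_age": "30d",
                "actions": {
                    # Merge segments of an index that is no longer written to.
                    "forcemerge": {"max_num_segments": 1},
                },
            },
            "cold": {
                "min_age": "90d",
                "actions": {
                    # Recover these indices last after a node restart.
                    "set_priority": {"priority": 0},
                },
            },
            "delete": {
                "min_age": "365d",
                "actions": {"delete": {}},
            },
        }
    }
}

print(json.dumps(logs_ilm_policy, indent=2))
```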
For other applications, you may need to optimise for search speed, in which case you may need to look into index sorting, getting faster drives or faster CPUs (depending on the kind of search), or more effective document modelling (e.g. avoiding joins).
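As an example of one of these options, index sorting is configured in the index settings when the index is created. The sketch below assumes a "published" date field and that most searches want the newest books first. Both are illustrative choices.

```python
import json

# Hypothetical index body: documents are stored on disk sorted by
# "published", newest first. Index sorting has to be set at creation time.
sorted_books_index_body = {
    "settings": {
        "index": {
            "sort.field": "published",
            "sort.order": "desc",
        }
    },
    "mappings": {
        "properties": {
            "published": {"type": "date"},   # the sort field must be mapped
            "title": {"type": "text"},
        }
    },
}

print(json.dumps(sorted_books_index_body, indent=2))
```

Sorting at index time makes indexing somewhat more expensive, so it tends to pay off only when most of your searches actually sort on that field.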
For an in-depth overview of optimising your ES indices for disk usage, check out my earlier blog post on Optimising Disk Usage in Elasticsearch.
Elasticsearch is an extremely versatile framework: it can be used for, among many other things, storage, machine learning, and for analysing logs and events. ES is without a doubt a useful framework to add to your Data Science toolkit. In this blog post, we’ve covered some of Elasticsearch’s core concepts and principles to help you get started with using ES in your own applications.
Thanks for reading!
If you liked this article, here are some other articles you may enjoy:
How to Build a Relational Database from CSV Files Using Python and Heroku
Creating and Managing Elasticsearch Indices with Python
Disclaimer: "Elasticsearch" and "Kibana" are trademarks of Elasticsearch BV, registered in the U.S. and in other countries. Description and/or use of any third-party services and/or trademarks in this blog post should not be seen as endorsement for or by their respective rights holders.
Please read this disclaimer carefully before relying on any of the content in my articles on Medium.com.