Elasticsearch features with the convenience of pandas

Elasticsearch for Data Science just got way easier

Eland is a brand-new Python package that bridges the gap between Elasticsearch and the Data Science ecosystem.

Mateus Picanço
Towards Data Science
10 min read · Jul 13, 2020

Elasticsearch is a feature-rich, open-source search engine built on top of Apache Lucene, one of the most important full-text search engines on the market.

Elasticsearch is best known for the vast and versatile REST API experience it provides, including efficient wrappers for full-text search, sorting, and aggregation tasks, making it a lot easier to implement such capabilities in existing backends without the need for complex re-engineering.

Ever since its introduction in 2010, Elasticsearch has gained a lot of traction in the software engineering domain, and by 2016 it had become the most popular enterprise search-engine software stack according to the DBMS knowledge base DB-Engines, surpassing the industry-standard Apache Solr (which is also built on top of Lucene).

Google Trends interest for Elasticsearch since its release in 2010 (Worldwide)

One of the things that makes Elasticsearch so popular is the ecosystem it garnered. Engineers worldwide developed open-source Elasticsearch integrations and extensions, and many of these projects were absorbed by Elastic (the company behind the Elasticsearch project) as part of their stack.

Among these projects were Logstash (a data processing pipeline, commonly used for parsing text-based files) and Kibana (a visualization layer built on top of Elasticsearch), leading to the now widely adopted ELK (Elasticsearch, Logstash, Kibana) stack.

The ELK stack quickly rose to prominence due to its impressive range of applications across emerging and established tech domains, such as DevOps, Site Reliability Engineering, and, most recently, Data Analytics.

But what about Data Science?

Chances are that if you're a data scientist reading this article and have Elasticsearch as part of your employer's tech stack, you might have had some problems trying to use all the features Elasticsearch provides for data analysis and even for simple machine learning tasks.

Data scientists are generally not used to NoSQL database engines for everyday tasks or even relying on complex REST APIs for analysis. Dealing with large amounts of data using Elasticsearch's low-level python clients, for example, is also not that intuitive and has somewhat of a steep learning curve for someone coming from a field different from SWE.

Although Elastic made significant efforts in enhancing the ELK stack for Analytics and Data Science use cases, it still lacked an easy interface with the existing Data Science ecosystem (pandas, NumPy, scikit-learn, PyTorch, and other popular libraries).

In 2017, Elastic took its first step towards the data science field and, as an answer to the growing popularity of Machine Learning and predictive technologies in the software industry, released their first ML-capable X-pack (extension pack) for the ELK stack, adding Anomaly Detection and other unsupervised ML tasks to its features. Not long after that, Regression and Classification models (1) were also added to the set of ML tasks available in the ELK stack.

Last week, Elastic took another step towards widespread adoption of Elasticsearch in the data science industry with the release of Eland, a brand new Python Elasticsearch client and toolkit with a powerful (and familiar) pandas-like API for analysis, ETL, and Machine Learning.

Eland: Elastic and Data

Eland enables data scientists to efficiently use the already robust Elasticsearch analysis and ML capabilities without requiring a deep knowledge of Elasticsearch and its many intricacies.

Features and concepts from Elasticsearch were translated into a much more recognizable setting. For instance, an Elasticsearch index, with its documents, mappings, and fields, becomes a dataframe, with rows and columns, much like we are used to seeing when using pandas.

# Importing Eland and low-level Elasticsearch clients for comparison
import eland as ed
from eland.conftest import *
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q

# Import pandas and numpy for data wrangling
import pandas as pd
import numpy as np

# For pretty-printing
import json

Typical data science use cases such as reading an entire Elasticsearch index into a pandas dataframe for Exploratory Data Analysis or training an ML model would usually require some not-so-efficient shortcuts.

# name of the index we want to query
index_name = 'kibana_sample_data_ecommerce'

# instantiating the client (connects to localhost by default)
es = Elasticsearch()

# defining the search statement to get all records in an index
search = Search(using=es, index=index_name).query("match_all")

# retrieving the documents from the search
documents = [hit.to_dict() for hit in search.scan()]

# converting the list of hit dictionaries into a pandas dataframe:
df_ecommerce = pd.DataFrame.from_records(documents)
# visualizing the dataframe with the results:
df_ecommerce.head()['geoip']
0 {'country_iso_code': 'EG', 'location': {'lon':...
1 {'country_iso_code': 'AE', 'location': {'lon':...
2 {'country_iso_code': 'US', 'location': {'lon':...
3 {'country_iso_code': 'GB', 'location': {'lon':...
4 {'country_iso_code': 'EG', 'location': {'lon':...
Name: geoip, dtype: object
# retrieving a summary of the columns in the dataset:
df_ecommerce.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4675 entries, 0 to 4674
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 category 4675 non-null object
1 currency 4675 non-null object
2 customer_first_name 4675 non-null object
3 customer_full_name 4675 non-null object
4 customer_gender 4675 non-null object
5 customer_id 4675 non-null int64
6 customer_last_name 4675 non-null object
7 customer_phone 4675 non-null object
8 day_of_week 4675 non-null object
9 day_of_week_i 4675 non-null int64
10 email 4675 non-null object
11 manufacturer 4675 non-null object
12 order_date 4675 non-null object
13 order_id 4675 non-null int64
14 products 4675 non-null object
15 sku 4675 non-null object
16 taxful_total_price 4675 non-null float64
17 taxless_total_price 4675 non-null float64
18 total_quantity 4675 non-null int64
19 total_unique_products 4675 non-null int64
20 type 4675 non-null object
21 user 4675 non-null object
22 geoip 4675 non-null object
dtypes: float64(2), int64(5), object(16)
memory usage: 840.2+ KB
# getting descriptive statistics from the dataframe
df_ecommerce.describe()

The procedure described above retrieves all the documents in the index as a list of dictionaries, only to load them into a pandas dataframe. That means holding both the documents themselves and the resulting dataframe in memory at some point in the process (2). This is not always feasible for Big Data applications, and exploring the dataset in a Jupyter Notebook environment can become complicated very fast.
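There is a further wrinkle with this manual approach: nested fields such as geoip arrive as dictionaries inside a single column, and flattening them takes an explicit json_normalize step. A sketch, with toy records standing in for real Elasticsearch hits:

```python
import pandas as pd

# Toy documents standing in for hits returned by search.scan()
documents = [
    {"customer_id": 1,
     "geoip": {"country_iso_code": "EG", "location": {"lon": 31.3, "lat": 30.1}}},
    {"customer_id": 2,
     "geoip": {"country_iso_code": "US", "location": {"lon": -74.0, "lat": 40.7}}},
]

# from_records keeps geoip as a single column of dicts...
df = pd.DataFrame.from_records(documents)

# ...so nested fields must be flattened manually:
flat = pd.json_normalize(documents)
print(list(flat.columns))
# ['customer_id', 'geoip.country_iso_code', 'geoip.location.lon', 'geoip.location.lat']
```

As we will see below, Eland performs this flattening for us.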

Eland enables us to perform operations very similar to the ones described above, without any of the friction involved in adapting them to the Elasticsearch context, while still benefiting from Elasticsearch's aggregation speed and search features.

# loading the data from the Sample Ecommerce data from Kibana into Eland dataframe:
ed_ecommerce = ed.read_es('localhost', index_name)
# visualizing the results:
ed_ecommerce.head()

As an added feature that would require a bit more wrangling on the pandas side, the geoip field (a nested JSON object in the index) was seamlessly parsed into columns in our dataframe. We can see that by calling the .info() method on the Eland dataframe.

# retrieving a summary of the columns in the dataframe:
ed_ecommerce.info()
<class 'eland.dataframe.DataFrame'>
Index: 4675 entries, jyzpQ3MBG9Z35ZT1wBWt to 0SzpQ3MBG9Z35ZT1yyej
Data columns (total 45 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 category 4675 non-null object
1 currency 4675 non-null object
2 customer_birth_date 0 non-null datetime64[ns]
3 customer_first_name 4675 non-null object
4 customer_full_name 4675 non-null object
5 customer_gender 4675 non-null object
6 customer_id 4675 non-null object
7 customer_last_name 4675 non-null object
8 customer_phone 4675 non-null object
9 day_of_week 4675 non-null object
10 day_of_week_i 4675 non-null int64
11 email 4675 non-null object
12 geoip.city_name 4094 non-null object
13 geoip.continent_name 4675 non-null object
14 geoip.country_iso_code 4675 non-null object
15 geoip.location 4675 non-null object
16 geoip.region_name 3924 non-null object
17 manufacturer 4675 non-null object
18 order_date 4675 non-null datetime64[ns]
19 order_id 4675 non-null object
20 products._id 4675 non-null object
21 products.base_price 4675 non-null float64
22 products.base_unit_price 4675 non-null float64
23 products.category 4675 non-null object
24 products.created_on 4675 non-null datetime64[ns]
25 products.discount_amount 4675 non-null float64
26 products.discount_percentage 4675 non-null float64
27 products.manufacturer 4675 non-null object
28 products.min_price 4675 non-null float64
29 products.price 4675 non-null float64
30 products.product_id 4675 non-null int64
31 products.product_name 4675 non-null object
32 products.quantity 4675 non-null int64
33 products.sku 4675 non-null object
34 products.tax_amount 4675 non-null float64
35 products.taxful_price 4675 non-null float64
36 products.taxless_price 4675 non-null float64
37 products.unit_discount_amount 4675 non-null float64
38 sku 4675 non-null object
39 taxful_total_price 4675 non-null float64
40 taxless_total_price 4675 non-null float64
41 total_quantity 4675 non-null int64
42 total_unique_products 4675 non-null int64
43 type 4675 non-null object
44 user 4675 non-null object
dtypes: datetime64[ns](3), float64(12), int64(5), object(25)
memory usage: 96.0 bytes
# calculating descriptive statistics from the Eland dataframe:
ed_ecommerce.describe()

We can also see that memory usage went from around 840 KB for the pandas dataframe to only 96 bytes for the Eland dataframe: we don't need to hold the entire dataset in memory to retrieve the information we need from the index. Most of the workload remains in the Elasticsearch cluster (3), in the form of aggregations or specific queries.

For such a small dataset this hardly matters, but as we scale to gigabytes of data, the benefits of not holding everything in memory for simple computations and analysis become much more noticeable.
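The pushdown idea is easy to picture: instead of scanning documents, summary statistics can be requested from the cluster as a single aggregation, returning a handful of numbers instead of thousands of hits. The request body below is only an illustration of the shape of such a query (Eland's actual requests are an internal detail), built on Elasticsearch's extended_stats aggregation:

```python
# Numeric fields from the sample e-commerce index
fields = ["taxful_total_price", "total_quantity"]

# One request computes count/min/max/avg/std for every field server-side.
agg_request = {
    "size": 0,  # return no hits, aggregations only
    "aggs": {
        f"{field}_stats": {"extended_stats": {"field": field}}
        for field in fields
    },
}

# With a live cluster this would be sent as:
#   es.search(index="kibana_sample_data_ecommerce", body=agg_request)
print(sorted(agg_request["aggs"]))
# ['taxful_total_price_stats', 'total_quantity_stats']
```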

Elasticsearch capabilities with DataFrames

Eland abstracts many of Elasticsearch's existing APIs so that data scientists don't need to learn Elasticsearch-specific syntax. For example, it is possible to get the mapping of an index (the equivalent of retrieving the dtypes attribute of a pandas DataFrame), but it is not immediately apparent how to do it. With the Eland DataFrame object, we can retrieve the dtypes attribute just as we would on a regular pandas DataFrame.

# getting the dtypes from pandas dataframe:
df_ecommerce.dtypes
category object
currency object
customer_first_name object
customer_full_name object
customer_gender object
...
total_quantity int64
total_unique_products int64
type object
user object
geoip object
Length: 23, dtype: object
# retrieving the Data types for the index normally would require us to perform the following Elasticsearch query:
mapping = es.indices.get_mapping(index_name)

# which by itself is an abstraction of the GET request for mapping retrieval
print(json.dumps(mapping, indent=2, sort_keys=True))
{
"kibana_sample_data_ecommerce": {
"mappings": {
"properties": {
"category": {
"fields": {
"keyword": {
"type": "keyword"
}
},
"type": "text"
},
"currency": {
"type": "keyword"
},
"customer_birth_date": {
"type": "date"
},
"customer_first_name": {
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
},
"type": "text"
},
"customer_full_name": {
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
},
"type": "text"
}
...
# Eland abstracts this procedure into the same pandas api:
ed_ecommerce.dtypes
category object
currency object
customer_birth_date datetime64[ns]
customer_first_name object
customer_full_name object
...
taxless_total_price float64
total_quantity int64
total_unique_products int64
type object
user object
Length: 45, dtype: object

With these abstractions in place, Eland allows us to use core Elasticsearch features that are not part of pandas (or at least are not as efficient), such as full-text search, Elasticsearch's most prominent use case.

# defining the full-text query we need: Retrieving records for either Elitelligence or Primemaster manufacturer
query = {
"query_string" : {
"fields" : ["manufacturer"],
"query" : "Elitelligence OR Primemaster"
}
}
# using full-text search capabilities with Eland:
text_search_df = ed_ecommerce.es_query(query)

# visualizing price of products for each manufacturer using pandas column syntax:
text_search_df[['manufacturer','products.price']]
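For contrast, the closest pandas equivalent on an in-memory copy is a plain value filter, with no analyzer, scoring, or inverted index behind it. A sketch, with toy data standing in for the real index:

```python
import pandas as pd

# Toy stand-in for the e-commerce dataframe
df = pd.DataFrame({
    "manufacturer": ["Elitelligence", "Primemaster",
                     "Oceanavigations", "Elitelligence"],
    "products.price": [11.99, 24.99, 7.99, 18.99],
})

# "Elitelligence OR Primemaster" as a naive in-memory filter
mask = df["manufacturer"].isin(["Elitelligence", "Primemaster"])
matches = df.loc[mask, ["manufacturer", "products.price"]]
print(len(matches))  # 3
```

This works for exact matches on a dataset that fits in memory; Elasticsearch's query string syntax additionally handles tokenization, relevance scoring, and fuzziness, and runs against the index without moving the data.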

More possibilities for integrations

This article only scratches the surface of the possibilities Eland opens up for data scientists and other data professionals using Elasticsearch in their day-to-day operations.

Especially in DevOps and AIOps contexts, where ML-based tools are not very mature yet, data professionals can benefit from Python's existing machine learning ecosystem to analyze large amounts of Observability and Metrics data, which will be a topic for another article.

Eland is undoubtedly a big step forward for Elasticsearch, and I look forward to what future versions of the ELK stack will bring to the table.

If you enjoyed this article

Check out this webinar where Seth Michael Larson (one of Eland's main contributors) goes over Eland's main features.

If you would like to see more content about Elasticsearch, Data Science, Information Retrieval, and NLP in the context of Observability, feel free to connect with me on LinkedIn and read my other articles on these topics.

Footnotes:

  1. As of version 7.8 of the ELK stack, Regression and Classification Machine learning jobs are still experimental.
  2. There are ways of using a more familiar SQL-like querying interface for these tasks, such as Elasticsearch's JDBC driver. However, it would still require some amount of familiarity with Elasticsearch concepts (index patterns and pagination, for example).
  3. This works similarly to other distributed computing modules, like Dask: Eland essentially maintains a query builder and task graph internally and runs these delayed tasks only when the data is actually requested.
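To make footnote 3 concrete, here is a toy version of that pattern: operations only record themselves in a task list, and nothing runs until the data is actually requested. This is an illustration of the general idea, not Eland's internals:

```python
class LazyFrame:
    """Records operations and executes them only when data is requested."""

    def __init__(self, data):
        self._data = data   # stands in for the Elasticsearch index
        self._tasks = []    # the deferred task list (the "task graph")

    def filter(self, predicate):
        self._tasks.append(("filter", predicate))
        return self  # chainable; still nothing executed

    def head(self, n=5):
        # Execution happens only here, when results are needed.
        rows = self._data
        for op, arg in self._tasks:
            if op == "filter":
                rows = [r for r in rows if arg(r)]
        return rows[:n]

lazy = LazyFrame([{"price": p} for p in (5, 15, 25, 35)])
lazy.filter(lambda r: r["price"] > 10)  # deferred: no work done yet
print(len(lazy._tasks))                 # 1 task recorded, nothing executed
print(lazy.head(2))                     # [{'price': 15}, {'price': 25}]
```

In Eland's case the "tasks" become parts of an Elasticsearch query, so filtering and aggregating happen on the cluster rather than in the client's memory.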
