Elasticsearch features with the convenience of pandas

Elasticsearch for Data Science just got way easier

Eland is a brand-new Python package that bridges the gap between Elasticsearch and the Data Science ecosystem.

Mateus Picanço
Towards Data Science
10 min read · Jul 13, 2020

Elasticsearch is a feature-rich, open-source search engine built on top of Apache Lucene, one of the most important full-text search engines on the market.

Elasticsearch is best known for the vast and versatile REST API experience it provides, including efficient wrappers for full-text search, sorting, and aggregation tasks, making it a lot easier to implement such capabilities in existing backends without the need for complex re-engineering.

Ever since its introduction in 2010, Elasticsearch has gained a lot of traction in the software engineering domain, and by 2016 it had become the most popular enterprise search-engine software stack according to the DBMS knowledge base DB-Engines, surpassing the industry-standard Apache Solr (which is also built on top of Lucene).

Google Trends interest for Elasticsearch since its release in 2010 (Worldwide)

One of the things that makes Elasticsearch so popular is the ecosystem it garnered. Engineers worldwide developed open-source Elasticsearch integrations and extensions, and many of these projects were absorbed by Elastic (the company behind the Elasticsearch project) as part of their stack.

Among these projects were Logstash (a data processing pipeline, commonly used for parsing text-based files) and Kibana (a visualization layer built on top of Elasticsearch), leading to the now widely adopted ELK (Elasticsearch, Logstash, Kibana) stack.

The ELK stack quickly rose to prominence due to its impressive range of applications across emerging and established tech domains, such as DevOps, Site Reliability Engineering, and, most recently, Data Analytics.

But what about Data Science?

Chances are that if you're a data scientist reading this article and have Elasticsearch as part of your employer's tech stack, you might have had some problems trying to use all the features Elasticsearch provides for data analysis and even for simple machine learning tasks.

Data scientists are generally not used to NoSQL database engines for everyday tasks or even relying on complex REST APIs for analysis. Dealing with large amounts of data using Elasticsearch's low-level python clients, for example, is also not that intuitive and has somewhat of a steep learning curve for someone coming from a field different from SWE.

Although Elastic made significant efforts in enhancing the ELK stack for Analytics and Data Science use cases, it still lacked an easy interface with the existing Data Science ecosystem (pandas, NumPy, scikit-learn, PyTorch, and other popular libraries).

In 2017, Elastic took its first step towards the data science field and, as an answer to the growing popularity of Machine Learning and predictive technologies in the software industry, released their first ML-capable X-pack (extension pack) for the ELK stack, adding Anomaly Detection and other unsupervised ML tasks to its features. Not long after that, Regression and Classification models (1) were also added to the set of ML tasks available in the ELK stack.

Last week, Elastic took another step towards widespread adoption of Elasticsearch in the data science industry with the release of Eland, a brand new Python Elasticsearch client and toolkit with a powerful (and familiar) pandas-like API for analysis, ETL, and Machine Learning.

Eland: Elastic and Data

Eland enables data scientists to efficiently use the already robust Elasticsearch analysis and ML capabilities without requiring a deep knowledge of Elasticsearch and its many intricacies.

Features and concepts from Elasticsearch were translated into a much more recognizable setting. For instance, an Elasticsearch index, with its documents, mappings, and fields, becomes a dataframe, with rows and columns, much like we are used to seeing when using pandas.

# Importing Eland and low-level Elasticsearch clients for comparison
import eland as ed
from eland.conftest import *
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q

# Import pandas and numpy for data wrangling
import pandas as pd
import numpy as np

# For pretty-printing
import json

Typical data science use cases such as reading an entire Elasticsearch index into a pandas dataframe for Exploratory Data Analysis or training an ML model would usually require some not-so-efficient shortcuts.

# name of the index we want to query
index_name = 'kibana_sample_data_ecommerce'

# instantiating the client (connects to localhost by default)
es = Elasticsearch()

# defining the search statement to get all records in an index
search = Search(using=es, index=index_name).query("match_all")

# retrieving the documents from the search
documents = [hit.to_dict() for hit in search.scan()]

# converting the list of hit dictionaries into a pandas dataframe:
df_ecommerce = pd.DataFrame.from_records(documents)
# visualizing the dataframe with the results:
df_ecommerce.head()['geoip']
0 {'country_iso_code': 'EG', 'location': {'lon':...
1 {'country_iso_code': 'AE', 'location': {'lon':...
2 {'country_iso_code': 'US', 'location': {'lon':...
3 {'country_iso_code': 'GB', 'location': {'lon':...
4 {'country_iso_code': 'EG', 'location': {'lon':...
Name: geoip, dtype: object
# retrieving a summary of the columns in the dataset:
df_ecommerce.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4675 entries, 0 to 4674
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 category 4675 non-null object
1 currency 4675 non-null object
2 customer_first_name 4675 non-null object
3 customer_full_name 4675 non-null object
4 customer_gender 4675 non-null object
5 customer_id 4675 non-null int64
6 customer_last_name 4675 non-null object
7 customer_phone 4675 non-null object
8 day_of_week 4675 non-null object
9 day_of_week_i 4675 non-null int64
10 email 4675 non-null object
11 manufacturer 4675 non-null object
12 order_date 4675 non-null object
13 order_id 4675 non-null int64
14 products 4675 non-null object
15 sku 4675 non-null object
16 taxful_total_price 4675 non-null float64
17 taxless_total_price 4675 non-null float64
18 total_quantity 4675 non-null int64
19 total_unique_products 4675 non-null int64
20 type 4675 non-null object
21 user 4675 non-null object
22 geoip 4675 non-null object
dtypes: float64(2), int64(5), object(16)
memory usage: 840.2+ KB
# getting descriptive statistics from the dataframe
df_ecommerce.describe()

The procedure described above retrieves all the documents in the index as a list of dictionaries, only to load them into a pandas dataframe. That means holding both the documents themselves and the resulting dataframe in memory at some point in the process (2). This is not always feasible for Big Data applications, and exploring the dataset in a Jupyter Notebook environment can become complicated very fast.
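There is a further wrinkle with this manual approach: nested fields such as geoip arrive as dictionaries inside a single column, and flattening them takes an explicit json_normalize step. A sketch, with toy records standing in for real Elasticsearch hits:

```python
import pandas as pd

# Toy documents standing in for hits returned by search.scan()
documents = [
    {"customer_id": 1,
     "geoip": {"country_iso_code": "EG", "location": {"lon": 31.3, "lat": 30.1}}},
    {"customer_id": 2,
     "geoip": {"country_iso_code": "US", "location": {"lon": -74.0, "lat": 40.7}}},
]

# from_records keeps geoip as a single column of dicts...
df = pd.DataFrame.from_records(documents)

# ...so nested fields must be flattened manually:
flat = pd.json_normalize(documents)
print(list(flat.columns))
# ['customer_id', 'geoip.country_iso_code', 'geoip.location.lon', 'geoip.location.lat']
```

As we will see below, Eland performs this flattening for us.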

Eland enables us to perform operations very similar to the ones described above, without any of the friction involved in adapting them to the Elasticsearch context, while still benefiting from Elasticsearch's aggregation speed and search features.

# loading the data from the Sample Ecommerce data from Kibana into Eland dataframe:
ed_ecommerce = ed.read_es('localhost', index_name)
# visualizing the results:
ed_ecommerce.head()

As an added feature that would require a bit more wrangling on the pandas side, the geoip field (a nested JSON object in the index) was seamlessly parsed into columns in our dataframe. We can see that by calling the .info() method on the Eland dataframe.

# retrieving a summary of the columns in the dataframe:
ed_ecommerce.info()
<class 'eland.dataframe.DataFrame'>
Index: 4675 entries, jyzpQ3MBG9Z35ZT1wBWt to 0SzpQ3MBG9Z35ZT1yyej
Data columns (total 45 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 category 4675 non-null object
1 currency 4675 non-null object
2 customer_birth_date 0 non-null datetime64[ns]
3 customer_first_name 4675 non-null object
4 customer_full_name 4675 non-null object
5 customer_gender 4675 non-null object
6 customer_id 4675 non-null object
7 customer_last_name 4675 non-null object
8 customer_phone 4675 non-null object
9 day_of_week 4675 non-null object
10 day_of_week_i 4675 non-null int64
11 email 4675 non-null object
12 geoip.city_name 4094 non-null object
13 geoip.continent_name 4675 non-null object
14 geoip.country_iso_code 4675 non-null object
15 geoip.location 4675 non-null object
16 geoip.region_name 3924 non-null object
17 manufacturer 4675 non-null object
18 order_date 4675 non-null datetime64[ns]
19 order_id 4675 non-null object
20 products._id 4675 non-null object
21 products.base_price 4675 non-null float64
22 products.base_unit_price 4675 non-null float64
23 products.category 4675 non-null object
24 products.created_on 4675 non-null datetime64[ns]
25 products.discount_amount 4675 non-null float64
26 products.discount_percentage 4675 non-null float64
27 products.manufacturer 4675 non-null object
28 products.min_price 4675 non-null float64
29 products.price 4675 non-null float64
30 products.product_id 4675 non-null int64
31 products.product_name 4675 non-null object
32 products.quantity 4675 non-null int64
33 products.sku 4675 non-null object
34 products.tax_amount 4675 non-null float64
35 products.taxful_price 4675 non-null float64
36 products.taxless_price 4675 non-null float64
37 products.unit_discount_amount 4675 non-null float64
38 sku 4675 non-null object
39 taxful_total_price 4675 non-null float64
40 taxless_total_price 4675 non-null float64
41 total_quantity 4675 non-null int64
42 total_unique_products 4675 non-null int64
43 type 4675 non-null object
44 user 4675 non-null object
dtypes: datetime64[ns](3), float64(12), int64(5), object(25)
memory usage: 96.0 bytes
# calculating descriptive statistics from the Eland dataframe:
ed_ecommerce.describe()

We can also see that memory usage went from around 840 KB for the pandas dataframe to only 96 bytes for the Eland dataframe: we don't need to hold the entire dataset in memory to retrieve the information we need from the index. Most of the workload remains in the Elasticsearch cluster (3), in the form of aggregations or specific queries.

For such a small dataset this hardly matters, but as we scale to gigabytes of data, the benefits of not holding everything in memory for simple computations and analysis become much more noticeable.
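The pushdown idea is easy to picture: instead of scanning documents, summary statistics can be requested from the cluster as a single aggregation, returning a handful of numbers instead of thousands of hits. The request body below is only an illustration of the shape of such a query (Eland's actual requests are an internal detail), built on Elasticsearch's extended_stats aggregation:

```python
# Numeric fields from the sample e-commerce index
fields = ["taxful_total_price", "total_quantity"]

# One request computes count/min/max/avg/std for every field server-side.
agg_request = {
    "size": 0,  # return no hits, aggregations only
    "aggs": {
        f"{field}_stats": {"extended_stats": {"field": field}}
        for field in fields
    },
}

# With a live cluster this would be sent as:
#   es.search(index="kibana_sample_data_ecommerce", body=agg_request)
print(sorted(agg_request["aggs"]))
# ['taxful_total_price_stats', 'total_quantity_stats']
```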

Elasticsearch capabilities with DataFrames

Eland abstracts many of Elasticsearch's existing APIs so that data scientists don't need to learn Elasticsearch-specific syntax. For example, it is possible to get the mapping of an index (the equivalent of retrieving the dtypes attribute of a pandas DataFrame), but it is not immediately apparent how to do it. With the Eland DataFrame object, we can retrieve the dtypes attribute just as we would on a regular pandas DataFrame.

# getting the dtypes from pandas dataframe:
df_ecommerce.dtypes
category object
currency object
customer_first_name object
customer_full_name object
customer_gender object
...
total_quantity int64
total_unique_products int64
type object
user object
geoip object
Length: 23, dtype: object
# retrieving the Data types for the index normally would require us to perform the following Elasticsearch query:
mapping = es.indices.get_mapping(index_name)

# which by itself is an abstraction of the GET request for mapping retrieval
print(json.dumps(mapping, indent=2, sort_keys=True))
{
"kibana_sample_data_ecommerce": {
"mappings": {
"properties": {
"category": {
"fields": {
"keyword": {
"type": "keyword"
}
},
"type": "text"
},
"currency": {
"type": "keyword"
},
"customer_birth_date": {
"type": "date"
},
"customer_first_name": {
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
},
"type": "text"
},
"customer_full_name": {
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
},
"type": "text"
}
...
# Eland abstracts this procedure into the same pandas api:
ed_ecommerce.dtypes
category object
currency object
customer_birth_date datetime64[ns]
customer_first_name object
customer_full_name object
...
taxless_total_price float64
total_quantity int64
total_unique_products int64
type object
user object
Length: 45, dtype: object

With these abstractions in place, Eland allows us to use core Elasticsearch features that are not part of pandas (or at least are not as efficient), such as full-text search, Elasticsearch's most prominent use case.

# defining the full-text query we need: Retrieving records for either Elitelligence or Primemaster manufacturer
query = {
"query_string" : {
"fields" : ["manufacturer"],
"query" : "Elitelligence OR Primemaster"
}
}
# using full-text search capabilities with Eland:
text_search_df = ed_ecommerce.es_query(query)

# visualizing price of products for each manufacturer using pandas column syntax:
text_search_df[['manufacturer','products.price']]
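For contrast, the closest pandas equivalent on an in-memory copy is a plain value filter, with no analyzer, scoring, or inverted index behind it. A sketch, with toy data standing in for the real index:

```python
import pandas as pd

# Toy stand-in for the e-commerce dataframe
df = pd.DataFrame({
    "manufacturer": ["Elitelligence", "Primemaster",
                     "Oceanavigations", "Elitelligence"],
    "products.price": [11.99, 24.99, 7.99, 18.99],
})

# "Elitelligence OR Primemaster" as a naive in-memory filter
mask = df["manufacturer"].isin(["Elitelligence", "Primemaster"])
matches = df.loc[mask, ["manufacturer", "products.price"]]
print(len(matches))  # 3
```

This works for exact matches on a dataset that fits in memory; Elasticsearch's query string syntax additionally handles tokenization, relevance scoring, and fuzziness, and runs against the index without moving the data.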

More possibilities for integrations

This article only scratches the surface of the possibilities Eland opens up for data scientists and other data professionals using Elasticsearch in their day-to-day operations.

Especially in DevOps and AIOps contexts, where ML-based tools are not very mature yet, data professionals can benefit from Python's existing machine learning ecosystem to analyze large amounts of Observability and Metrics data, which will be a topic for another article.

Eland is undoubtedly a big step forward for Elasticsearch, and I look forward to what future versions of the ELK stack will bring to the table.

If you enjoyed this article

Check out this webinar where Seth Michael Larson (one of Eland's main contributors) goes over Eland's main features.

If you would like to see more content about Elasticsearch, Data Science, Information Retrieval, and NLP in the context of Observability, feel free to connect with me on LinkedIn and read my other articles on these topics.

Footnotes:

  1. As of version 7.8 of the ELK stack, Regression and Classification Machine learning jobs are still experimental.
  2. There are ways of using a more familiar SQL-like querying interface for these tasks, such as Elasticsearch's JDBC driver. However, it would still require some amount of familiarity with Elasticsearch concepts (index patterns and pagination, for example).
  3. This works similarly to other distributed computing modules, like Dask: Eland essentially maintains a query builder and task graph internally and runs these delayed tasks only when the data is actually requested.
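To make footnote 3 concrete, here is a toy version of that pattern: operations only record themselves in a task list, and nothing runs until the data is actually requested. This is an illustration of the general idea, not Eland's internals:

```python
class LazyFrame:
    """Records operations and executes them only when data is requested."""

    def __init__(self, data):
        self._data = data   # stands in for the Elasticsearch index
        self._tasks = []    # the deferred task list (the "task graph")

    def filter(self, predicate):
        self._tasks.append(("filter", predicate))
        return self  # chainable; still nothing executed

    def head(self, n=5):
        # Execution happens only here, when results are needed.
        rows = self._data
        for op, arg in self._tasks:
            if op == "filter":
                rows = [r for r in rows if arg(r)]
        return rows[:n]

lazy = LazyFrame([{"price": p} for p in (5, 15, 25, 35)])
lazy.filter(lambda r: r["price"] > 10)  # deferred: no work done yet
print(len(lazy._tasks))                 # 1 task recorded, nothing executed
print(lazy.head(2))                     # [{'price': 15}, {'price': 25}]
```

In Eland's case the "tasks" become parts of an Elasticsearch query, so filtering and aggregating happen on the cluster rather than in the client's memory.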
