
Exporting Pandas Data to Elasticsearch

John A. Dungan
Towards Data Science
4 min read · Feb 24, 2019


How to send your dataframe rows to an Elasticsearch database.

Introduction

So you’re done downloading and analyzing your data and are ready to move it into an Elasticsearch database. This article covers preparing your data and sending it to an Elasticsearch endpoint.

Requirements

Python 3.6.5
numpy==1.15.0
pandas==0.23.4
elasticsearch==6.3.1

import numpy as np
import pandas as pd
from elasticsearch import Elasticsearch
from elasticsearch import helpers

es = Elasticsearch(http_compress=True)

Cleaning up your data

Pandas dataframes are happy to hold NaN values in your data, but tolerance for null values is not a trait of Elasticsearch. This makes sense when you think about it; you are asking an indexer to index nothing. You can avoid the problem by running your data through some simple functions.

Blank Dates

If your dataframe has blank dates, you’ll need to convert them to a value Elasticsearch accepts. Dates in Elasticsearch can be formatted date strings (e.g. “6-9-2016”), milliseconds since the Unix Epoch, or seconds since the Unix Epoch (elastic docs). A null date stored as milliseconds since the Unix Epoch comes out as January 1, 1970. If you have historical dates that include the early ’70s, you may want to choose a different sentinel value.

Here is a simple function you can use with the dataframe’s .apply() method to clean up your date columns.

from datetime import datetime

def safe_date(date_value):
    return (
        pd.to_datetime(date_value) if not pd.isna(date_value)
        else datetime(1970, 1, 1, 0, 0)
    )

df['ImportantDate'] = df['ImportantDate'].apply(safe_date)

Avoid Other Blank Values

Any fields with blank values are just as problematic as empty dates. String values are easier to handle than dates, but you still need to supply a value. The code below swaps a blank for a safe string using the df.apply() function.

def safe_value(field_val):
    return field_val if not pd.isna(field_val) else "Other"

df['Hold'] = df['PossiblyBlankField'].apply(safe_value)
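
If several columns can be blank, the same function works across all of them; the column names below are only placeholders:

# Apply the same safe default to every possibly-blank column
# (these column names are hypothetical)
for col in ['PossiblyBlankField', 'AnotherBlankField']:
    df[col] = df[col].apply(safe_value)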

Creating Documents

Once you are confident your data is ready to send to Elasticsearch, it’s time to convert each row into a document. Pandas dataframes have a handy iterrows() function to make this straightforward. It returns the index of the row and an object containing the row values. That object is a pandas.core.series.Series, but it behaves like a conventional Python dict. This snippet demonstrates assigning the iterrows() response to variables.

df_iter = df.iterrows()
index, document = next(df_iter)

Elasticsearch needs the data as a Python dict, which is easy enough using the .to_dict() method of the Series object. However, you can be selective about the data to be sent to the database by using a simple filter function. Notice that the function below returns a dict comprehension.

use_these_keys = ['id', 'FirstName', 'LastName', 'ImportantDate']

def filterKeys(document):
    return {key: document[key] for key in use_these_keys}

The Generator

We are finally ready to send data to Elasticsearch using the Python client and helpers. The helpers.bulk API requires an instance of the Elasticsearch client and a generator. If you’re not familiar with generators, go learn about their memory-respecting benefits. If you don’t have time for that, just understand that the magic is in the yield, which hands off data only as the helpers.bulk function asks for it. Here is the code for walking through your dataframe and shipping it off to Elasticsearch.

from elasticsearch import Elasticsearch
from elasticsearch import helpers

es_client = Elasticsearch(http_compress=True)

def doc_generator(df):
    df_iter = df.iterrows()
    for index, document in df_iter:
        yield {
            "_index": 'your_index',
            "_type": "_doc",
            "_id": f"{document['id']}",
            "_source": filterKeys(document),
        }

helpers.bulk(es_client, doc_generator(your_dataframe))
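
It is worth capturing what helpers.bulk returns: a tuple of the number of successfully executed actions and a list of errors (or an error count if stats_only=True). A minimal sketch, assuming the same client and dataframe as above:

# raise_on_error=False collects failures instead of raising BulkIndexError
success_count, errors = helpers.bulk(
    es_client, doc_generator(your_dataframe), raise_on_error=False
)
print(f"Indexed {success_count} documents")
if errors:
    print(f"{len(errors)} documents were rejected")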

Breaking down the generated dict

The doc_generator’s job is simply to provide a dict with certain values. Here are some comments on what each piece does.

"_index": 'your_index',

This is the name of your index in Elasticsearch. If you don’t have an index, you can use any valid index name here. Elasticsearch will do its best to automatically index your documents. However, creating your index in advance is a good idea to avoid rejected documents and to optimize the indexing process.
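
For example, you can create the index up front with an explicit mapping. This is a minimal sketch; the index name matches the examples above, and the field types are assumptions you should adjust to fit your data:

# Create the index in advance with an explicit mapping
# (field types here are assumptions based on the example documents)
mapping = {
    "mappings": {
        "_doc": {
            "properties": {
                "id": {"type": "keyword"},
                "FirstName": {"type": "text"},
                "LastName": {"type": "text"},
                "ImportantDate": {"type": "date"},
            }
        }
    }
}
if not es_client.indices.exists(index='your_index'):
    es_client.indices.create(index='your_index', body=mapping)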

"_type": "_doc",

Take note: _type is being deprecated by Elasticsearch. Version 6.3.1 still supports named types, but this is a good time to start converting to ‘_doc’.

"_id": f"{document['id']}",

The _id is a unique id for Elasticsearch. Don’t confuse it with your own ‘id’ field in your document. This might be a good place to add the index variable from iterrows() to make the document id more unique with something like f"{document['id']}-{index}".

"_source": filterKeys(document),

The _source is the heart of this exercise: the document to be saved. Using document.to_dict() or any other valid Python dict will work.
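
If you want every column rather than the filtered subset, a variant of the generator can call to_dict() directly (the name doc_generator_all_fields is just for illustration):

def doc_generator_all_fields(df):
    # Same as doc_generator, but ships every column in the row
    for index, document in df.iterrows():
        yield {
            "_index": 'your_index',
            "_type": "_doc",
            "_id": f"{document['id']}",
            "_source": document.to_dict(),
        }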

raise StopIteration

You may see similar generators end with this line as a courtesy to the bulk helper. Per PEP 479, raising StopIteration inside a generator is deprecated and becomes a RuntimeError in Python 3.7+, so it is omitted above. Simply letting the loop finish is enough; helpers.bulk handles the end of the generator on its own.

Conclusion

With attention to a few details, transferring your pandas data to an Elasticsearch database can be drama-free. Just remember that null values are a problem for Elasticsearch. The rest is creating a generator to process your rows into Python dictionaries.

