
I’m not a fan of CSVs. They’re slow to query and take up more disk space than necessary. If you’re processing large CSV files in the cloud, that translates directly into higher costs.
There are many alternatives – such as Excel, databases, and HDF5 – but one in particular stands above the competition. That is, if you want to stick with row storage. It’s called Apache Avro, and you’ll learn all about it today.
If you want an even more efficient columnar data format, look no further than Parquet.
Today’s article answers the following questions:
- What is Avro?
- How to work with Avro in Python?
- CSV or Avro – which is more efficient?
But what is Avro?
Avro is an open-source project that provides data serialization and exchange services for Apache Hadoop. Don’t worry – you don’t need a big data environment – your laptop will do just fine. The guy behind Avro is Doug Cutting – he’s also known as the father of Hadoop.
If you’re unfamiliar with the term serialization, here’s a crash course. It is a process of converting objects such as arrays and dictionaries into byte streams that can be efficiently stored and transferred elsewhere. You can then deserialize the byte stream to get the original objects back.
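To make that concrete, here’s a quick illustration using Python’s built-in pickle module – nothing Avro-specific yet, just serialization in action:
import pickle
# Serialize a dictionary into a byte stream
record = {'symbol': 'AAPL', 'open': 170.1, 'close': 172.3}
byte_stream = pickle.dumps(record)
print(type(byte_stream))   # <class 'bytes'>
# Deserialize the byte stream to get the original object back
restored = pickle.loads(byte_stream)
print(restored == record)  # True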
Avro is row-orientated, just like CSV. This makes it different from, let’s say, Parquet, but it’s still a highly efficient data format.
Here’s another crash course – for those unfamiliar with the difference between row and column storage. Imagine you have the following data:

Here’s how it would be organized in both row and column storage:

Column storage files are more lightweight, since each column can be compressed with a method suited to its single data type. Row storage can’t be compressed as effectively, because a single row mixes multiple data types.
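Here’s the same idea sketched in plain Python – the values are made up, purely for illustration:
# Row storage: each record is kept together, mixing data types
rows = [
    {'symbol': 'AAPL', 'open': 170.1, 'volume': 1000000},
    {'symbol': 'MSFT', 'open': 310.5, 'volume': 2000000}
]
# Column storage: each column is kept together, one data type per block
columns = {
    'symbol': ['AAPL', 'MSFT'],
    'open': [170.1, 310.5],
    'volume': [1000000, 2000000]
}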
But here’s the deal with Avro – it stores a JSON-like schema with the data, so the correct data types are known in advance. That’s where the compression happens.
In a nutshell, a single Avro file contains a JSON-like schema for data types and the data itself in binary format.
Avro has an API for every major programming language. But there’s a catch – Pandas doesn’t support it by default. So no, you don’t have access to read_avro() and to_avro() functions. You’ll have to work around that.
It’s not difficult – as you’ll see next.
How to work with Avro in Python?
There are two installable libraries for working with Avro files:
- avro – Documentation link
- fastavro – Documentation link
The latter states the former library is dog slow, as it takes about 14 seconds to process 10K records. You’ll stick with fastavro for that reason.
Here’s how to set up a new virtual environment and install the necessary libraries (for Anaconda users):
conda create --name avro_env python=3.8
conda activate avro_env
conda install -c conda-forge pandas fastavro jupyter jupyterlab
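If you’re not an Anaconda user, the equivalent setup with pip inside any virtual environment should work just as well:
pip install pandas fastavro jupyter jupyterlab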
Execute the following command to start a JupyterLab session:
jupyter lab
You’ll use the NYSE stock prices dataset for the hands-on part. The dataset comes in CSV format – around 50 MB in size. Use the following snippet to import the required libraries and load the dataset:
import pandas as pd
from fastavro import writer, reader, parse_schema
df = pd.read_csv('prices.csv')
df.head()
Here’s what the stock prices dataset looks like:

Converting a Pandas DataFrame to an Avro file is a three-step process:
- Define the schema – Define a JSON-like schema that specifies which fields are expected, along with their respective data types. Write it as a Python dictionary and parse it with fastavro.parse_schema().
- Convert the DataFrame to a list of records – Use Pandas’ to_dict('records') to convert the DataFrame into a list of dictionaries.
- Write to an Avro file – Use fastavro.writer() to save the Avro file.
Here’s what all three steps look like in code:
# 1. Define the schema
schema = {
    'doc': 'NYSE prices',
    'name': 'NYSE',
    'namespace': 'stocks',
    'type': 'record',
    'fields': [
        {'name': 'date', 'type': {
            'type': 'string', 'logicalType': 'time-millis'
        }},
        {'name': 'symbol', 'type': 'string'},
        {'name': 'open', 'type': 'float'},
        {'name': 'close', 'type': 'float'},
        {'name': 'low', 'type': 'float'},
        {'name': 'high', 'type': 'float'},
        {'name': 'volume', 'type': 'float'}
    ]
}
parsed_schema = parse_schema(schema)

# 2. Convert pd.DataFrame to records - list of dictionaries
records = df.to_dict('records')

# 3. Write to Avro file
with open('prices.avro', 'wb') as out:
    writer(out, parsed_schema, records)
It’s not as straightforward as calling a single function, but it isn’t that difficult either. It could get tedious if your dataset has hundreds of columns, but that’s the price you pay for efficiency.
There’s also room for automating name and type generation. Get creative. I’m sure you can handle it.
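For example, here’s one rough way to generate the fields list straight from the DataFrame’s dtypes – a minimal sketch, where the dtype-to-Avro mapping is an assumption that only covers a few common cases:
def make_avro_fields(df):
    """Build the Avro 'fields' list from a DataFrame's dtypes."""
    # Assumed mapping - extend it to match your own data
    type_map = {
        'int64': 'long',
        'float64': 'float',
        'bool': 'boolean',
        'object': 'string'
    }
    return [
        {'name': column, 'type': type_map.get(str(dtype), 'string')}
        for column, dtype in df.dtypes.items()
    ]

# Plug the generated list into the schema defined above:
# schema['fields'] = make_avro_fields(df)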
Going from Avro back to a Pandas DataFrame is also a three-step process:
- Create a list to store the records – This list will hold the dictionary objects you can later convert to a Pandas DataFrame.
- Read and parse the Avro file – Use fastavro.reader() to read the file and then iterate over the records.
- Convert to a Pandas DataFrame – Call pd.DataFrame() and pass in the list of parsed records.
Here’s the code:
# 1. List to store the records
avro_records = []

# 2. Read the Avro file
with open('prices.avro', 'rb') as fo:
    avro_reader = reader(fo)
    for record in avro_reader:
        avro_records.append(record)

# 3. Convert to pd.DataFrame
df_avro = pd.DataFrame(avro_records)

# Print the first couple of rows
df_avro.head()
And here’s what the first couple of rows look like:

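Since Pandas has no built-in read_avro() or to_avro(), you can also wrap both workflows into small helpers. Here’s a minimal sketch – the function names are made up, not part of Pandas or fastavro:
import pandas as pd
from fastavro import writer, reader, parse_schema

def to_avro(df, path, schema):
    """Write a DataFrame to an Avro file using the given schema dictionary."""
    with open(path, 'wb') as out:
        writer(out, parse_schema(schema), df.to_dict('records'))

def read_avro(path):
    """Read an Avro file back into a DataFrame."""
    with open(path, 'rb') as fo:
        return pd.DataFrame(list(reader(fo)))

# Usage with the schema defined earlier:
# to_avro(df, 'prices.avro', schema)
# df_avro = read_avro('prices.avro')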
Both CSV and Avro versions of the dataset are identical – but which one should you use? Let’s answer that next.
CSV vs. Avro – Which one should you use?
It’s a bit tricky to answer this question. Some people like CSVs because you can edit them directly. Avro doesn’t come with that option. All results you’ll see are based on the original 50 MB CSV file. Your mileage may vary.
Here’s a comparison of read times – pd.read_csv() versus fastavro.reader() plus appending the records to a list:

Pandas is about three times faster here, given a 50 MB dataset.
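If you’d like to reproduce the read comparison on your own machine, here’s a rough timing sketch using Python’s built-in time module – exact numbers will vary between machines and runs:
import time
import pandas as pd
from fastavro import reader

# Time reading the CSV with Pandas
start = time.perf_counter()
df_csv = pd.read_csv('prices.csv')
print(f'CSV read: {time.perf_counter() - start:.2f} s')

# Time reading the Avro file and collecting the records into a list
start = time.perf_counter()
with open('prices.avro', 'rb') as fo:
    avro_records = [record for record in reader(fo)]
print(f'Avro read: {time.perf_counter() - start:.2f} s')
The same pattern works for timing the writes below.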
Here’s a comparison of write times – df.to_csv() versus fastavro.writer():

The tables have turned – Pandas takes longer to save the identical dataset.
The final comparison is probably the most interesting one – file size. Avro should provide a significant reduction thanks to its binary serialization. Here are the results:

Not quite the reduction Parquet offered, but it still isn’t too bad.
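Checking the sizes on your own files takes only a couple of lines with os.path.getsize:
import os

for path in ('prices.csv', 'prices.avro'):
    size_mb = os.path.getsize(path) / 1024 ** 2
    print(f'{path}: {size_mb:.1f} MB')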
Does it matter for 50 MB datasets? Probably not, but the savings scale on larger datasets, too. That’s especially important if you’re storing data on the cloud and paying for the overall size.
That’s something to think about.
Loved the article? Become a Medium member to continue learning without limits. I’ll receive a portion of your membership fee if you use the following link, at no extra cost to you.
Read every story from Dario Radečić (and thousands of other writers on Medium)
Stay connected
- Follow me on Medium for more stories like this
- Sign up for my newsletter
- Connect on LinkedIn