
CSVs: Why I Gave Up Some of Their Benefits to Gain Others

The alternative I use allows for smaller file sizes and better performance.

Photo by Jose Aragones on Unsplash

CSVs are good, but they’re overrated.

I’ve been using CSVs for a very long time, just like the rest of the data science community. Then Pickle for some time.

CSVs work on every system without installing anything. After all, they’re just plain text files with a comma delimiter. This also makes them super simple to understand.

But we’ve been in this comfort zone for too long.

Most data applications benefit from giving up a little bit of flexibility. We can speed up data reading and writing time enormously by installing one more package.

In this post, we’ll discuss…

· Issues with CSV for Data Scientists;
· My thoughts on using Pickle files instead;
· Better alternatives for CSVs and Pickle;
· Benchmarking different file formats to store datasets.


You can access the Colab notebook I used in this post for benchmarking.

Issues with CSV for Data Scientists

CSVs have been around almost as long as we’ve been storing data. They’re no different from text files, except that CSVs follow a predictable pattern of commas.

Software uses this pattern to split the dataset into columns. But columns aren’t a feature of the CSV file itself.

Even column headers and row numbers have no special distinction inside a CSV file. We need to configure the software that reads the CSV to pick up the headers and row numbers, as shown below.
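
Here’s a minimal sketch of what that configuration looks like in Pandas, using a hypothetical data.csv. Nothing in the file itself marks which line holds the headers or which column holds the index; we have to say so explicitly.

import pandas as pd

# Nothing in a CSV marks headers or indexes; we tell Pandas explicitly.
df = pd.read_csv(
    "data.csv",   # hypothetical file
    header=0,     # treat the first line as column names
    index_col=0,  # treat the first column as the row index
)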

Some software is smart enough to detect headers, for instance when the first line doesn’t follow the data type of the rest of the column. But those are only educated guesses programmed by someone.

Simplicity works well in many instances. Especially if you have no information on what software the client is using, CSVs are great. Share and forget!


But CSVs aren’t optimized for storage or performance.

We can think of accessibility, performance, and file size as the three corners of a trade-off triangle. You adjust one; the other two adjust themselves.

Image by the author.

CSVs push the accessibility corner to its maximum, which leaves performance and file size much weaker. Later in this post, we’ll compare the file sizes and the saving and loading times of CSVs with other formats.

Another drawback of CSVs shows up when the text contains Unicode characters. You may need to explicitly set the encoding parameter to one of the many supported values.

Here’s an example of how you can set encoding in Pandas.

import pandas as pd

df = pd.read_csv("/path/to/file.csv", encoding="ISO-8859-1")

If your dataset is extensive and you don’t know which encoding was used, you’re in trouble.
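
One practical escape hatch, as a sketch: libraries such as chardet can guess the encoding from a sample of the raw bytes. The guess is probabilistic, not guaranteed.

import chardet

# Read a chunk of raw bytes and let chardet guess the encoding.
# detect() returns a dict like {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
with open("/path/to/file.csv", "rb") as f:
    guess = chardet.detect(f.read(100_000))

print(guess["encoding"], guess["confidence"])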

We need a file format that stores file metadata (such as headers and encoding), takes the least amount of time to read from and write to the disk, and is smaller in size.


How about using Pickle instead?

Pickle is Python’s object serialization format. It isn’t designed specifically for dataframes, but it works well on them.

Pickle also stores the dataframe’s headers, row numbers, and other metadata. So if you’re reading dataframes with Pandas, the engine doesn’t have to spend time figuring out data types and headers.

Writing a Pickle to disk and reading it back are almost direct memory-to-disk and disk-to-memory copies, because Pickle serializes the in-memory object as it is.

The size of a Pickle file on the disk may vary. In most cases, it’s slightly larger than a CSV because it stores more information about the dataset. But Pickle also stores the data as a byte stream, which can make it smaller than a text-based CSV.
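
To see the metadata benefit in action, here’s a small sketch: a datetime column survives a Pickle round trip untouched, while a CSV round trip turns it back into plain strings unless you re-parse it.

import pandas as pd

df = pd.DataFrame({"when": pd.to_datetime(["2021-01-01", "2021-06-15"])})

df.to_pickle("sample.pkl")
print(pd.read_pickle("sample.pkl").dtypes)  # when: datetime64[ns]

df.to_csv("sample.csv", index=False)
print(pd.read_csv("sample.csv").dtypes)     # when: object (plain strings)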

Also, when you’re using Python, you need no other installation to benefit from Pickle.

df.to_pickle("./path/to/file.pkl") # to write a dataframe as pickle
df = pd.read_pickle("./path/to/file.pkl") # to read a pickled dataframe

On our trade-off triangle, Pickle files give up a bit of accessibility to gain the performance and file size benefits, because you can read Pickle files only from a Python program. They may not work well if others in your organization use R, Excel, or other software.


Pickling may solve a number of issues with file storage. But there are better options if you’re okay with narrowing the accessibility trait further.

An ideal use case of Pickle files for data scientists

We’ll discuss powerful alternatives for CSVs and Pickle in the following few sections.

Yet, they aren’t to be discarded altogether. Not yet!

Since Pickle files are a snapshot of an object in memory, they’re great for storing trained ML models.

For instance, we store machine learning models and their weights as Pickle files. Popular libraries such as scikit-learn and TensorFlow create picklable models.

Here’s an example with TensorFlow.

import joblib
import tensorflow as tf

model = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(5,)),
            tf.keras.layers.Dense(units=16, activation='elu'),
            tf.keras.layers.Dense(units=16, activation='elu'),
            tf.keras.layers.Dense(units=8, activation='elu'),
            tf.keras.layers.Dense(units=8, activation='elu'),
            tf.keras.layers.Dense(units=5, activation='softmax'),
        ])

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# X_train: features of shape (n_samples, 5); y_train: one-hot labels of shape (n_samples, 5)
model.fit(X_train, y_train, epochs=200, batch_size=8)
joblib.dump(model, 'classifier.pkl')
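
The same pattern works for scikit-learn. Here’s a minimal sketch with made-up training arrays; the model and data are purely illustrative.

import joblib
from sklearn.linear_model import LogisticRegression

# Placeholder training data for illustration only.
X = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.6], [0.8, 0.1]]
y = [0, 1, 0, 1]

clf = LogisticRegression().fit(X, y)
joblib.dump(clf, "classifier.pkl")   # snapshot the fitted model to disk

clf = joblib.load("classifier.pkl")  # restore it later, ready to predict
print(clf.predict([[0.5, 0.5]]))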

Better alternatives for CSVs and Pickle

I used CSVs for a very long time, just like the rest of the data science community, and then Pickle for a while. Today, I prefer other binary file formats for storing data.

Two of them in particular: Feather and Parquet.


Feather file format

The feather file format is a fast, language-agnostic data frame storage for Python (pandas) and R.

Feather is optimized for low storage space and high performance. This makes it a little less accessible than CSVs.

While CSVs work on any machine that understands text, Feather works only with Python and R. Also, it doesn’t come pre-installed; we need to install it ourselves.

You can grab it from the PyPI repository if you’re using Python.

pip install feather-format
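
Depending on your Pandas version, installing pyarrow may be all you need, since recent Pandas releases read and write Feather through the pyarrow library.

pip install pyarrow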

R programmers can install Feather directly from the Git repository using the following command.

devtools::install_github("wesm/feather/R")

Once you’ve installed the package, changes to your existing code base are minimal.

Using the Feather format is fairly simple.

The Pandas library has built-in methods for the Feather format. You can use to_feather and read_feather to save and load data on the disk.

# Change this
df = pd.read_csv(...)
# To this
df = pd.read_feather(...)

# Change this
df.to_csv(...)
# To this
df.to_feather(...)
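
One caveat worth knowing: to_feather expects a default integer index, so if your dataframe carries a custom index, reset it first. A one-line sketch:

# Feather can't serialize a custom index; move it into a regular column first.
df.reset_index().to_feather("/path/to/file.feather")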

Parquet file format

Parquet is another binary file format that provides advantages over using text files.

Parquet uses the record shredding and assembly algorithm described in the Dremel paper. It represents nested structures efficiently in columnar storage.

As a result, queries that process large amounts of nesting can execute faster than their equivalents in text-based formats.

The file size of Parquet is usually smaller than text formats.

Using Parquet with Pandas is as straightforward as the Feather format.

# Change this
df = pd.read_csv(...)
# To this
df = pd.read_parquet(...)
# Change this
df.to_csv(...)
# To this
df.to_parquet(...)
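
Because Parquet is columnar, you can also read just the columns you need, something a row-oriented CSV can’t do without scanning every line. A quick sketch, with hypothetical column names:

import pandas as pd

# Only the requested columns are read from the disk; the rest are skipped.
df = pd.read_parquet("/path/to/file.parquet", columns=["user_id", "timestamp"])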

Benchmarking different file formats to store datasets

I wanted to see it for myself. I wanted to test the various aspects of each file format.

So I wrote a little script. It fetches data from a publicly available dataset, then creates 1,000 different files of 10,000 rows each from the original by randomly sampling the Pandas dataframe.
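
Here’s a minimal sketch of the kind of timing loop the script runs; the dataset URL and the Feather format are stand-ins, and the script repeats the same loop for every format.

import time
import pandas as pd

df = pd.read_csv("https://example.com/dataset.csv")  # placeholder source

start = time.perf_counter()
for i in range(1000):
    # Sample 10,000 random rows and write each batch as its own file.
    # reset_index() because Feather can't serialize a custom index.
    df.sample(n=10_000).reset_index(drop=True).to_feather(f"sample_{i}.feather")
print(f"write: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
for i in range(1000):
    pd.read_feather(f"sample_{i}.feather")
print(f"read: {time.perf_counter() - start:.2f}s")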


We then record the time it takes to write all those files to the disk, the total file size on the disk, and the time it takes to read all those files back into memory.

We measure the above three values across all the file formats. Here are the results.

Parquet is the smallest file format.

Benchmarking file sizes of different file formats - image by the author.

Parquet files are significantly smaller than CSVs. They are even smaller than feather files.

On the other hand, JSON is the worst format for storing data on the disk. It takes more than twice the space CSVs take.

In terms of file size, Feather is slightly smaller than CSVs, and Pickle takes a bit more space.

Feather is the fastest to read and write.

File read/write performance compared between different file formats to store datasets - image by the author.

Feather’s read performance is impressive. It takes about half the time CSVs take to load. Even Pickle files are significantly slower than Feather.

Again, JSON takes far too long to load into memory, partly because it takes up so much disk space.

Feather is far better than CSV at writing performance, too. Only Pickle comes close to Feather, and it’s still slower.


Final considerations

If you’re sharing data with someone else and you are not sure if they have Python or any compatible software installed, it’s okay to use CSVs.

But on all other occasions, CSVs hurt you badly.

If your primary objective is to store data at a very small cost, use Parquet. It takes little space on the disk and its read/write performance is impressive.

If you’re using Python or R and read/write performance is your primary concern, Feather is far better than CSVs and Pickle. It’s also smaller in size.


Thanks for reading, friend! Say Hi to me on LinkedIn, Twitter, and Medium.

Not a Medium member yet? Please use this link to become a member because, at no extra cost for you, I earn a small commission for referring you.

