Why I Stopped Dumping DataFrames to a CSV and Why You Should Too
It’s time to say goodbye to pd.to_csv() and pd.read_csv()
Building an end-to-end data-driven pipeline is challenging and demanding. Having been there myself, I know the process is extremely tedious, and you inevitably end up with numerous intermediate files. While these files usually serve as checkpoints or feed later modules in the pipeline, you may be unknowingly inflating both run-time and storage requirements by not choosing an appropriate format for them. And the first preference for these files is almost always a CSV.
Being a Data Scientist myself, I understand that a CSV provides enormous flexibility for reading, writing, previewing, and exploring data. It's the go-to format for you, me, and almost everyone working with DataFrames. More often than not, I used to export DataFrames to CSV too, until recently, when I discovered a few time-efficient and storage-optimized alternatives.
Fortunately, Pandas offers a variety of file formats you can save your DataFrames to, such as:
- CSV
- Pickle
- Parquet
- Feather
- JSON
- HDF5
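Writing the same DataFrame to each of these formats is a one-liner in Pandas. A minimal sketch (the file names are illustrative; Parquet and Feather require pyarrow to be installed, and HDF5 requires tables):

```python
import pandas as pd

df = pd.DataFrame({"a": range(5), "b": list("abcde")})

# Formats that need no extra dependencies
df.to_csv("df.csv", index=False)
df.to_pickle("df.pkl")
df.to_json("df.json")

# Formats backed by optional libraries (pyarrow for Parquet/Feather, tables for HDF5)
try:
    df.to_parquet("df.parquet")
    df.to_feather("df.feather")
    df.to_hdf("df.h5", key="df", mode="w")
except ImportError:
    pass  # install pyarrow / tables to enable these formats
```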
This prompted me to rank the above-mentioned formats based on their empirical performance on the following parameters:
- The space they occupy on disk.
- The time they take for read and write operations to disk.
Experimental Setup
For experimentation purposes, I generated a random dataset in Python with a million rows and thirty columns — encompassing string, float and integer data types.
I repeated each experiment described below ten times to reduce randomness and draw fair conclusions from the observed results. The statistics below are averages across the ten experiments.
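A dataset like this can be generated with NumPy. The sketch below assumes ten columns of each dtype and a fixed seed, and is scaled down to 100,000 rows so it runs quickly; the actual experiment used a million:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded for reproducibility (an assumption)
n_rows = 100_000  # the experiment used 1_000_000

# Ten columns per dtype -> thirty columns total
data = {}
for i in range(10):
    data[f"float_{i}"] = rng.random(n_rows)
    data[f"int_{i}"] = rng.integers(0, 1_000, n_rows)
    data[f"str_{i}"] = rng.choice(["red", "green", "blue"], n_rows)

df = pd.DataFrame(data)
print(df.shape)  # (100000, 30)
```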
Experiments
Experiment 1: Space Utilized on Disk After Saving

- Clearly, HDF5 should not be your first choice if you are looking for a storage-optimized format. Here, it utilizes more than double the disk space of the next-largest format in the bar chart, JSON, which itself is close to double the size of the other four formats.
- So far, Parquet, CSV, Feather, and Pickle all appear to be appropriate options for storing our DataFrame, because they occupy roughly the same amount of secondary storage for the same data.
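You can reproduce this kind of comparison yourself with os.path.getsize. A minimal sketch for the formats that need no extra dependencies (the toy DataFrame and file names are illustrative):

```python
import os
import pandas as pd

df = pd.DataFrame({"x": range(50_000), "y": ["some repeated text"] * 50_000})

df.to_csv("size_test.csv", index=False)
df.to_pickle("size_test.pkl")
df.to_json("size_test.json")

# On-disk footprint of each file, in KiB
sizes = {path: os.path.getsize(path)
         for path in ("size_test.csv", "size_test.pkl", "size_test.json")}
for path, size in sizes.items():
    print(f"{path}: {size / 1024:.1f} KiB")
```

JSON lands largest here because its default orientation repeats the index key for every cell, while CSV stores each row once.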
Experiment 2: Time taken to load and save

This is where we start noticing the downsides of using a CSV format.
- Let's consider the load time alone for now. The time elapsed reading a CSV is almost three times that of the best available alternative here, Pickle. Moreover, as we saw earlier, Pickle and CSV take up the same amount of disk space, so why choose the slower option?
- Regarding saving time, CSV is the most expensive option of all, consuming close to eight times as long as Feather.
Naturally, whichever format you store your DataFrame in is the format you will use again when loading. In other words, once you have stored your DataFrame as a pickle, you have no choice but to read it back as a pickle file as well. Therefore, the third bar chart above looks at total efficiency, i.e., load time + save time.
- Sadly enough, CSV isn’t the best choice we have got.
- Compared to Feather, Parquet, and Pickle, a CSV is on average about 2.5 times slower, which is a substantial penalty.
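Timing a round trip like this is straightforward with time.perf_counter. A minimal sketch comparing CSV against Pickle (the toy DataFrame and file names are illustrative; the exact ratios will vary with your data and machine):

```python
import time
import pandas as pd

df = pd.DataFrame({"x": range(100_000), "y": ["label"] * 100_000})

def timed(fn):
    """Return the wall-clock seconds a call takes."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

csv_save = timed(lambda: df.to_csv("timing.csv", index=False))
csv_load = timed(lambda: pd.read_csv("timing.csv"))
pkl_save = timed(lambda: df.to_pickle("timing.pkl"))
pkl_load = timed(lambda: pd.read_pickle("timing.pkl"))

print(f"CSV    total: {csv_save + csv_load:.3f}s")
print(f"Pickle total: {pkl_save + pkl_load:.3f}s")
```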
In my opinion, both Parquet and Feather are the best available file formats out there to choose from the six we have explored in this post.
Concluding Notes
I know CSVs are great. I love them too, and I am a fan of CSVs for countless reasons, such as:
- If needed, CSV allows me to read only a subset of columns, saving RAM and reading time.
- CSV is essentially a text file, so Pandas allows me to preview just the top n rows (say 5, 10, or 15) of the file.
- Excel is one of my favorite tools, and I can open a CSV directly in Excel.
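The first two points map directly to the usecols and nrows parameters of pd.read_csv. A quick sketch (the file and column names are illustrative):

```python
import pandas as pd

# A throwaway CSV to read from
pd.DataFrame({"a": range(100), "b": range(100), "c": range(100)}).to_csv(
    "preview.csv", index=False
)

# Read only a subset of the columns...
subset = pd.read_csv("preview.csv", usecols=["a", "c"])

# ...or only the first five rows
head = pd.read_csv("preview.csv", nrows=5)
```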
However, CSV is killing your pipeline. It really is: you are spending an enormous amount of time on reading and writing operations simply because of having CSVs all over the place.
Unless you need to view your DataFrame in a non-Pythonic environment such as Excel, YOU DON'T NEED A CSV AT ALL. You should prefer Parquet, Feather, or Pickle because, as we observed above, they provide significantly faster read and write operations than a CSV.
So the next time you are about to execute pd.to_csv(), ask yourself whether you actually need a CSV.