Learn how to fit 20GB of CSV files into 16GB of RAM.

So you have some experience with Pandas, and you’re aware of its biggest limitation – it doesn’t scale all that easily. Is there a solution?
Yes – Dask DataFrames.
Most of the Dask API is identical to Pandas, but Dask can run computations in parallel on all CPU cores. It can even run on a cluster, but that’s a topic for another time.
Today you’ll see just how much faster Dask is than Pandas at processing 20GB of CSV files. The runtime values will vary from PC to PC, so we’ll compare relative values instead. For the record, I’m using a 16" MacBook Pro with an 8-core i9 and 16GB of RAM.
Start the series from the beginning:
- Dask Delayed – How to Parallelize Your Python Code With Ease
- Dask Arrays – How to Parallelize Numpy With Ease
The article is structured as follows:
- Dataset Generation
- Processing a Single CSV File
- Processing Multiple CSV Files
- Conclusion
Dataset Generation
We could download a dataset online, but that’s not the point of today’s article. We’re only interested in size and not what’s inside.
For that reason, we’ll create a dummy dataset with six columns. The first column is a timestamp – an entire year sampled at one-second intervals, and the other five are random integer values.
To make things more complex, we’ll create a separate file for every year from 2000 to 2020 – a bit over 20GB of CSVs in total.
Before starting, make sure to create a data folder right where your notebook is located. Here’s the code snippet for creating the CSV files:
import numpy as np
import pandas as pd
import dask.dataframe as dd
from datetime import datetime

for year in range(2000, 2021):
    # One full year sampled at one-second intervals
    dates = pd.date_range(
        start=datetime(year=year, month=1, day=1),
        end=datetime(year=year, month=12, day=31),
        freq='S'
    )
    df = pd.DataFrame()
    df['Date'] = dates
    # Five columns of random integers between 0 and 99
    for i in range(5):
        df[f'X{i}'] = np.random.randint(low=0, high=100, size=len(df))
    df.to_csv(f'data/{year}.csv', index=False)
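By the way, generating 20+GB of CSVs is itself a parallel-friendly job. If you’ve read the Dask Delayed article from this series, here’s an optional sketch – my own addition, not part of the benchmark – that wraps the same per-year logic in dask.delayed so the years can be generated concurrently:
# Optional sketch: parallelize the generation loop with dask.delayed
from dask import delayed, compute

@delayed
def generate_year(year):
    dates = pd.date_range(
        start=datetime(year=year, month=1, day=1),
        end=datetime(year=year, month=12, day=31),
        freq='S'
    )
    df = pd.DataFrame({'Date': dates})
    for i in range(5):
        df[f'X{i}'] = np.random.randint(low=0, high=100, size=len(df))
    df.to_csv(f'data/{year}.csv', index=False)
    return f'data/{year}.csv'

# Nothing runs until compute() is called - then the tasks run in parallel
# (the actual speedup depends on the scheduler and how much of the work releases the GIL)
compute(*[generate_year(year) for year in range(2000, 2021)])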
You can now use a basic Linux command to list the data directory:
!ls -lh data/
Here are the results:

As you can see, each file is around 1GB in size (1.09GB, to be more precise). The above code snippet took some time to execute, but it’s still far less than it would take to download a 20GB file.
Up next, let’s see how to process and aggregate a single CSV file.
Processing a Single CSV File
Goal: Read in a single CSV file, group the values by month and calculate the total sum for every column.
Loading a single CSV file with Pandas can’t be any easier. The read_csv() function accepts the parse_dates parameter, which automatically converts one or more columns to the datetime type.
This comes in handy because we can then use dt.month directly to access the month values. Here’s the complete code snippet:
%%time
df = pd.read_csv('data/2000.csv', parse_dates=['Date'])
monthly_total = df.groupby(df['Date'].dt.month).sum()
And here’s the total runtime:

Not too bad for a 1GB file, but the runtime will depend on your hardware. Let’s do the same thing with Dask. Here’s the code:
%%time
df = dd.read_csv('data/2000.csv', parse_dates=['Date'])
monthly_total = df.groupby(df['Date'].dt.month).sum().compute()
As always with Dask, no processing is done until the compute() function is called. You can see the total runtime below:

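If you want to convince yourself that nothing happens before compute(), you can inspect the lazy objects directly. Here’s a minimal sketch – the variable names are mine, and visualize() requires the optional graphviz dependency:
# The Dask DataFrame is lazy - read_csv() only records what has to be done
df = dd.read_csv('data/2000.csv', parse_dates=['Date'])
print(df.npartitions)  # number of chunks Dask will process in parallel

lazy_total = df.groupby(df['Date'].dt.month).sum()
print(type(lazy_total))  # still a Dask DataFrame - nothing has been computed yet

# lazy_total.visualize()  # optionally render the task graph (requires graphviz)
monthly_total = lazy_total.compute()  # this is where the actual work happens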
Let’s compare the differences:

It isn’t a significant difference, but Dask is overall a better option, even for a single data file. This is a good start, but what we’re really interested in is processing multiple files at once.
Let’s explore how to do so next.
Processing Multiple CSV Files
Goal: Read all CSV files, group them by year values and calculate the total sum for every column.
Working with multiple data files with Pandas is a tedious task. In a nutshell, you have to read the files one by one and then stack them vertically.
If you think about it, a single CPU core loads the datasets one at a time while the other cores sit idle. It’s not the most efficient way.
The glob module will help you handle multiple CSV files at once. You can use the data/*.csv pattern to grab all the CSV files in the data folder. Then, you’ll have to read them one by one in a loop. Finally, you can concatenate them and do the aggregation.
Here’s the complete code snippet:
%%time
import glob

all_files = glob.glob('data/*.csv')
dfs = []
for fname in all_files:
    # A single core reads the files one at a time
    dfs.append(pd.read_csv(fname, parse_dates=['Date']))

# Stack everything vertically, then aggregate by year
df = pd.concat(dfs, axis=0)
yearly_total = df.groupby(df['Date'].dt.year).sum()
And here are the runtime results:

15 and a half minutes seems like a lot, but you have to consider that a lot of swap memory was used in the process, as there isn’t a way to fit 20+GB of data into 16GB of RAM. If the notebook crashes completely, use a smaller number of CSV files.
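If you’re stuck with Pandas and limited RAM, one possible workaround – a rough sketch of my own, not part of the benchmark – is to aggregate each file right after reading it and keep only the tiny per-file results in memory. It works here because every file covers exactly one year:
import glob

partial_sums = []
for fname in glob.glob('data/*.csv'):
    df = pd.read_csv(fname, parse_dates=['Date'])
    years = df['Date'].dt.year
    # Aggregate right away and keep only the small per-year result;
    # the full ~1GB frame can then be garbage collected
    partial_sums.append(df.drop(columns='Date').groupby(years).sum())

# Each partial result has a single row, so concatenating them is cheap
yearly_total = pd.concat(partial_sums)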
Let’s see what improvements Dask has to offer. Its read_csv() function accepts a glob pattern directly, which means you won’t have to write any loops. As before, no operation is done until the compute() function is called – that’s just how the library works.
Here’s the entire code snippet for the same loading and aggregation as before:
%%time
df = dd.read_csv('data/*.csv', parse_dates=['Date'])
yearly_total = df.groupby(df['Date'].dt.year).sum().compute()
Here are the runtime results:

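The speedup comes from Dask splitting each CSV into blocks and processing those blocks on all cores. You can check – and, if needed, tune – the partitioning yourself; blocksize is a real read_csv() parameter, but the 128MB value below is purely an illustration:
# See how many partitions (blocks) Dask created for the whole data/ folder
df = dd.read_csv('data/*.csv', parse_dates=['Date'])
print(df.npartitions)

# Larger blocks mean fewer, bigger tasks (value picked for illustration only)
df = dd.read_csv('data/*.csv', parse_dates=['Date'], blocksize='128MB')
print(df.npartitions)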
Let’s compare the differences:

As you can see, the difference is more significant when processing multiple files – around 2.5X faster in Dask. A clear winner, no arguing here.
Let’s wrap things up in the next section.
Conclusion
Today you’ve learned how to switch from Pandas to Dask, and why you should do so when your datasets get large. Dask’s API is nearly identical to Pandas, so you shouldn’t have any trouble switching.
Keep in mind that some data formats aren’t supported in Dask – XLS, ZIP, and GZ, for example. Sorting isn’t well supported either, as it’s expensive to do in parallel.
Stay tuned for the last part of the series – Dask Bags – which will teach you how Dask works with unstructured data.
Loved the article? Become a Medium member to continue learning without limits. I’ll receive a portion of your membership fee if you use the following link, with no extra cost to you.
Learn More
- Top 3 Reasons Why I Sold My M1 Macbook Pro as a Data Scientist
- How to Schedule Python Scripts With Cron – The Only Guide You’ll Ever Need
- Dask Delayed – How to Parallelize Your Python Code With Ease
- How to Create PDF Reports With Python – The Essential Guide
- Become a Data Scientist in 2021 Even Without a College Degree
Stay Connected
- Follow me on Medium for more stories like this
- Sign up for my newsletter
- Connect on LinkedIn