
Tips and Tricks for Loading Large CSV Files into Pandas DataFrames – Part 1

Learn how to read large CSV files to minimize memory usage as well as loading time

Photo by Jordan Harrison on Unsplash

Most datasets you use in the real world are humongous: they come in gigabytes and contain millions of rows. In this article, I will discuss some techniques that you can employ when dealing with large CSV datasets.

When dealing with large CSV files, there are two main concerns:

  • The amount of memory used in loading large CSV files.
  • The amount of time spent in loading large CSV files.

Ideally, you want to minimize the memory footprint of your DataFrame as well as the time taken to load it. In this article, I will walk you through the relevant techniques using a sample dataset.

Our Dataset

For this article, I will use the Japanese Trade Statistics data available from https://www.kaggle.com/datasets/4e614ec846ab778f6a2ff166232d5a65f5e6786b4f5781690588bd2cccd71cb6?resource=download.

License type: CC BY-SA 4.0

This dataset contains trade data from 1988 to 2020. It has more than 100 million rows, and the CSV file takes up a whopping 4.5 GB. This makes it an ideal dataset to illustrate the concepts in this article.

Loading the CSV File into a Pandas DataFrame

Let’s first start by loading the entire CSV file with more than 100 million rows. I am interested to see how long it takes to load the DataFrame, as well as its memory footprint:

import time
import pandas as pd

start = time.time()

df = pd.read_csv("custom_1988_2020.csv")

print(time.time() - start, ' seconds')
display(df)
display(df.info())

The output is as shown:

The total memory footprint is a whopping 6.8 GB (!) and it took me 30 seconds to load it into a Pandas DataFrame.

For reference, I am using a Mac Studio with 32 GB of memory.
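If you want to measure a DataFrame's footprint yourself, memory_usage(deep=True) reports the exact per-column size, including the strings held in object columns (plain df.info() only shows a shallow estimate for those). A minimal sketch on a small synthetic DataFrame, not the trade dataset:

```python
import pandas as pd

# memory_usage(deep=True) returns a Series: the index plus one entry
# per column, measured in bytes (object columns are sized accurately)
df = pd.DataFrame({"a": range(1000), "b": ["x"] * 1000})
per_column = df.memory_usage(deep=True)
total_bytes = per_column.sum()
print(f"{total_bytes / 1024:.1f} KB")
```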

Examining the Columns

Let’s examine the columns in the dataframe:

df.columns

You should now realize that this CSV file does not have a header. By default, Pandas assumes that the first line of a CSV file is the header, so the first row of data has been consumed as the column names:

Index(['198801', '1', '103', '100', '000000190', '0', '35843', '34353'], dtype='object')

Loading with Headers

Since the CSV file has no header, the least you should do is use the header parameter to tell Pandas not to treat the first line as a header:

# loading with no headers specified
df = pd.read_csv("custom_1988_2020.csv", header=None)
display(df)

Pandas will now automatically name the columns starting with 0, followed by 1, and so on.
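You can verify this numbering behavior on a tiny in-memory CSV (synthetic data, standing in for the real file):

```python
import io
import pandas as pd

# A tiny headerless CSV standing in for the real file
csv_data = io.StringIO("198801,1,103\n198802,2,104\n")
df = pd.read_csv(csv_data, header=None)
print(list(df.columns))  # [0, 1, 2]
```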

From the description of the dataset on the Kaggle page (https://www.kaggle.com/datasets/4e614ec846ab778f6a2ff166232d5a65f5e6786b4f5781690588bd2cccd71cb6?resource=download), we know the meaning of the various columns:

  • ym(Year + month)
  • exp_imp(export: 1, import: 2)
  • hs9(HS code)
  • Customs
  • Country
  • Q1
  • Q2(quantity)
  • Value(in thousands of yen)

Let’s name the columns using the names parameter:

df = pd.read_csv("custom_1988_2020.csv", 
                 header=None, 
                 names=['YearMonth', 'ExportImport', 'HSCode', 'Customs', 
                        'Country', 'Q1', 'Q2_Quantity', 'Value'])
display(df)

The DataFrame now has the following column names: 'YearMonth', 'ExportImport', 'HSCode', 'Customs', 'Country', 'Q1', 'Q2_Quantity', 'Value'.

Loading Specific Columns

Since the CSV file is so large, the next question you want to ask yourself is – do you really need all the columns? To load specific columns, you can use the usecols parameter to specify the columns that you want to load:

start = time.time()

df = pd.read_csv("custom_1988_2020.csv", 
                 header=None, 
                 names=['YearMonth', 'ExportImport', 'HSCode', 'Customs', 
                        'Country', 'Q1', 'Q2_Quantity', 'Value'],                 
                 usecols = ["YearMonth", "Value"])

print(time.time() - start, ' seconds')
display(df)
display(df.info())

As you can see from the output above, the memory footprint has been reduced to 1.7GB and the time taken to load it is now reduced to 17 seconds.

The usecols parameter also supports column positions. The above can be rewritten using the column numbers 0 and 7:

start = time.time()

df = pd.read_csv("custom_1988_2020.csv", 
                 header=None, 
                 names=['YearMonth', 'ExportImport', 'HSCode', 'Customs', 
                        'Country', 'Q1', 'Q2_Quantity', 'Value'],                 
                 usecols=[0, 7])

print(time.time() - start, ' seconds')
display(df)

Note that you cannot use -1 to indicate the last column, like this:

usecols=[0, -1]   # -1 is not supported
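If you want the last column by position without hard-coding 7, one workaround is to compute the index from the names list. A sketch on a tiny in-memory CSV (synthetic data for illustration):

```python
import io
import pandas as pd

csv_data = io.StringIO("198801,1,103,100,000000190,0,35843,34353\n")
names = ['YearMonth', 'ExportImport', 'HSCode', 'Customs',
         'Country', 'Q1', 'Q2_Quantity', 'Value']

# Compute the last column's position explicitly instead of using -1
df = pd.read_csv(csv_data, header=None, names=names,
                 usecols=[0, len(names) - 1])
print(list(df.columns))  # ['YearMonth', 'Value']
```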

The usecols parameter also accepts a callable, such as a lambda function. For example, if you want to retrieve all columns except the Country column, you can use the following lambda expression:

df = pd.read_csv("custom_1988_2020.csv", 
                 header=None, 
                 names=['YearMonth', 'ExportImport', 'HSCode', 'Customs', 
                        'Country', 'Q1', 'Q2_Quantity', 'Value'],                 
                 usecols = lambda column: column not in ['Country'])
display(df)

The Country column would now be excluded from the result.

The use of a lambda function in the usecols parameter allows you to do some interesting things, such as loading only the columns whose name contains "Q", for example:

usecols = lambda column: "Q" in column

Or only the columns whose name is longer than, say, seven characters:

usecols = lambda column: len(column) > 7
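Putting the "Q" filter together into a self-contained example, again on a tiny synthetic CSV (the callable is evaluated against the names you supplied):

```python
import io
import pandas as pd

csv_data = io.StringIO("198801,1,103,100,000000190,0,35843,34353\n")
names = ['YearMonth', 'ExportImport', 'HSCode', 'Customs',
         'Country', 'Q1', 'Q2_Quantity', 'Value']

# Keep only the columns whose name contains "Q"
df = pd.read_csv(csv_data, header=None, names=names,
                 usecols=lambda column: "Q" in column)
print(list(df.columns))  # ['Q1', 'Q2_Quantity']
```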

Loading the First n Rows

In a lot of situations, you do not need all the rows in the entire CSV file. Perhaps the first 100 rows are sufficient. For this, you can use the nrows parameter to specify the first n rows that you want to load:

start = time.time()

df = pd.read_csv("custom_1988_2020.csv", 
                 header=None, 
                 names=['YearMonth', 'ExportImport', 'HSCode', 'Customs', 
                        'Country', 'Q1', 'Q2_Quantity', 'Value'],                 
                 nrows=100)

print(time.time() - start, ' seconds')
display(df[:15])
display(df.info())

From the result above, you see that it now only takes 0.1 second to load the first 100 rows and the resultant DataFrame only occupies 6.4 KB.

Skipping Rows

There are also times where you might want to skip certain rows in your CSV file. To do so, use the skiprows parameter:

df = pd.read_csv("custom_1988_2020.csv", 
                 header=None, 
                 names=['YearMonth', 'ExportImport', 'HSCode', 'Customs', 
                        'Country', 'Q1', 'Q2_Quantity', 'Value'],                 
                 skiprows=2,
                 nrows=100
)
display(df[:15])

The result above shows that the first two lines of the CSV file are skipped:

You can also skip specific rows:

df = pd.read_csv("custom_1988_2020.csv", 
                 header=None, 
                 names=['YearMonth', 'ExportImport', 'HSCode', 'Customs', 
                        'Country', 'Q1', 'Q2_Quantity', 'Value'],                 
                 skiprows=[0,2,4],
                 nrows=100
)
display(df[:15])

The above result shows that rows 0, 2, and 4 were skipped:

You can also use a range object to specify the range of rows to skip:

df = pd.read_csv("custom_1988_2020.csv", 
                 header=None, 
                 names=['YearMonth', 'ExportImport', 'HSCode', 'Customs', 
                        'Country', 'Q1', 'Q2_Quantity', 'Value'],                 
                 skiprows=range(5,10),
                 nrows=100
)
display(df[:15])

The above result shows rows 5 to 9 were skipped. The value of the skiprows parameter can also be written using a lambda function, like this:

df = pd.read_csv("custom_1988_2020.csv", 
                 header=None, 
                 names=['YearMonth', 'ExportImport', 'HSCode', 'Customs', 
                        'Country', 'Q1', 'Q2_Quantity', 'Value'],                 
                 skiprows=lambda x: 5 <= x < 10,
                 nrows=100
)

Using a lambda function, you can also skip all the even-numbered rows:

start = time.time()

df = pd.read_csv("custom_1988_2020.csv", 
                 header=None, 
                 names=['YearMonth', 'ExportImport', 'HSCode', 'Customs', 
                        'Country', 'Q1', 'Q2_Quantity', 'Value'],                 
                 skiprows=lambda x: x % 2 == 0,
                 nrows=100
)

print(time.time() - start, ' seconds')
display(df[:15])

The above result shows that all even-numbered rows were skipped:

Loading Specific Rows

Up till this point, you have learned how to load the first n rows and how to skip specific rows in a CSV file. But how about loading only specific rows? There is no dedicated parameter for that, but you can make use of the skiprows parameter to get what you want.

Using a lambda function in the skiprows parameter, you can specify which rows not to skip (which essentially means which rows you want to load):

start = time.time()

df = pd.read_csv("custom_1988_2020.csv", 
                 header=None, 
                 names=['YearMonth', 'ExportImport', 'HSCode', 'Customs', 
                        'Country', 'Q1', 'Q2_Quantity', 'Value'],                 
                 skiprows=lambda x: x not in [1,3],
                 nrows=100
)

print(time.time() - start, ' seconds')
display(df[:15])
display(df.info())

The above result shows that rows numbered 1 and 3 were retained:

The downside to this approach is that the entire CSV file must be scanned, hence the 20 seconds it took to load just two rows.

Loading the Last n Rows

The final challenge I want to discuss is how to load the last n rows of a CSV file. While it is easy to load the first n rows, loading the last n rows is not so straightforward. But you can make use of what you have learned up to this point to solve this problem.

First, count how many rows are there in the CSV file:

# read the last n rows
start = time.time()

with open('custom_1988_2020.csv') as f:
    row_count = sum(1 for line in f)

print(time.time() - start, ' seconds')
row_count

Since there are more than 100 million lines in the CSV file, it took about 10 seconds to count them. Also, remember that this CSV file has no header, so all 113,607,322 lines are actual rows of data.

Then, to load the last 20 rows, use the skiprows parameter and pass it a lambda function to skip all the rows except the last 20:

# read the last n rows
start = time.time()

df = pd.read_csv("custom_1988_2020.csv", 
                 header=None, 
                 names=['YearMonth', 'ExportImport', 'HSCode', 'Customs', 
                        'Country', 'Q1', 'Q2_Quantity', 'Value'],                 
                 skiprows=lambda x: 0 <= x < row_count - 20,
                 nrows=100)

print(time.time() - start, ' seconds')
display(df)
display(df.info())

The result shows the last 20 rows loaded into the Pandas DataFrame.

Like the previous section, the downside is that the entire CSV file must be scanned during the loading process (hence the 22 seconds to load the DataFrame).
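Since a full scan is unavoidable either way, one alternative that at least avoids the separate counting pass is to stream the file once, keep only the last n lines in memory with collections.deque, and hand those lines to Pandas. A sketch on a small synthetic in-memory file (the data is a stand-in, not the trade dataset):

```python
import io
from collections import deque

import pandas as pd

names = ['YearMonth', 'ExportImport', 'HSCode', 'Customs',
         'Country', 'Q1', 'Q2_Quantity', 'Value']

# Synthetic stand-in for the CSV file: 50 headerless rows
lines = [f"19880{i % 10},1,103,100,000000190,0,{i},{i * 2}\n"
         for i in range(50)]
csv_text = "".join(lines)

# deque with maxlen keeps only the last 20 lines while streaming once;
# with a real file you would pass an open file handle instead
last_20 = deque(io.StringIO(csv_text), maxlen=20)

df = pd.read_csv(io.StringIO("".join(last_20)), header=None, names=names)
print(len(df))  # 20
```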


Summary

In this article, I have gone through quite a number of techniques to load a Pandas DataFrame from a CSV file. Often, there is no need to load the entire CSV file into the DataFrame. Not only do you save time by loading only what you need, you also save the memory needed to hold your DataFrame in memory. In the next article, I will show you the techniques to reduce the memory footprint of your DataFrame. Stay tuned!

