
Advanced Tips on How to Read CSV Files into Pandas

Learn 5 tips that will save you a world of pain.

Photo by Sam Dan Truong on Unsplash

Overview of Your Journey

  1. Setting the Stage
  2. Importing CSV Files the Plain Vanilla Way
  3. First Tip: Import Only Selected Columns
  4. Second Tip: Handling True and False Values
  5. Third Tip: Taking a Peek at the Data
  6. Fourth Tip: Set an Index Column
  7. Fifth Tip: Handling Missing Values like a Pro
  8. Wrapping Up

1 – Setting The Stage

It is incredibly common to load data into pandas when working in data science and data analysis. There are loads of different formats to consider, like CSV files, Excel files, JSON files, raw text files, Parquet files, etc.

When it comes to CSV files, you quickly learn that Pandas can import them easily with the function pd.read_csv(). However, what is less known is the sheer number of optional arguments that can be passed to pd.read_csv(). This gives you great flexibility! If you think I am exaggerating, check out the following screenshot from the Pandas documentation:

Screenshot from https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

It’s pretty overwhelming, right? 😧 Don’t worry, I will teach you the 5 most important ones to know.

Without some of these useful optional arguments, you will have to do extra preprocessing after the import. This is usually slower and more computationally expensive. In short, some of these optional arguments are great timesavers that save you from boring work.

Say goodbye to needless preprocessing and say hello to efficient and clean code for importing CSV files.

Prerequisites: You should have Pandas installed and know what a CSV file is. Other than this, there are no prerequisites. Let’s go!


2 – Importing CSV Files the Plain Vanilla Way

Before you learn the cool tips, let’s make sure that you know the basic way of importing CSV files in Pandas. Let’s say that you have a CSV file called employees.csv with the following information:

Name,Age,Wage
Eirik,27,75000
John,33,85000
Janet,25,78000
Phil,23,66000

Then you can import the whole file employees.csv into Pandas with the following code snippet:

import pandas as pd
employees = pd.read_csv("filepath_to_employees.csv")

The dataframe employees you get contains the following information:

Pandas dataframe from simple CSV file.

The code above is the most basic way of using the function read_csv() in the Pandas library. It imports the whole CSV file.

This is great if the CSV file is perfectly formatted, there are no missing values, and you need all the columns and rows. In practice, however, these requirements are often more akin to daydreams than reality. Let’s see how to use optional arguments to handle trickier cases 🔥


3 – First Tip: Import Only Selected Columns

Often you only need some of the columns in a CSV file rather than all of them. One solution to this is to import the whole CSV file, and then afterwards filter away the columns as follows:

employees = pd.read_csv("filepath_to_employees.csv")
employees = employees[["Name", "Wage"]]

The dataframe employees now contains the following information:

Pandas dataframe with selected columns.

While this approach certainly works, it is inefficient in both code length and performance. You import the whole Age column without ever using it!

It is much better to tell the function read_csv() to only import the columns you need. You can do this with the optional argument usecols as follows:

employees = pd.read_csv(
  "filepath_to_employees.csv", 
  usecols=["Name", "Wage"]
)

This produces the exact same result. You avoid importing loads of data you don’t need, and you avoid writing Pandas code to filter the useless columns away. That’s two birds with one stone 😄

The argument usecols typically expects a list of column names you want to import. However, you can also pass a (typically anonymous) function that determines whether a column is selected or not. For more on this, check out the usecols optional argument in the Pandas documentation.
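
To illustrate, here is a minimal sketch (using the same hypothetical file path as above) where a lambda keeps every column except Age:

import pandas as pd

# usecols can also be a callable that is evaluated against each column name;
# only the columns for which it returns True are imported.
employees = pd.read_csv(
  "filepath_to_employees.csv",
  usecols=lambda column: column != "Age"
)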


4 – Second Tip: Handling True and False Values

Often CSV files contain values indicating that something is true or false. The problem is that this indication can come in a variety of flavours. To see an example, consider the CSV file employees.csv with the following extra column:

Name,Age,Wage,Insurance
Eirik,27,75000,Yes
John,33,85000,No
Janet,25,78000,yes
Phil,23,66000,no

There are two immediate problems with truth values in this CSV file:

  • Rather than the Python keywords True and False, the CSV file uses the words Yes and No.
  • Both lowercase and uppercase are used for Yes/yes and No/no. Maybe the data was entered manually by an HR employee? In any case, you have to account for this.

With a plain import, Pandas has no understanding of what the words Yes and No mean. The following code

employees = pd.read_csv("filepath_to_employees.csv")

simply gives the dataframe:

Dataframe with an Insurance column that is hard to work with.

It is now very cumbersome to filter the dataframe based on the values in the Insurance column.

You could manually rename the values in the column Insurance with e.g. the map() function. However, if you have come this far in the blog post, you know that this is not the way.
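
For comparison, that manual clean-up might look roughly like the sketch below, mapping both spellings to booleans after the import:

employees = pd.read_csv("filepath_to_employees.csv")

# Translate the string variants to proper booleans by hand.
employees["Insurance"] = employees["Insurance"].map(
    {"Yes": True, "yes": True, "No": False, "no": False}
)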

The best way to handle this is to use the optional arguments true_values and false_values in the function read_csv(). Both arguments simply require a list of words that should be translated to the Python keywords True and False:

employees = pd.read_csv(
    "filepath_to_employees.csv", 
    true_values=["Yes", "yes"], 
    false_values=["No", "no"]
)

The dataframe now looks like the following:

Dataframe with an Insurance column that is easy to work with.
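
With proper booleans in place, filtering becomes a one-liner. For example, selecting only the insured employees could look like this:

# Boolean indexing on the Insurance column.
insured = employees[employees["Insurance"]]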

5 – Third Tip: Taking a Peek at the Data

Photo by Dmitry Ratushny on Unsplash

This is a short one. If you have a massive CSV file, but only want to take a peek at the data, then loading the whole file is a massive time waste. Instead, you can use the optional argument nrows to specify how many rows of the CSV file should be read in.

Let’s consider the following simple CSV file numbers.csv below:

Numbers,Squares
1,1
2,4
3,9
4,16
5,25
6,36
7,49
8,64
9,81
10,100
.
.
.
1000000,1000000000000

It is a file with 1,000,000 rows. I know this is very artificial, but it serves as a nice example. The following code will only import 1,000 rows:

numbers = pd.read_csv(
  "filepath_to_numbers.csv",
  nrows=1000
)

This code only picks out the first 1,000 rows, and you get the dataframe:

The first 1,000 rows.

Often, selecting a sample like this is sufficient for exploring the data.

Note: If the data you are importing is sorted in some way (like a time series), then importing the first 1,000 rows might give a very skewed picture of the data.
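
If that is a concern, one possible workaround (a sketch, not covered by the tip above) is to sample rows at random with the optional argument skiprows, which also accepts a callable:

import random

# Keep the header (row 0) and roughly 1% of the remaining rows at random.
numbers_sample = pd.read_csv(
  "filepath_to_numbers.csv",
  skiprows=lambda i: i > 0 and random.random() > 0.01
)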

In the above example, you might also want to set the index column to be the Numbers column in the CSV file. The dataframe as it stands now has a lot of redundant information. This is the next tip!


6 – Fourth Tip: Set an Index Column

Consider the numbers dataframe

The index column and the Numbers column are essentially the same.

Sometimes you want to set a column (in this case the Numbers column) to be the index column. This happens all the time with time-series data, where the time of an event is a great column to index the data on. Pandas will by default just create its own index column. However, you can choose your own index column by using the optional argument index_col as follows:

numbers = pd.read_csv(
  "filepath_to_numbers.csv",
  nrows=1000,
  index_col="Numbers"
)

With this code, you get the following dataframe:

The Numbers column is now the index.

The optional argument index_col typically expects a string corresponding to a column name in the CSV file. However, if you pass in a list of strings, then you get a Pandas MultiIndex 👍
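
As an illustration, using the employees.csv file from earlier, you could index on both Name and Age like this:

# Passing a list of column names to index_col creates a MultiIndex.
employees = pd.read_csv(
  "filepath_to_employees.csv",
  index_col=["Name", "Age"]
)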


7 – Fifth Tip: Handling Missing Values like a Pro

The Pandas function read_csv() is not stupid. Whenever an entry in a CSV file has the value NULL, NaN, or n/a, then read_csv() will interpret it as missing. There is a full list of values that Pandas recognizes as missing in the read_csv() documentation. Pretty clever, right?
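
If you want to see this in action without creating a file, here is a small sketch that reads an in-memory CSV (via io.StringIO) containing some of the default missing-value markers:

import io
import pandas as pd

csv_text = """Name,Age,Wage
Eirik,27,75000
John,NaN,85000
Janet,25,n/a
Phil,33,NULL
"""

# NaN, n/a, and NULL are all converted to missing values automatically.
employees = pd.read_csv(io.StringIO(csv_text))
print(employees.isna().sum())  # Age: 1, Wage: 2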

However, sometimes CSV files contain missing values in subtle formats. Let’s look back at the employees.csv file that you considered earlier in the blog post:

Name,Age,Wage
Eirik,27,75000
John,33,85000
Janet,25,78000
Phil,23,66000

Let’s now say that a new employee called "Sarah" is set to start. However, her new wage has not been registered by HR yet, so it is set to zero until the correct sum is entered:

Name,Age,Wage
Eirik,27,75000
John,33,85000
Janet,25,78000
Phil,23,66000
Sarah,25,0

Obviously, Sarah’s wage should be considered a missing value since it has not been written into the systems yet. However, Pandas will not understand this and will simply treat it as if Sarah is an intern being paid in "Experience" 😡

One solution to this is to filter out the employees with wage zero after the data has been imported. But, as you surely expect by now, you can use an optional argument in the read_csv() function. Use the optional argument na_values as follows:

employees = pd.read_csv(
    "path_to_employees.csv",
    na_values=["0"]
)

Now the dataframe has correctly identified Sarah’s wage as missing:

Sarah’s wage is correctly identified as missing.

If there are more values you want to indicate as missing, then simply supply them to the list passed to the na_values argument.

Warning: If one of your other columns has valid values of zero, then this approach will mark them as missing as well. Not good. To avoid this, you can instead pass a dictionary to na_values. The keys of the dictionary are the column names in question (in your case "Wage") and the values are the missing values (in your case "0"). It’s always a good sanity check to run Pandas’ info() method to confirm that you have not introduced wrongful missing values.
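
A rough sketch of that dictionary form, followed by the info() sanity check, could look like this:

employees = pd.read_csv(
    "path_to_employees.csv",
    na_values={"Wage": ["0"]}  # only zeros in the Wage column count as missing
)

# The non-null counts per column reveal any unintended missing values.
employees.info()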


8 – Wrapping Up

Photo by Joshua Hoehne on Unsplash

I have shown you 5 tips for importing CSV files in Pandas. I hope that you learned a few new things and that you will be able to use this in future projects. If you need to know more, then check out the Pandas documentation.

Like my writing? Check out some of my other posts for more Python content:

If you are interested in Data Science, programming, or anything in between, then feel free to add me on LinkedIn and say hi ✋

