
Pandas for Data Science: A Beginner’s Guide to Missing Values, Part II

Learn how to deal with missing values in Python data science projects with Pandas using drop(), drop_duplicates(), and isna().

In Pandas Essentials I, I described how to begin analyzing and manipulating data in Pandas with Python. In this tutorial, I will explain how to continue working with the Metal Bands by Nation data set by addressing missing values.

In the first section, Dealing with Missing Values, I provide some background on missing values. If you are just here for the code snippets, skip ahead to the second section, Data Analysis Using a Jupyter Notebook.

I. Dealing with Missing Values

Data is often imperfect. One of the most glaring imperfections, and one that can wreak havoc on your project, is missing values.

A. What is a missing value?

In tabular data (data arranged in a schema of rows and columns), when a column contains rows that are blank, the values in those blank cells are considered missing.

While there are some common procedures for handling missing values, every data set is unique, and it is important to assess your data before deciding how to handle them. In this tutorial, I will explore some of the most basic ways to correct missing values with Pandas in Python.

B. What are NaNs?

When the data is imported into Python via Pandas, the placeholder that appears in an empty cell's location is generally NaN (Not a Number). For datetime data types, the placeholder NaT (Not a Time) is used instead.
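As a quick illustration (a self-contained toy frame, not the Metal Bands data), you can see both placeholders by building a small data frame with gaps:

import pandas as pd
import numpy as np

toy = pd.DataFrame({
    "fans": [1000, np.nan, 250],                       # numeric column: the gap appears as NaN
    "formed": pd.to_datetime(["1990", None, "2001"]),  # datetime column: the gap appears as NaT
})
print(toy)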

C. Why is the data missing?

Data can be missing for a variety of reasons. These include:

  1. Human/Machine Error – A person or instrument made a mistake during data collection or entry. (e.g. Temperatures for three days were not recorded due to an issue with a sensor.)
  2. Non-response – No response was provided. (e.g. Survey respondents who chose not to disclose their income.)
  3. Attrition – Data is unavailable for some observations because they are no longer available for study. (e.g. Participants who drop out of a longitudinal study prevent certain metrics from being recorded in the latter parts of the study.)
  4. Redaction – Data is only accessible to users with specific permissions. (e.g. Diagnoses are only included for patients who have signed the HIPAA form.)

In this tutorial, we are focusing on the basic mechanics of cleaning up missing values to improve the integrity of your data. However, during later phases of the project, it will be important to consider the deeper underlying nature of the missing values within the context of the domain of your investigation.

D. How is the data missing?

Missing data falls into three main types:

  1. Missing Completely at Random (MCAR)
  2. Missing at Random (MAR)
  3. Missing Not at Random (MNAR)

Take a look at this resource to learn more about the different types of missing values.

E. Why are missing values a problem?

Missing values can skew summary statistics, shrink the sample available for analysis, and cause errors in downstream calculations and models. Keep reading the next section to see some examples of different kinds of missing values, the problems associated with them, and how you can handle them.

F. Identifying Missing Values

In the previous tutorial, I discussed how .info() can be used to get a quick summary of the data:

df.info()

The output shows that not many values are explicitly missing.

II. Data Analysis Using a Jupyter Notebook

  1. Return to your project folder in your Terminal. Enter the following:
$ jupyter notebook

When Jupyter launches, select your project notebook. In the previous tutorial, we did some preliminary analysis using Pandas methods. To re-execute that code and start where we left off, select Kernel > Restart & Run All from the menu at the top of the page.

A. Drop a Column with Pandas

Dropping an entire column because some values are missing is a heavy-handed approach to handling missing data. In this case, however, we will drop a column because it duplicates the index and adds nothing to the analysis.

  1. In the third cell of the notebook, we viewed the top five rows of the data frame. One thing we can do to clean and improve it is to get rid of the "Unnamed: 0" column, which appears to be a duplicate of the index.

  2. To get rid of that column, we can use drop(). This method allows us to drop either rows or columns from a data frame. Enter:
df = df.drop("Unnamed: 0", axis=1) 

Here is how this breaks down: we use = to reassign our df variable to a modified version of itself. On the right side of the assignment, drop() is applied to the data frame.

For this to work, keep in mind the following:

  • We need to provide drop() with the necessary arguments, in this case the name of the column ("Unnamed: 0") and the axis.
  • The default value of axis is 0, which refers to rows. If you do not specify an axis, Pandas will look for a row labeled "Unnamed: 0" and, finding none, raise a KeyError (illustrated just after this list).
  • To indicate that "Unnamed: 0" should be removed from the columns, pass axis=1.
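For example, calling drop() on this data frame without the axis argument fails, because Pandas searches the row labels instead of the column names:

df.drop("Unnamed: 0")  # raises a KeyError: no row is labeled "Unnamed: 0"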

Keep in mind that if you want to drop more than one column at a time, put the column names in a list (square brackets), like this:

df = df.drop(["Unnamed: 0", "split", "another_column"], axis=1)

B. Dropping Duplicates with Pandas

  1. Since this is a data set of bands by nation, we can use band name as our primary key, the unique identifying value for each observation (or row) in the data. It follows that there should be no duplicate values in the band_name column.
  2. It is likely that any duplicate band names are true duplicates, but it cannot hurt to validate our assumption that band names are indeed unique to each observation. In other data sets the identifying column may not be so obvious, so it is good practice to take a closer look at duplicates. Try this line of code in your notebook to inspect the duplicated rows (a one-line follow-up for removing true duplicates appears after the code):
df[df['band_name'].duplicated()]
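If the inspection confirms that the repeated rows really are redundant copies, drop_duplicates() removes them in one line; a minimal sketch that keeps the first occurrence of each band name (the default behavior):

df = df.drop_duplicates(subset="band_name", keep="first")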

C. Filling Missing Values with Pandas

  1. Now that we’ve made that improvement, we can look closer at the rest of the data. We can check how many nulls are in each column by chaining together the methods isna() and sum():
df.isna().sum()
  2. It seems that eight rows are missing a value in "origin". Let's examine those rows by entering:
df[df['origin'].isna()]


When we enter this line, we are asking Pandas to display the rows of df where the "origin" column has a null value. This df[mask] pattern, known as boolean indexing, lets us view just the portion of the data frame that meets a given criterion.
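The same pattern works for any condition. For instance, assuming the fans column in this data set is numeric, the line below would show only the bands with more than 1,000 fans:

df[df['fans'] > 1000]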

When dealing with missing values, discretion is key: there are a variety of approaches, each with trade-offs. Since we are trying to determine what factors influence the number of fans a metal band has, we will keep that goal in mind when deciding.

Two ways that we might deal with this case are elimination and substitution. We can eliminate the missing values by removing those rows. Since we have 5000 rows of data, our analysis will not be radically affected by losing eight rows. However, we could also minimize the amount of data lost by substituting the placeholder "Missing" for the null values. That way, we can still use the other data for those bands in our analysis.

At this small scale, the choice is fairly trivial. With a larger data set and a larger share of missing data, however, you will want to avoid throwing out too much data.
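For reference, the elimination route would be a single call; a minimal sketch, where subset restricts the drop to rows whose "origin" value is null:

df = df.dropna(subset=["origin"])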

  3. I'm going to go with the second option and replace those values with "Missing". All we need to do is use fillna() (a more targeted variant appears just after the code):
df = df.fillna("Missing")
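As written, this call fills the NaNs in every column of df, which is fine here because "origin" is the only column with nulls. To limit the substitution to a single column, you can fill just that Series instead:

df['origin'] = df['origin'].fillna("Missing")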

Great, now that we got rid of that extra column and fixed our nulls, we can begin to inspect the data types of the other columns.

  4. If you're following the Data in a Day series, save your notebook as MyProject.ipynb so we can pick up where we left off in Pandas III.

III. What Did We Do?

  1. Reopened the project in Jupyter Notebook.
  2. Used drop() to eliminate a column from a Pandas data frame.
  3. Used duplicated() to inspect potential duplicate rows in the band_name column.
  4. Chained isna() and sum() to return the number of missing values in each column.
  5. Used fillna() to fill the missing values.

IV. What is Next?

In Pandas III, we are going to analyze and transform the "split" and "formed" columns in the Metal Bands by Nation data set.

