
A Complete Pandas Glossary for Data Science

Your go-to resource to learn Pandas fundamentals

Like most others, I tried to learn Pandas through boot camps – unfortunately, the problem with boot camps is that you forget everything pretty quickly if you don’t practice what you learn!

At the same time, I found that I needed a central Pandas resource to refer to when working on personal data science projects. That's how this glossary came to fruition. Use it both to learn Pandas and as a reference!

Table of Contents

  1. Setup
  2. Creating and Reading Data
  3. Manipulating DataFrames
  4. Summary Functions
  5. Mapping Functions
  6. Grouping and Sorting
  7. Handling Missing Data
  8. Combining Data

Setup

Import Pandas library

import pandas as pd

Creating and Reading Data

Create a DataFrame

A DataFrame is simply a table made up of multiple arrays, one per column. In the example below, the code creates a table with two columns, ABC and DEF.

pd.DataFrame({'ABC':[1,2,3],'DEF':[4,5,6]},index=[1,2,3])
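To make this concrete, here is a quick sketch you can run; the column names are just the placeholders from above:

```python
import pandas as pd

# Build a small DataFrame from a dict of column lists
df = pd.DataFrame({'ABC': [1, 2, 3], 'DEF': [4, 5, 6]}, index=[1, 2, 3])

print(df.shape)          # (3, 2): 3 rows, 2 columns
print(list(df.columns))  # ['ABC', 'DEF']
```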

Create a Series

A Series is a one-dimensional sequence of values, similar to a Python list. From a visual perspective, imagine it being one column of a table.

pd.Series([1, 2, 3], index=[1, 2, 3], name='ABC')

Read a CSV file into a data frame

The most common way of getting our data. This converts a CSV file into a DataFrame.

# example
df = pd.read_csv("filename.csv", index_col=0)

Convert a DataFrame into a CSV file

Vice versa, if you want to convert a DataFrame into a CSV, you can use the code below:

# example
df.to_csv("filename.csv", index=False)  # index=False drops the row index

Determine the shape of a DataFrame

This tells you how large a DataFrame is and is in the format (rows, columns).

df.shape  # note: an attribute, not a method, so no parentheses

See the first 5 rows of a DataFrame

If you want to get a visual idea of what a DataFrame looks like, .head() returns the first 5 rows of the given DataFrame.

df.head()

See the data type of one or more columns

# For one column
df.variable.dtype
# For all columns
df.dtypes

Convert a column to another data type

This is useful if you want to convert integers to floats (or vice versa).

df.variable.astype(float)  # e.g. convert to float

Manipulating DataFrames

Selecting a Series from a Dataframe

# a) Method 1
df.property_name
# b) Method 2
df['property_name']

Indexing a Series

# get the first value in a Series
df['property_name'].iloc[0]

Index-based Selection

Index-based selection retrieves data based on its numerical position in the DataFrame, rows first, columns second. With iloc, the first number of a range is inclusive and the last number is exclusive.

# first row
df.iloc[0]
# first column, all rows
df.iloc[:, 0]

Label-based Selection

Label-based selection is another way to index a DataFrame, but it retrieves data based on index and column labels rather than numerical position. With loc, both the first and last values of a range are inclusive.

# all rows of one column, selected by label
df.loc[:, 'property_name']
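A small sketch (with made-up data) makes the inclusive/exclusive difference between iloc and loc visible:

```python
import pandas as pd

# Made-up data: four rows with integer labels 1 through 4
df = pd.DataFrame({'ABC': [10, 20, 30, 40]}, index=[1, 2, 3, 4])

by_position = df.iloc[1:3]  # positions 1 and 2: end is exclusive, 2 rows
by_label = df.loc[1:3]      # labels 1 through 3: end is inclusive, 3 rows

print(len(by_position), len(by_label))  # 2 3
```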

Set index using existing columns

Because label-based selection relies on the index of the DataFrame, you can use .set_index() to assign a column to the index.

df.set_index("variable")

Conditional Label-based Selection

We can filter out a DataFrame using Label-based selection too.

# a) Single condition
df.loc[df.property_name == 'ABC']
# b) Multiple conditions using AND (each condition needs parentheses)
df.loc[(df.property_name == 'ABC') & (df.other_property == 'DEF')]
# c) Multiple conditions using OR
df.loc[(df.property_name == 'ABC') | (df.property_name == 'DEF')]
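For instance (data made up for illustration), note that each condition must be wrapped in parentheses before combining with & or |:

```python
import pandas as pd

# Illustrative data for conditional selection
df = pd.DataFrame({'property_name': ['ABC', 'DEF', 'ABC', 'GHI'],
                   'price': [100, 200, 300, 400]})

# rows where property_name is 'ABC' OR 'DEF'
either = df.loc[(df.property_name == 'ABC') | (df.property_name == 'DEF')]
print(len(either))  # 3
```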

Select where value is in a list of values

We can use isin() to filter a DataFrame as well. If you know SQL, it’s similar to the WHERE ___ IN() statement.

df.loc[df.property_name.isin(['ABC', 'DEF'])]

Select where values are null / aren’t null

The first line of code filters a DataFrame to only show rows where the property name is null. Vice versa, the second line filters it so that the property name is not null.

df.loc[df.property_name.isnull()]
df.loc[df.property_name.notnull()]

Adding a new column

df['new_column'] = 'ABC'

Renaming a column

You’ll often want to rename a column to something easier to refer to. Using the code below, the column ABC would be renamed to DEF.

df.rename(columns={'ABC': 'DEF'})

Summary Functions

.describe()

This gives a high-level summary of a DataFrame or a variable. It is type-sensitive, meaning that its output will be different for numerical variables compared to string variables.

df.describe()
df.variable.describe()

.mean()

This returns the average of a variable.

df.variable.mean()

.unique()

This returns all of the unique values of a variable.

df.variable.unique()

.value_counts()

This shows a list of unique values and also the frequency of occurrence in the DataFrame.

df.variable.value_counts()

Mapping Functions

.map()

Mapping is used to transform an initial set of values to another set of values through a function. For example, we could use mapping to convert the values of a column from meters to centimeters or we could normalize the values.

.map() is used to transform a Series.

# e.g. convert meters to centimeters
df.numerical_variable.map(lambda x: x * 100)
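As a runnable sketch of the meters-to-centimeters example above (data made up):

```python
import pandas as pd

# A Series of lengths in meters
lengths_m = pd.Series([2, 5, 10], name='length_m')

# .map() applies the function to every value in the Series
lengths_cm = lengths_m.map(lambda x: x * 100)
print(list(lengths_cm))  # [200, 500, 1000]
```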

.apply()

.apply() is similar to .map(), except that it can also transform an entire DataFrame by applying a function column by column or row by row.

# on a Series, same idea as .map()
df.numerical_variable.apply(lambda x: x * 100)
# on a whole DataFrame, one column at a time
df.apply(max)
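A quick sketch (illustrative column names) showing .apply() on a whole DataFrame, per column and per row:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

col_sums = df.apply(sum)          # one result per column: a -> 6, b -> 15
row_sums = df.apply(sum, axis=1)  # one result per row: 5, 7, 9

print(list(col_sums), list(row_sums))  # [6, 15] [5, 7, 9]
```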

Grouping and Sorting

.groupby()

Get the count for each value of a variable (same as .value_counts())

df.groupby('variable').variable.count()

Get the min value for each value of a variable

df.groupby('variable').variable.min()

Get a summary (length, min, max) for each value of a variable

df.groupby(['variable']).variable.agg([len, min, max])
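For example (made-up data), grouping a price column by category:

```python
import pandas as pd

df = pd.DataFrame({'variable': ['x', 'x', 'y'],
                   'price': [10, 30, 20]})

# min and max price for each value of 'variable'
summary = df.groupby('variable').price.agg(['min', 'max'])
print(summary.loc['x', 'min'], summary.loc['x', 'max'])  # 10 30
```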

Multi-indexing

df.groupby(['variable_one', 'variable_two'])

Sorting a DataFrame

Sorting by one variable

df.sort_values(by='variable', ascending=False)

Sorting by multiple variables

df.sort_values(by=['variable_one', 'variable_two'])
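A small sketch of multi-column sorting (made-up data), descending on both columns:

```python
import pandas as pd

df = pd.DataFrame({'variable_one': [2, 1, 2],
                   'variable_two': [5, 9, 3]})

# sort by variable_one, breaking ties with variable_two
sorted_df = df.sort_values(by=['variable_one', 'variable_two'], ascending=False)
print(sorted_df.iloc[0].tolist())  # [2, 5]
```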

Sorting by index

df.sort_index()

Handling Missing Data

Handling missing data is one of the most important steps in exploratory data analysis (EDA). Below are a number of ways to handle missing data.

Count number of null values in each column

df.isna().sum()

Drop rows with null values

If you have a DataFrame with a large number of rows and you can afford to remove rows with null values completely, then .dropna() is a useful tool.

df.dropna()

Drop columns with null values

This is similar to above except that it drops any columns with null values rather than the rows.

df.dropna(axis=1)

Fill missing values

If you would rather fill missing values instead of removing the row or column completely, you can use the code below:

df.variable.fillna("n/a")
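Putting the counting and filling steps together in one runnable sketch (the column name is illustrative):

```python
import pandas as pd

# 'variable' has one missing value (None)
df = pd.DataFrame({'variable': ['a', None, 'b']})

n_missing = df.variable.isna().sum()
filled = df.variable.fillna('n/a')

print(n_missing, list(filled))  # 1 ['a', 'n/a', 'b']
```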

Replace values

Let’s say there’s a DataFrame where someone already filled missing values with "n/a", but you want the missing values to be filled with "unknown". Then you can use the following code below:

df.variable.replace("n/a", "unknown")

Combining Data

.concat()

This is useful when you want to combine two DataFrames that have the same columns. For example, if you wanted to combine January and February sales to analyze longer-term trends, you could use the following code:

jan_sales = pd.read_csv("jan_sales.csv")
feb_sales = pd.read_csv("feb_sales.csv")

pd.concat([jan_sales, feb_sales])
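A runnable sketch with made-up sales numbers; ignore_index=True renumbers the combined rows:

```python
import pandas as pd

jan_sales = pd.DataFrame({'sales': [100, 200]})
feb_sales = pd.DataFrame({'sales': [300]})

# stack the two DataFrames on top of each other
combined = pd.concat([jan_sales, feb_sales], ignore_index=True)
print(len(combined))  # 3
```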

.join()

If you want to combine two DataFrames that share a common key (e.g. customer_id), you can use .join().

Use the on parameter to specify which column of the left table to match against the other table's index.

Use the how parameter to choose a left, right, inner, or outer join.

# example
table_1.join(table_2, on='customer_id', how='left')
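A sketch with made-up tables: .join() matches table_1's customer_id column against table_2's index, and unmatched rows get NaN in a left join:

```python
import pandas as pd

table_1 = pd.DataFrame({'customer_id': [1, 2, 3], 'order': ['a', 'b', 'c']})
table_2 = pd.DataFrame({'city': ['NY', 'LA']}, index=[1, 2])

# left join: keep every row of table_1
joined = table_1.join(table_2, on='customer_id', how='left')
print(list(joined.city[:2]))  # ['NY', 'LA']
```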

_If you don’t know about SQL joins, read here. It’s essentially the same idea._

I hope this is helpful! Please provide comments if you feel that I missed something or if something isn’t clear. Thanks!

Thanks for Reading!

If you like my work and want to support me…

  1. The BEST way to support me is by following me on Medium here.
  2. Be one of the FIRST to follow me on Twitter here. I’ll be posting lots of updates and interesting stuff here!
  3. Also, be one of the FIRST to subscribe to my new YouTube channel here!
  4. Follow me on LinkedIn here.
  5. Sign up on my email list here.
  6. Check out my website, terenceshin.com.
