Pandas Data Wrangling Cheat Sheet 2021

Lorentz Yeung
Towards Data Science
9 min read · Apr 11, 2021


Photo by KOBU Agency on Unsplash

To excel at data analysis, data science, or machine learning in Python, Pandas is a library you need to master. Here is a cheat sheet of some of the most-used syntax that you probably don’t want to miss.

The Pandas package is the most essential tool for Data Science and Analysis work in Python nowadays. The powerful machine learning and glamorous visualization tools may have drawn your attention; however, you won’t get far if you don’t have good Pandas skills.

So today I have gathered some of the most-used basic Pandas functions for your reference. I will start with dataframe creation, then move on to data reshaping, string lookups, value sorting, etc.

Content

  1. Dataframe Creation
  2. Column Rename
  3. Data Insertion
  4. Data Deletion
  5. Dealing with Null Values
  6. Reshaping
  7. Basic Data Observation
  8. Querying / Retrieving Data
  9. Bonus: Dealing with Big Data

Dataframe Creation

I believe it is worthwhile to mention the basic structure of a Pandas dataframe. When it comes to data, in most cases you will have variables and the observed values of those variables. In Pandas, each variable is stored as a column, while all the observations related to that variable are stored as rows. For example, if you have 20 observations for one variable A, in Pandas it will be presented as “Column A” with 20 rows of data.

Alright, let’s start!

  • Create empty dataframe
import pandas as pd
# These are the 3 basic territories in Hong Kong. They are Hong Kong, Kowloon, and the New Territories.
col = ['香港', '九龍', '新界']
df = pd.DataFrame(columns=col)
df

Returns:

an empty dataframe with the three columns and no rows.
  • Create dataframe with lists.

Personally, I prefer creating a dataframe in this way.

col = ['香港', '九龍', '新界']
ind = [1990, 2000, 2010]
Data = [['98731', '34444', '99384'], ['34323', '44243', '88543'], ['2222', '4324', '3432']]
df = pd.DataFrame(Data, columns=col, index=ind)
df

Returns:

  • Create dataframe with dictionaries.
Data = {'A': [98731, 34444, 99384], 'B': [34323, 44243, 88543], 'C': [2222, 4324, None]}
df = pd.DataFrame(Data)
df

Use None to represent an empty value in Pandas; it will show up as NaN in the dataframe.

Rename a Column

df = df.rename(columns={"A": "香港"})
df

Returns:

Insertion

  • Column insertion
# inserting column D
d = ["x", "y", "z"]
df["D"] = d
df

Returns:

  • Row insertion
# inserting the 4th row
new_row = {"A": 98989, "B": 77889, "C": None, "D": "z"}
df = df.append(new_row, ignore_index=True)
df
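A note for newer Pandas versions: DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0. Here is a minimal sketch of the same row insertion done with pd.concat instead, using a small stand-in dataframe with the same columns:

```python
import pandas as pd

# a small stand-in dataframe with the same columns as above
df = pd.DataFrame({"A": [98731], "B": [34323], "C": [2222], "D": ["x"]})

# append() is gone in pandas 2.0; wrap the new row in a
# one-row DataFrame and concatenate instead
new_row = {"A": 98989, "B": 77889, "C": None, "D": "z"}
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
print(df)
```

ignore_index=True renumbers the rows 0, 1, … just like the append() call did.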

Deletion

  • drop a column
# to drop several columns, you can use a list,
# e.g. df.drop(["C", "B"], axis=1)
# to make a permanent change, add inplace=True
# dropping column C only
df.drop("C", axis=1)
  • Drop a row
# to make a permanent change, add inplace=True
# dropping a row with index "0"
df.drop(0)
  • Drop duplicates
df.drop_duplicates()

Our df dataframe remains the same, as we don’t have a duplicate here.

Deal with Null Values

More often than not, we have null (empty) values in our data. Usually, we will either take those empty values away, or substitute it with some other meaningful values, e.g. the mean of the column.

  • Check out null values
pd.isnull(df)

We can see there are 2 null values in column C.

  • Locate null value horizontally
# checking column by column, from A to D
pd.isnull(df).any(axis=0)
  • Locate null value vertically
# checking row by row, from index 0 to 3
pd.isnull(df).any(axis=1)
  • Fill all the NaN with some meaningful values
df2 = df.copy()
df3 = df2.fillna(0)  # fill with 0
df4 = df2.fillna(df.mean(numeric_only=True))  # fill with the mean of each numeric column
df5 = df2.fillna(df.B.max())  # fill with the max of column B

In this case, it seems most reasonable to fill the NaN values in column C with the mean of that column.
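A runnable sketch of that column-specific fill, with a small stand-in frame (fillna(df.mean()) would fill every numeric column at once, so here we target column C alone):

```python
import pandas as pd

# stand-in data: column C has one missing value
df = pd.DataFrame({"B": [34323, 44243, 88543],
                   "C": [2222.0, 4324.0, None]})

# fill column C's NaN with that column's own mean
# (mean() skips NaN by default, so the mean here is 3273.0)
df["C"] = df["C"].fillna(df["C"].mean())
print(df)
```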

Reshaping

Another common problem a data scientist has to deal with is the shape of the data. Different machine learning models have different data-shape requirements, so we have to reshape the data to fit the model. For example, my favorite LSTM model takes in 3-dimensional data, so we often need to change a 2-D data set into a 3-D one.
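As a minimal sketch of that 2-D to 3-D conversion (with hypothetical data: 6 observations of 2 features, split into 2 samples of 3 timesteps each, since LSTM layers typically expect a (samples, timesteps, features) shape):

```python
import pandas as pd

# hypothetical 2-D data: 6 observations of 2 features
df = pd.DataFrame({"a": range(6), "b": range(6, 12)})

# cut 6 rows into 2 samples of 3 timesteps each:
# (6, 2) becomes (2, 3, 2) = (samples, timesteps, features)
arr = df.to_numpy().reshape(2, 3, 2)
print(arr.shape)  # (2, 3, 2)
```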

  • Melt all columns into one
# gather all columns and all observations into one column
pd.melt(df2)
  • Concat — stacking up

This is similar to UNION ALL in SQL syntax. It stacks up values with the same column names.

# similar to UNION ALL in SQL
pd.concat([df, df])  # join vertically
  • Concat — side by side
# join horizontally
pd.concat([df, df], axis=1)
  • Groupby

This is a very popular technique that appears in many programming languages. In the example below, we group the data by the letters in column D and show only the sum of column B for each group.

# SELECT D, SUM(B) FROM df GROUP BY D
df.groupby("D")["B"].sum()
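A self-contained sketch of the groupby with a small stand-in frame, plus .agg() for computing several statistics per group at once:

```python
import pandas as pd

# stand-in data: letters in D, numbers in B
df = pd.DataFrame({"D": ["x", "y", "z", "z"],
                   "B": [1, 2, 3, 4]})

# one aggregate per group, as above
print(df.groupby("D")["B"].sum())  # x: 1, y: 2, z: 7

# several aggregates at once with .agg()
print(df.groupby("D")["B"].agg(["sum", "mean", "count"]))
```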

Basic Data Observation

After all the data munging and transforming, it is time to check what we have done.

  • Print the first few rows
# the default is 5 if you leave the bracket empty
# we are showing the first 3 rows only
df.head(3)
  • Print the last few rows
# the default is 5 if you leave the bracket empty
# we are showing the last 2 rows only
df.tail(2)
  • Randomly print a few rows
# randomly sampling only 50% of the data set
df.sample(frac=0.5)
  • Print a concise summary of a DataFrame.
df.info()
  • Generate descriptive statistics

This is a handy yet simple way to visually locate which variables have outliers.

# only describes the numerical columns; column D is a string column, so it is not shown
df.describe()

Query / Retrieving data

After getting the shape/form of dataframe we require, we will want to retrieve information from the data. The query function is one of the most-used functions in Pandas for data retrieval.

  • Sorting

Pandas shares some of SQL’s characteristics as well. In sorting, the default is ascending, similar to SQL’s ORDER BY.

# sort values according to Hong Kong
# default is ascending
df.sort_values("香港", ascending=False)
  • Reset index

Very often, after sorting, we have to reset the index to the row numbers. This is the syntax for it.

df.reset_index()
  • One condition
# returns a dataframe with one condition: B has to be greater than or equal to 50000
df.query("B >= 50000")

The syntax below yields the same result, just retrieved differently.

df[df["B"] >= 50000]

Instead of printing the entire dataframe, the following retrieves column C only.

df.C[df["B"] >= 50000]

Only 2 rows of column C are printed.

  • Multiple conditions — “and”
# returns a dataframe with 2 conditions: B has to be greater than or equal to 50000, and 香港 has to be smaller than 99000
df[(df["B"] >= 50000) & (df["香港"] < 99000)]  # use &, not "and"
  • Multiple conditions — “or”
# returns a dataframe with 2 conditions: B has to be greater than or equal to 50000, or 香港 has to be smaller than 49000
df[(df["B"] >= 50000) | (df["香港"] < 49000)]  # use |, not "or"
  • String matches

To find a string occurrence in a dataset, we can use str.find() or str.contains(). The str.find() method returns an integer value: if the substring exists inside the string, it returns the index of the first occurrence; if it doesn’t, it returns -1.

The str.contains() method returns Boolean values, True or False.

# within column D, check if "z" exists
print(df["D"].str.find("z"))
print(df["D"].str.contains("z", case=False))
  • Select data with loc and iloc
# return entire rows
# with a conditional statement
df.loc[df['D'] == 'z']
# return the entire first row
df.loc[0]
# select one specific value by row label and column label
df.loc[0, "D"]
# select a specific value by row index and column index
df.iloc[0, 3]
  • Query with regex

There are many native functions in Pandas supporting regex, namely, count(), replace(), contains(), extract(), findall(), match(), split(), and rsplit().

# searching for rows with letters in column D
df[df['D'].str.contains(r'[a-z]+')]  # r'\D' also works here

It returns the entire dataframe, as all the values in column D are letters.

  • The largest and smallest
# return the 2 rows with the largest values in column B
df.nlargest(2, "B")
# return the 2 rows with the smallest values in column C
df.nsmallest(2, "C")

Bonus: Dealing with Big Data:

Microsoft Excel can only handle about 1,000 rows of data without lagging, and once a data set goes beyond 10k rows it freezes often. Pandas handles 1 million rows without any problem (at least my experience tells me so; my equipment is only an old MacBook Pro with 16GB of RAM). But what about big data? What about data bigger than 1GB, or a file so huge you can’t load it into memory at all? Of course you can use Dask or Vaex, but we will stay with Pandas this time.

For big data in Pandas, we will use the chunksize parameter to load only part of the file into memory at any given time. Note that with chunksize set, read_csv returns an iterator over dataframes rather than a single dataframe.

reader = pd.read_csv('data.csv', chunksize=10000)
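A runnable sketch of chunked processing, using a small in-memory CSV as a stand-in for a huge data.csv (summing a column without ever holding the whole file in memory):

```python
import io
import pandas as pd

# in-memory CSV standing in for a huge 'data.csv':
# one column B with the numbers 0..99
csv = io.StringIO("B\n" + "\n".join(str(i) for i in range(100)))

total = 0
# read_csv with chunksize yields one DataFrame per chunk,
# so only 25 rows are in memory at a time
for chunk in pd.read_csv(csv, chunksize=25):
    total += chunk["B"].sum()
print(total)  # 4950
```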

Great! Now that we have had a quick revision of the basic operations of Pandas, I hope this cheat sheet gives you more confidence when dealing with dataframes.

For easy revision, I have put everything mentioned above into a Jupyter notebook file. Please feel free to download: Pandas Data Wrangling Cheat Sheet 2021 revision file.

You may be interested in this article as well: Fundamental Marketing Analytics

Thank you very much for reading my article. I have been writing about digital marketing, data analysis, analytics, and Python as well. Please click the links below to drop by and check them out.

Digital Marketing: https://elartedm.com/marketing-blog/

Another Digital Marketing and Machine Learning blog: https://positivehk.com/category/%E5%B...

Data Science: https://lorenzoyeung-io.medium.com/


Data Analyst at Microsoft, Founder of El Arte Design and Marketing, Certified Digital Marketer, MSc in Digital Marketing, London based.