Pandas Exercise for Data Scientists — Part 2

A set of challenging Pandas Questions

Avi Chawla
Towards Data Science


Photo by ALAN DE LA CRUZ on Unsplash

The Pandas library has long been a favorite of data scientists. It is undoubtedly the go-to tool for handling, manipulating, and processing tabular data.

To help you build expertise, challenge your existing knowledge, and introduce you to several Pandas functions that are popular among data scientists, I am presenting Part 2 of the Pandas Exercise. Part 1 of the Pandas Exercise is published as a separate post.

The objective is to strengthen your logical muscle and help internalize data manipulation with one of the best Python packages for data analysis.

Find the notebook with all questions for this quiz here: GitHub.

Table of Contents:

1. The cumulative sum of a column in DataFrame
2. Assign Unique IDs to every Group
3. Check if a column has NaN values
4. Append a list as a row to a DataFrame
5. Get the first row of every unique value in a column
6. Identify the source of each row in Pandas Merge
7. Filter n-largest and n-smallest values from a DataFrame
8. Map categorical data to unique integral values
9. Add prefix to every column name
10. Convert categorical columns to one hot values

As an exercise, I recommend you attempt the questions yourself and then look at the solution I have provided.

Note that the solutions I have provided here may not be the only way to solve the problem. You may come up with something different and still be correct. However, if that happens, do drop a comment, and I’ll be interested to know your approach.

Let’s begin!

1. The cumulative sum of a column in DataFrame

Prompt: You are given a DataFrame. Your task is to generate a new column that holds the cumulative sum of the DataFrame's integer column.

Input and Expected Output:

Solution:

Here, we can use the cumsum() method on the given series and obtain the cumulative sum as shown below:
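A minimal sketch of this approach follows. The DataFrame contents and the column names col_A, col_B, and cum_sum are assumptions made up for illustration; they are not taken from the original notebook.

```python
import pandas as pd

# Hypothetical input: col_B is the integer column to accumulate.
df = pd.DataFrame({"col_A": ["A", "B", "C", "D"], "col_B": [1, 2, 3, 4]})

# cumsum() returns a Series of running totals with the same index as the input.
df["cum_sum"] = df["col_B"].cumsum()

print(df)
```

Each entry of cum_sum is the sum of col_B up to and including that row.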

P.S. Can you also try the Cumulative Product, Cumulative Maximum, and Cumulative Minimum?

2. Assign Unique IDs to every Group

Prompt: Next, you have a DataFrame in which one column has repeating values. Your task is to generate a new series so that every group gets a unique number.

Input and Expected Output:

Below, the value “A” in col_A has been assigned the value 1 in the new series. Further, for every occurrence of “A”, the value in the group_num column is always 1.

Solution:

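Here, after groupby, you can read the grouper.group_info attribute, whose first element holds an integer code for each row's group, as shown below. Note that grouper is an internal attribute that recent pandas releases have deprecated; the public ngroup() method on the GroupBy object produces the same numbering.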
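The sketch below uses the public ngroup() method on the GroupBy object instead of the internal grouper attribute, since internal attributes may not be available in every pandas version. The sample data is an assumption, and so is the "+ 1", which is only there so that "A" maps to 1 as in the description above.

```python
import pandas as pd

# Hypothetical input with repeating values in col_A.
df = pd.DataFrame({"col_A": ["A", "B", "A", "C", "B", "A"]})

# ngroup() numbers the groups from 0 in sorted key order;
# adding 1 (an assumption) makes the numbering start at 1.
df["group_num"] = df.groupby("col_A").ngroup() + 1

print(df)
```

Every row that shares a value in col_A receives the same number in group_num.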

3. Check if a column has NaN values

Prompt: For the next problem, your task is to determine whether a column contains any NaN values. You don’t need to count them; just return True or False depending on whether the column has one or more NaN values.

Input and Expected Output:

Solution:

Here, we can use the hasnans attribute of the series to get the desired result, as demonstrated below:
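As a rough illustration, assuming a small made-up DataFrame with one column that contains a NaN and one that does not:

```python
import numpy as np
import pandas as pd

# Hypothetical input: col_A contains a NaN, col_B does not.
df = pd.DataFrame({"col_A": [1.0, 2.0, np.nan], "col_B": [4, 5, 6]})

# hasnans is an attribute of a Series, so it is read without parentheses.
has_nan_a = df["col_A"].hasnans
has_nan_b = df["col_B"].hasnans

print(has_nan_a, has_nan_b)
```

An equivalent check is df["col_A"].isna().any(), which also works on an entire DataFrame column by column.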

4. Append a list as a row to a DataFrame

Prompt: Everyone knows how to push elements to a Python list (using the append method on the list). However, have you ever appended a new row to a DataFrame? For the next task, you are given a DataFrame and a list that should be appended as a new row in the DataFrame.

Input and Expected Output:

Solution:

Here, we can use loc and assign the new row to a new index of the DataFrame as shown below:
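A small sketch of this idea is given here. It assumes the DataFrame has the default integer index (0, 1, 2, ...), so that len(df) is the next unused label; the data and the name new_row are made up for illustration.

```python
import pandas as pd

# Hypothetical input with the default RangeIndex.
df = pd.DataFrame({"col_A": [1, 2], "col_B": ["A", "B"]})
new_row = [3, "C"]

# Assigning to a label that does not exist yet enlarges the DataFrame.
# With a default index, len(df) is the first unused label.
df.loc[len(df)] = new_row

print(df)
```

The list must have one element per column, in column order. If the index is not the default one, len(df) may collide with an existing label and overwrite that row instead of appending.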

5. Get the first row of every unique value in a column

Prompt: Given a DataFrame, your task is to get the entire row of the first occurrence of every unique element in the column col_A.

Input and Expected Output:

Solution:

Here, we will use GroupBy on the given column and get the first row as shown below:
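One way this could look, assuming made-up data and that first() is the GroupBy method intended here:

```python
import pandas as pd

# Hypothetical input with repeating values in col_A.
df = pd.DataFrame({
    "col_A": ["A", "B", "A", "C", "B"],
    "col_B": [1, 2, 3, 4, 5],
})

# Group on col_A and keep the first entry of every group.
# The result is indexed by the unique values of col_A.
first_rows = df.groupby("col_A").first()

print(first_rows)
```

Note that first() takes the first non-null entry of each column within a group. If the data contains NaN values and you need the literal first row, df.drop_duplicates("col_A") is an alternative that keeps the first occurrence unchanged.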

6. Identify the source of each row in Pandas Merge

Prompt: Next, consider that you have two DataFrames. Your task is to join them so that the output contains a column that denotes the source of the row from the original DataFrame.

Input and Expected Output:

Solution:

We can use the merge method and pass the indicator argument as True, as shown below:
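For illustration, here is a sketch with two made-up DataFrames. The outer join and the key column col_A are assumptions chosen so that all three indicator values appear.

```python
import pandas as pd

# Hypothetical inputs that share the key column col_A.
df1 = pd.DataFrame({"col_A": [1, 2, 3], "col_B": ["A", "B", "C"]})
df2 = pd.DataFrame({"col_A": [2, 3, 4], "col_C": ["X", "Y", "Z"]})

# indicator=True adds a "_merge" column whose values are
# "left_only", "right_only", or "both".
merged = df1.merge(df2, on="col_A", how="outer", indicator=True)

print(merged)
```

Rows found only in df1 are marked left_only, rows found only in df2 are marked right_only, and rows whose key appears in both are marked both.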

7. Filter n-largest and n-smallest values from a DataFrame

Prompt: In this exercise, you are given a DataFrame. Your task is to get the entire row whose value in col_B belongs to the top-k entries of the column.

Input and Expected Output:

Solution:

We can use the nlargest method and pass the number of top values we need from the specified column:
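A short sketch, assuming made-up data and k = 2:

```python
import pandas as pd

# Hypothetical input; col_B is the column to rank by.
df = pd.DataFrame({
    "col_A": ["A", "B", "C", "D", "E"],
    "col_B": [5, 1, 9, 3, 7],
})

# Full rows holding the 2 largest values of col_B, largest first.
top2 = df.nlargest(2, "col_B")

# The mirror-image call returns the rows with the 2 smallest values.
bottom2 = df.nsmallest(2, "col_B")

print(top2)
print(bottom2)
```

Both calls return complete rows and keep the original index labels, so the result can be traced back to the source DataFrame.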

Similarly, you can use the nsmallest method to get the rows with the k smallest values in the column.

8. Map categorical data to unique integral values

Prompt: Next, given a DataFrame, you need to map every unique entry of a column to a unique integral identifier.

Input and Expected Output:

Solution:

Using the pd.factorize function, you can generate integer-based encodings of the given column. It returns the array of integer codes together with the unique values they stand for.
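A minimal sketch follows; the data and the column name col_A_encoded are assumptions for illustration.

```python
import pandas as pd

# Hypothetical input with a categorical column.
df = pd.DataFrame({"col_A": ["A", "B", "A", "C", "B"]})

# pd.factorize returns two objects: an array of integer codes and
# the unique values, ordered by first appearance.
codes, uniques = pd.factorize(df["col_A"])
df["col_A_encoded"] = codes

print(df)
print(list(uniques))
```

The codes start at 0 and follow the order in which values first appear, so uniques[code] recovers the original value.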

9. Add prefix to every column name

Prompt: Similar to earlier tasks, you are given the same DataFrame. Your job is to rename all the columns and add “pre_” as a prefix to all of them.

Input and Expected Output:

Solution:

Here, we can use the add_prefix method and pass the string we want as a prefix in all column names as shown below:
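Sketched on a made-up DataFrame, this could look as follows:

```python
import pandas as pd

# Hypothetical input.
df = pd.DataFrame({"col_A": [1, 2], "col_B": ["A", "B"]})

# add_prefix returns a new DataFrame with renamed columns,
# so the result is assigned back to df.
df = df.add_prefix("pre_")

print(df.columns.tolist())
```

The companion method add_suffix works the same way for appending text to column names.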

10. Convert categorical columns to one hot values

Prompt: Lastly, you are given a categorical column in a DataFrame. You need to convert it to one-hot values.

Input and Expected Output:

Solution:

Here, we can use the pd.get_dummies function and pass the series as an argument, as shown below:
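A small sketch with made-up data is shown here. Passing dtype=int is an assumption added so that the output contains 0/1 integers regardless of the pandas version; without it, newer versions return boolean columns.

```python
import pandas as pd

# Hypothetical input with a categorical column.
df = pd.DataFrame({"col_A": ["A", "B", "A", "C"]})

# One indicator column per unique value; dtype=int yields 0/1 integers.
one_hot = pd.get_dummies(df["col_A"], dtype=int)

print(one_hot)
```

Each row has exactly one 1, placed in the column that matches the row's original category.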

This brings us to the end of this quiz, and I hope you enjoyed attempting it. Let me know how many you got correct. Also, in case you didn’t notice, the entire quiz is available as a Jupyter Notebook that you can download from GitHub.

Also, stick around as I intend to release many more practice exercises soon. Thanks for reading.

