
Pandas & Python Tricks for Data Science & Data Analysis – Part 5

This is the fifth part of my Pandas & Python Tricks

Photo by Andrew Neel on Unsplash

Introduction

A couple of days ago, I shared some Python and Pandas tricks to help Data Analysts and Data Scientists quickly learn new valuable concepts that they might not be aware of. This is also part of the collection of tricks I share daily on LinkedIn.

Pandas

Combine SQL statements and Pandas

My gut feeling is that more than 80% of Data Scientists use Pandas in their daily work.

I believe this is because Pandas is part of the wider Python ecosystem, which makes it accessible to many people.

What about SQL? Even though not everyone uses it daily (not every company necessarily has a SQL database), SQL’s performance is undeniable. It is also human-readable, which makes it easy to understand even for non-technical people.

❓ What if we could find a way to combine the benefits of both Pandas and SQL statements?

✅ This is where pandasql comes in handy 🎉🎉🎉

Below is an illustration 💡 You can also watch the full video here.

Update data of a given dataframe with another dataframe

There are multiple ways of replacing missing values 🧩 in Pandas, from simple imputation to more advanced methods.

But … 🚨

Sometimes, you just want to replace them using non-NA values from another DataFrame.

✅ This can be achieved using the built-in update() method of a Pandas DataFrame.

It aligns both DataFrames on their index and columns before performing the update, and it modifies the first DataFrame in place.

General syntax ⚙️ below:

first_dataframe.update(second_dataframe)

✨ Values in first_dataframe are replaced with the non-missing values from second_dataframe at the matching index and column labels.

✨ overwrite=True, the default, overwrites first_dataframe’s existing values with second_dataframe’s non-missing values. With overwrite=False, only the missing values of first_dataframe are replaced.

Here is an illustration 💡
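As a minimal sketch of both modes, assuming two small made-up DataFrames (first_df, second_df, and their columns are illustrative names, not a real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical data: first_df has one missing score, second_df has one too.
first_df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy"], "score": [10.0, np.nan, 30.0]}
)
second_df = pd.DataFrame({"score": [11.0, 22.0, np.nan]})

# overwrite=False: only the missing values of the first DataFrame are filled.
only_missing = first_df.copy()
only_missing.update(second_df, overwrite=False)
print(only_missing["score"].tolist())  # [10.0, 22.0, 30.0]

# Default (overwrite=True): every non-missing value from second_df wins.
overwritten = first_df.copy()
overwritten.update(second_df)
print(overwritten["score"].tolist())  # [11.0, 22.0, 30.0]
```

Note that update() works in place and returns None, which is why the sketch copies first_df before each call.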

From unstructured to structured data

Data preprocessing is full of challenges 🔥

Imagine you have this data with candidates’ information in the following format:

‘Adja Kone: has Master in Statistics and is 23 years old’

‘Fanta Traore: has PhD in Statistics and is 30 years old’

Then, your task is to generate a table with the following information per candidate for further analysis:

✨ The first and last name

✨ The degree and field of study

✨ The age

🚨 Performing such a task can be daunting 🤯

✅ This is where the str.extract() method in Pandas can help!

It is a powerful text-processing method that uses a regular expression to extract structured information from unstructured textual data.

Below is an illustration 💡
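One possible way to do this, sketched with the two example sentences above. The regular expression and the column names (first_name, last_name, degree, field, age) are my own assumptions, and the pattern only covers text that follows exactly this format:

```python
import pandas as pd

candidates = pd.Series(
    [
        "Adja Kone: has Master in Statistics and is 23 years old",
        "Fanta Traore: has PhD in Statistics and is 30 years old",
    ]
)

# Each named group (?P<name>...) becomes a column in the result.
pattern = (
    r"(?P<first_name>\w+) (?P<last_name>\w+): has "
    r"(?P<degree>\w+) in (?P<field>\w+) and is (?P<age>\d+) years old"
)

table = candidates.str.extract(pattern)
table["age"] = table["age"].astype(int)  # extracted values are strings
print(table)
```

Rows that do not match the pattern come back as missing values, so it is worth checking for them on real data before converting types.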

Perform multiple aggregations with the agg() function

Sometimes you want to apply multiple aggregation functions such as sum, average, count … on one or multiple columns.

✅ You can combine the groupby() and agg() functions from Pandas in one line of code.

Here is a scenario 🎬 👇🏽

Let’s imagine a students’ dataset containing information about:

✨ Students’ areas of study

✨ Their grades

✨ The graduation year and the age of each student

And you have been asked to compute the following information per area of study and year:

→ The number of students

→ The average grade

→ The average age

Below is an illustration 💡 for solving the scenario.
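Here is a small sketch of the scenario, assuming a made-up students table (the values and the column names area, year, grade, and age are hypothetical). It uses named aggregation, where each output column is defined as a (column, function) pair:

```python
import pandas as pd

# Hypothetical student records.
students = pd.DataFrame(
    {
        "area": ["Math", "Math", "Math", "Physics", "Physics"],
        "year": [2021, 2021, 2022, 2021, 2021],
        "grade": [80, 90, 70, 60, 100],
        "age": [22, 24, 23, 25, 27],
    }
)

# One groupby() + agg() chain computes all three statistics per group.
summary = (
    students.groupby(["area", "year"])
    .agg(
        n_students=("grade", "count"),
        avg_grade=("grade", "mean"),
        avg_age=("age", "mean"),
    )
    .reset_index()
)
print(summary)
```

In Pandas the average is called "mean", so that is the name passed to agg().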

Select observations between two specified times

When working with time series data, you might want to select observations between two specified times for further analysis.

✅ This can be quickly achieved using the between_time() method, which works on data indexed by a DatetimeIndex and filters on the time of day.

Below is an illustration 💡
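A minimal sketch, assuming a made-up table of temperature readings indexed by timestamp (the readings name and its values are hypothetical):

```python
import pandas as pd

# Hypothetical readings spread over two days.
readings = pd.DataFrame(
    {"temperature": [18.5, 19.0, 21.2, 22.8, 20.1]},
    index=pd.to_datetime(
        [
            "2023-01-01 08:00",
            "2023-01-01 10:30",
            "2023-01-01 12:00",
            "2023-01-02 11:15",
            "2023-01-02 17:45",
        ]
    ),
)

# Keep rows whose time of day falls between 10:00 and 12:00, on any date.
# Both endpoints are included by default.
late_morning = readings.between_time("10:00", "12:00")
print(late_morning)
```

Because the filter looks only at the time of day, rows from both dates are kept as long as they fall inside the window.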

Python

Check if all elements meet a certain condition

❌ Combining for loops and if statements is not always the most elegant way to write Python code.

For instance, let’s say that you want to check if all the elements of an iterable meet a certain condition.

There are two options:

1️⃣ Use a for loop and an if statement.

OR

2️⃣ Use the all() built-in function.

Below is an illustration 💡
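The two options can be sketched side by side. The grades list and the passing threshold of 60 are made up for the example:

```python
grades = [72, 85, 90, 66]

# Option 1: for loop and if statement.
all_passed_loop = True
for grade in grades:
    if grade < 60:
        all_passed_loop = False
        break

# Option 2: all() with a generator expression.
all_passed = all(grade >= 60 for grade in grades)

print(all_passed_loop, all_passed)  # True True
```

Like the loop with break, all() stops at the first element that fails the condition.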

Check if any element meets a certain condition

Similarly to the previous case, you might want to check if at least one element of an iterable meets a certain condition.

✅ In that case, use the any() built-in function, which is more elegant than a for loop with an if statement.

The usage mirrors the all() case above.
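For completeness, a short sketch with a made-up list of grades (the thresholds are arbitrary):

```python
grades = [72, 85, 90, 66]

# True as soon as one element satisfies the condition.
has_top_grade = any(grade >= 90 for grade in grades)

# False because no element satisfies the condition.
has_failure = any(grade < 60 for grade in grades)

print(has_top_grade, has_failure)  # True False
```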

Avoid nested for loops

Writing nested for loops is almost inevitable when your program becomes bigger and more complicated.

❌ This can also make your code difficult to read and maintain.

✅ A better alternative is often the product() function from the built-in itertools module.

Below is an illustration 💡
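A small sketch comparing both approaches. Note that product() comes from the itertools module of the standard library, so it needs an import. The sizes and colors lists are made up:

```python
from itertools import product

sizes = ["S", "M"]
colors = ["red", "blue"]

# Nested for loops.
nested = []
for size in sizes:
    for color in colors:
        nested.append((size, color))

# Same pairs, in the same order, with a single loop over product().
flat = []
for size, color in product(sizes, colors):
    flat.append((size, color))

print(flat)  # [('S', 'red'), ('S', 'blue'), ('M', 'red'), ('M', 'blue')]
```

This pattern fits when the loops are independent of each other. If the inner loop depends on the outer variable, nested loops are still the natural choice.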

Automatically handle index in a list

Imagine you have to access elements in a list and their indexes at the same time.

One way of doing it is to manually handle the index in a for loop.

✅ Instead, you can use the enumerate() built-in function.

This has two main benefits (that I can think of).

✨ First, it automatically handles the index variable.

✨ Second, it makes the code more readable.

Below is an illustration 💡
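Here is a minimal sketch of both versions, using a made-up list of fruits:

```python
fruits = ["apple", "banana", "cherry"]

# Manual index handling.
manual = []
index = 0
for fruit in fruits:
    manual.append(f"{index}: {fruit}")
    index += 1

# enumerate() yields (index, element) pairs, so no counter is needed.
automatic = []
for index, fruit in enumerate(fruits):
    automatic.append(f"{index}: {fruit}")

print(automatic)  # ['0: apple', '1: banana', '2: cherry']
```

If you prefer counting from 1, enumerate() accepts a start argument, for example enumerate(fruits, start=1).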

Conclusion

Thank you for reading! 🎉 🍾

I hope you found this list of Python and Pandas tricks helpful! Keep an eye on this page, because the content will be updated with more tricks on a daily basis.

Also, if you like reading my stories and wish to support my writing, consider becoming a Medium member. With a $5-a-month commitment, you unlock unlimited access to stories on Medium.

Would you like to buy me a coffee ☕️? → Here you go!

Feel free to follow me on Medium, Twitter, and YouTube, or say Hi on LinkedIn. It is always a pleasure to discuss AI, ML, Data Science, NLP, and MLOps stuff!

Before you leave, find the previous four parts of this series below:

Pandas & Python Tricks for Data Science & Data Analysis – Part 1

Pandas & Python Tricks for Data Science & Data Analysis – Part 2

Pandas & Python Tricks for Data Science & Data Analysis – Part 3

Pandas & Python Tricks for Data Science & Data Analysis – Part 4

