3 Pandas Functions That Will Make Your Life Easier

A practical guide with worked-through examples

Photo by Sarah Dorweiler on Unsplash

Pandas is a prevalent data analysis and manipulation library in the data science ecosystem. It provides many versatile functions and methods for efficient data analysis.

In this article, we will cover 3 Pandas functions that will expedite and simplify the data analysis process.


1. Convert_dtypes

For an efficient data analysis process, it is essential to use the most appropriate data types for variables.

Some functions require a specific data type. For instance, the object data type is a poor fit for mathematical operations: numbers stored as text cannot be used in arithmetic as expected, and even genuine numbers held in an object column are processed more slowly. In some cases, the string data type is preferred over the object data type because it improves certain operations.

Pandas offers many options to handle data type conversions. The convert_dtypes function converts columns to the best possible data type. It is clearly more practical than converting each column separately.

Let’s create a sample dataframe that contains columns with object data type.

import numpy as np
import pandas as pd
name = pd.Series(['John','Jane','Emily','Robert','Ashley'])
height = pd.Series([1.80, 1.79, 1.76, 1.81, 1.75], dtype='object')
weight = pd.Series([83, 63, 66, 74, 64], dtype='object')
enroll = pd.Series([True, True, False, True, False], dtype='object')
team = pd.Series(['A','A','B','C','B'])
df = pd.DataFrame({
    'name':name,
    'height':height,
    'weight':weight,
    'enroll':enroll,
    'team':team
})

The data type for all columns is object, which is not the optimal choice.

df.dtypes
name      object
height    object
weight    object
enroll    object
team      object
dtype: object

We can use the convert_dtypes function as below:

df_new = df.convert_dtypes()
df_new.dtypes
name       string
height    float64
weight      Int64
enroll    boolean
team       string
dtype: object
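One practical consequence of these types is how missing values are handled. The sketch below is illustrative rather than part of the original example: it assumes a hypothetical weight column with one missing entry, which pandas stores as float64 by default, and shows convert_dtypes moving it to the nullable Int64 type.

```python
import numpy as np
import pandas as pd

# Hypothetical weight column with one missing entry (not the article's data).
# The NaN forces pandas to store the whole numbers as float64.
weight_with_gap = pd.Series([83, np.nan, 66])
print(weight_with_gap.dtype)

# convert_dtypes switches to the nullable Int64 type, which keeps the
# missing entry as <NA> instead of falling back to floats.
converted = weight_with_gap.convert_dtypes()
print(converted.dtype)
print(converted.isna().sum())
```

The capitalized Int64 in the output above is this nullable integer type, as opposed to the plain NumPy int64.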

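The data types are converted to the best possible option. The convert_dtypes function also accepts parameters, such as convert_boolean, that switch individual conversions on or off. When convert_boolean is set to False, the column is not converted to the boolean type. Depending on the pandas version, the True and False values may then be represented as 1 and 0, which can be more convenient for data analysis.

We just need to set the convert_boolean parameter to False.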

df_new = df.convert_dtypes(convert_boolean=False)
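If the goal is specifically a 1/0 encoding, an explicit cast is an alternative that does not rely on how convert_dtypes treats the column. This is a minimal sketch using the same enroll values as the sample dataframe. The variable name enroll_as_int is only for illustration.

```python
import pandas as pd

enroll = pd.Series([True, True, False, True, False], dtype='object')

# Cast to bool first, then to a plain integer type: True -> 1, False -> 0.
enroll_as_int = enroll.astype('bool').astype('int64')
print(enroll_as_int.tolist())  # [1, 1, 0, 1, 0]
```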


2. Pipe

The pipe function allows combining many operations in a chain-like fashion. It takes a function as input, along with any extra arguments for that function. To be chained, each function needs to take a dataframe as its first argument and return a dataframe.

Consider the following dataframe:

We want to do three operations as data preprocessing steps.

  • Convert height from meters to inches
  • Drop rows that have at least one missing value
  • Change the string columns to category if appropriate

Note: If the number of categories is very small compared to the total number of values, it is better to use the category data type instead of object or string. Depending on the data size, it can save a great amount of memory.
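The memory effect described in the note can be checked directly. The following is a rough sketch with a made-up column (30,000 rows, 3 distinct values). The exact byte counts depend on the pandas version and platform, so none are quoted here.

```python
import pandas as pd

# Hypothetical column: 30,000 rows but only 3 distinct values.
team = pd.Series(['A', 'B', 'C'] * 10_000, dtype='object')
team_cat = team.astype('category')

# deep=True also counts the memory held by the Python string objects.
object_bytes = team.memory_usage(deep=True)
category_bytes = team_cat.memory_usage(deep=True)
print(object_bytes, category_bytes)
```

The category version stores each distinct value once plus a small integer code per row, which is why it comes out smaller here.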

We will now define a function for each operation.

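def m_to_inc(dataf, column_name):
    # 1 inch = 0.0254 m, so dividing meters by 0.0254 gives inches
    dataf[column_name] = dataf[column_name] / 0.0254
    return dataf

def drop_missing(dataf):
    # Drop every row with at least one missing value (modifies dataf in place)
    dataf.dropna(axis=0, how='any', inplace=True)
    return dataf

def to_category(dataf):
    # Convert a string column to category when its number of distinct
    # values is under 5% of the row count
    cols = dataf.select_dtypes(include='string').columns
    for col in cols:
        ratio = len(dataf[col].value_counts()) / len(dataf)
        if ratio < 0.05:
            dataf[col] = dataf[col].astype('category')
    return dataf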

You may ask what the point is if we still need to define functions, since this does not seem to simplify the workflow. That is a fair objection for a one-off task, but we need to think more generally. Suppose you run the same operations many times. In such a case, creating a pipe makes the process easier and also provides cleaner code.

Here is a pandas pipe that combines the three operations we defined:

df_processed = (
    df
    .pipe(m_to_inc, 'height')
    .pipe(drop_missing)
    .pipe(to_category)
)

It looks neat and clean. We can add as many steps as needed. The only criterion is that the functions in the pipe should take a dataframe as their first argument and return a dataframe.
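To make the mechanics concrete, the sketch below shows that a pipe chain is equivalent to nesting the function calls, and that extra positional and keyword arguments are forwarded to each function. The helper functions add_bmi and keep_above and the small dataframe are hypothetical, introduced only for this illustration.

```python
import pandas as pd

def add_bmi(dataf, height_col, weight_col):
    # assign returns a new dataframe, so the input is left untouched
    return dataf.assign(bmi=dataf[weight_col] / dataf[height_col] ** 2)

def keep_above(dataf, column, threshold=0):
    # Keep only the rows where the column exceeds the threshold
    return dataf[dataf[column] > threshold]

people = pd.DataFrame({
    'height': [1.80, 1.79, 1.76],
    'weight': [83, 63, 66],
})

# Chained version: reads top to bottom in the order the steps run.
chained = (
    people
    .pipe(add_bmi, 'height', 'weight')
    .pipe(keep_above, 'bmi', threshold=21)
)

# Nested version: same result, but reads inside out.
nested = keep_above(add_bmi(people, 'height', 'weight'), 'bmi', threshold=21)

print(chained.equals(nested))  # True
```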

Note: The functions defined above modify the dataframe they receive in place, so running this pipe also changes the original dataframe. The pipe function itself does not make a copy. We should avoid changing the original dataset if possible. To overcome this issue, we can start the pipe from a copy of the original dataframe, for example df.copy().
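A minimal sketch of the copy approach, assuming a small hypothetical dataframe with one missing height. The two functions are repeated from the section above so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def m_to_inc(dataf, column_name):
    dataf[column_name] = dataf[column_name] / 0.0254
    return dataf

def drop_missing(dataf):
    dataf.dropna(axis=0, how='any', inplace=True)
    return dataf

# Hypothetical raw data with one missing value
raw = pd.DataFrame({
    'name': ['John', 'Jane', 'Emily'],
    'height': [1.80, np.nan, 1.76],
})

# Start the chain from a copy so the in-place steps never touch raw.
processed = (
    raw.copy()
    .pipe(m_to_inc, 'height')
    .pipe(drop_missing)
)

print(raw.shape)        # (3, 2): original rows and values are intact
print(processed.shape)  # (2, 2): the row with the missing height is gone
```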


3. Plot

Pandas is not a data visualization library but it is possible to create many basic plot types with Pandas.

The advantage of creating plots with Pandas is that we can quickly generate informative plots. The syntax is also fairly simple compared to dedicated data visualization libraries.

Consider the following dataframe that contains data about a marketing campaign.

We can easily create a histogram to see the distribution of the amount spent column.

marketing.AmountSpent.plot(kind='hist', title='Amount Spent', figsize=(8,5))

We can also create a scatter plot to visualize the relationship between the salary and amount spent columns.

marketing.plot(x='Salary', y='AmountSpent', kind='scatter',
               title='Salary vs Amount Spent',
               figsize = (8,5))
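To try both calls without the original dataset, the sketch below builds a synthetic stand-in for the marketing dataframe. The values are randomly generated and purely hypothetical. Only the column names Salary and AmountSpent come from the examples above. It assumes matplotlib is installed, since pandas uses it as the default plotting backend.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the marketing data (values are made up).
rng = np.random.default_rng(0)
salary = rng.normal(60_000, 15_000, 200)
marketing = pd.DataFrame({
    'Salary': salary,
    'AmountSpent': salary * 0.02 + rng.normal(0, 200, 200),
})

# Histogram of a single column; plot returns a matplotlib Axes object.
hist_ax = marketing.AmountSpent.plot(kind='hist', title='Amount Spent',
                                     figsize=(8, 5))

# Scatter plot of two columns from the same dataframe.
scatter_ax = marketing.plot(x='Salary', y='AmountSpent', kind='scatter',
                            title='Salary vs Amount Spent',
                            figsize=(8, 5))
```

Because each call returns a matplotlib Axes, the plots can be adjusted further with regular matplotlib methods if needed.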

There are many other plots we can easily generate by applying the plot function to a dataframe or a pandas series. In fact, Pandas is enough to cover most of the data visualizations needed in a typical data analysis process.

However, if you need more advanced or interactive visualizations, Pandas would not be the optimal choice.


Conclusion

We have covered three important functions of Pandas. All of them help to simplify the data analysis and manipulation process in some way.

There are, of course, many more functions and operations that Pandas provides, which makes it one of the most commonly used libraries for data analysis.

Thank you for reading. Please let me know if you have any feedback.

