The world’s leading publication for data science, AI, and ML professionals.

4 Less-Known Yet Very Functional Pandas Operations

Discover more Pandas.

Photo by Kym Ellis on Unsplash

Pandas, the most widely used data analysis and manipulation library, provides numerous functions and methods to work with data. Some are used more frequently than others because of the tasks they perform.

In this post, we will cover 4 pandas operations that are less frequently used but still very functional.

Let’s start with importing NumPy and Pandas.

import numpy as np
import pandas as pd

1. Factorize

The factorize function provides a simple way to encode categorical variables, which is a required step for most machine learning techniques.

Here is a categorical variable from a customer churn dataset.

df = pd.read_csv('/content/Churn_Modelling.csv')
df['Geography'].value_counts()
France     5014 
Germany    2509 
Spain      2477 
Name: Geography, dtype: int64

We can encode the categories (i.e. convert to numbers) with just one line of code.

df['Geography'], unique_values = pd.factorize(df['Geography'])

The factorize function returns the encoded values (an array of integer codes) along with an Index of the unique categories.

df['Geography'].value_counts()
0    5014 
2    2509 
1    2477 
Name: Geography, dtype: int64
unique_values
Index(['France', 'Spain', 'Germany'], dtype='object')
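The returned Index also lets us map the codes back to the original labels. A minimal sketch with a made-up list of countries:

```python
import pandas as pd

# Factorize a small hypothetical list: codes are assigned in order of appearance
codes, uniques = pd.factorize(['France', 'Spain', 'France', 'Germany'])

# Indexing the uniques with the codes recovers the original labels
recovered = uniques[codes]
```

This round trip is handy when you need to report predictions in terms of the original category names.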

If there are missing values in the original data, they are encoded with a sentinel value, -1 by default. Older pandas versions let you change it with the na_sentinel parameter; note that na_sentinel was deprecated in pandas 1.5 and removed in 2.0 in favor of the boolean use_na_sentinel parameter.

A = ['a','b','a','c','b', np.nan]
A, unique_values = pd.factorize(A)
array([ 0,  1,  0,  2,  1, -1])
A = ['a','b','a','c','b', np.nan]
A, unique_values = pd.factorize(A, na_sentinel=99)
array([ 0,  1,  0,  2,  1, 99])

2. Categorical

It can be used to create a categorical variable.

A = pd.Categorical(['a','c','b','a','c'])

The categories attribute is used to access the categories:

A.categories
Index(['a', 'b', 'c'], dtype='object')

We can only assign values from one of the existing categories. Otherwise, pandas raises an error (a ValueError or TypeError, depending on the pandas version).

A[0] = 'd'
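A quick sketch of that behavior, catching the error so the script keeps running (the exact exception type has varied across pandas versions, so both are caught):

```python
import pandas as pd

A = pd.Categorical(['a', 'c', 'b', 'a', 'c'])

# 'd' is not among the existing categories, so the assignment is rejected
try:
    A[0] = 'd'
    error_raised = False
except (TypeError, ValueError):
    error_raised = True
```

To actually add a new value, extend the categories first (e.g. with add_categories) and then assign.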

We can also specify the data type using the dtype parameter. The default is CategoricalDtype, which is actually the best one to use because of its low memory consumption.

Let’s do an example to compare memory usage.

countries = pd.Categorical(df['Geography'])
df['Geography'] = countries

We can check the memory usage of each column in bytes with df.memory_usage(deep=True). The converted column uses roughly 8 times less memory than the original object column. The amount of memory saved grows further on larger datasets, especially when there are very few distinct categories.
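Since the churn dataset is not bundled here, the same comparison can be sketched with a synthetic column (the column name, size, and values are made up):

```python
import numpy as np
import pandas as pd

# A hypothetical stand-in for the Geography column: many rows,
# only three distinct values - the ideal case for the category dtype
rng = np.random.default_rng(0)
geo = pd.Series(rng.choice(['France', 'Germany', 'Spain'], size=10_000))

# deep=True counts the actual string storage, not just pointers
object_bytes = geo.memory_usage(deep=True)
category_bytes = geo.astype('category').memory_usage(deep=True)
```

With only three categories, the categorical version stores each row as a small integer code plus one copy of each label, which is where the savings come from.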


3. Interval

It returns an immutable object representing an interval.

iv = pd.Interval(left=1, right=5, closed='both')
3 in iv
True
5 in iv
True

The closed parameter indicates if the bounds are included. The values it takes are "both", "left", "right", and "neither". The default value is "right".

iv = pd.Interval(left=1, right=5, closed='neither')
5 in iv
False
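Intervals can also be checked against each other. A short sketch with the overlaps method (the interval bounds are made up):

```python
import pandas as pd

iv1 = pd.Interval(left=1, right=5)   # (1, 5], closed='right' by default
iv2 = pd.Interval(left=4, right=8)   # (4, 8], shares (4, 5] with iv1
iv3 = pd.Interval(left=6, right=9)   # (6, 9], disjoint from iv1

first_pair = iv1.overlaps(iv2)
second_pair = iv1.overlaps(iv3)
```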

The interval comes in handy when we are working with date-time data. We can easily check if the dates are in a specified interval.

date_iv = pd.Interval(left = pd.Timestamp('2019-10-02'), 
                      right = pd.Timestamp('2019-11-08'))
date = pd.Timestamp('2019-10-10')
date in date_iv
True

4. Wide_to_long

The wide_to_long function converts wide dataframes to the long format. This task can also be done with the melt function; wide_to_long offers a less flexible but more user-friendly way.

Consider the following sample dataframe.

It contains different scores for some people. We want to reshape this dataframe so that the score types are represented in rows rather than as separate columns. For instance, there are 3 score types under A (A1, A2, A3). After the conversion, there will be only one column (A), and the types (1, 2, 3) will be represented as row values.
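A small dataframe of this shape can be built as follows (the names and score values are made up for illustration):

```python
import pandas as pd

# Three people, three score types under A and three under B
df = pd.DataFrame({
    'names': ['John', 'Jane', 'Emily'],
    'A1': [1, 2, 3], 'A2': [4, 5, 6], 'A3': [7, 8, 9],
    'B1': [2, 4, 6], 'B2': [1, 3, 5], 'B3': [9, 8, 7],
})
```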

pd.wide_to_long(df, stubnames=['A','B'], i='names', j='score_type')

The stubnames parameter indicates the names of the new columns that will contain the values. The column names in the wide format need to start with these stubnames. The i parameter is the column to be used as the id variable, and the j parameter names the column that holds the subcategories.

The returned dataframe has a multi-level index but we can convert it to a normal index by applying the reset_index function.

pd.wide_to_long(df, stubnames=['A','B'], i='names', j='score_type').reset_index()

Pandas owes its success and predominance in data science and machine learning to the variety and flexibility of its functions and methods. Some perform basic tasks, while others are detailed and highly specific.

There are usually multiple ways to accomplish a task with Pandas, which makes it easy to fit a solution to a specific problem.

Thank you for reading. Please let me know if you have any feedback.

