
Pandas, the most widely used data analysis and manipulation library, provides numerous functions and methods for working with data. Some are used more frequently than others because of the tasks they perform.
In this post, we will cover 4 pandas operations that are used less frequently but are still very useful.
Let’s start with importing NumPy and Pandas.
import numpy as np
import pandas as pd
1. Factorize
It provides a simple way to encode categorical variables, which is a required step for most machine learning algorithms.
Here is a categorical variable from a customer churn dataset.
df = pd.read_csv('/content/Churn_Modelling.csv')
df['Geography'].value_counts()
France 5014
Germany 2509
Spain 2477
Name: Geography, dtype: int64
We can encode the categories (i.e. convert to numbers) with just one line of code.
df['Geography'], unique_values = pd.factorize(df['Geography'])
The factorize function returns the converted values along with an index of categories.
df['Geography'].value_counts()
0 5014
2 2509
1 2477
Name: Geography, dtype: int64
unique_values
Index(['France', 'Spain', 'Germany'], dtype='object')
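As a quick sanity check (not shown in the original post), the returned codes and Index can be combined to reconstruct the original values:

```python
import pandas as pd

# Factorize a small list of labels; codes are integers,
# uniques is an Index of the distinct categories in order of appearance
codes, uniques = pd.factorize(['France', 'Spain', 'France', 'Germany'])

# Taking the uniques at the code positions reconstructs the original values
restored = uniques.take(codes)
print(list(codes))
print(list(restored))
```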
If there are missing values in the original data, you can specify a value to be used for them. The default value is -1.
A = ['a','b','a','c','b', np.nan]
A, unique_values = pd.factorize(A)
array([ 0, 1, 0, 2, 1, -1])
A = ['a','b','a','c','b', np.nan]
A, unique_values = pd.factorize(A, na_sentinel=99)
array([ 0, 1, 0, 2, 1, 99])
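Note that in newer pandas versions the na_sentinel parameter was deprecated (in 1.5) and removed (in 2.0) in favor of the boolean use_na_sentinel, so a custom sentinel like 99 is no longer supported there. A sketch under that assumption:

```python
import numpy as np
import pandas as pd

values = ['a', 'b', 'a', 'c', 'b', np.nan]

# use_na_sentinel=True (the default): missing values get the code -1
codes, uniques = pd.factorize(values, use_na_sentinel=True)
print(list(codes))  # the last code is -1, marking the NaN

# use_na_sentinel=False: NaN is encoded like any other category
codes2, uniques2 = pd.factorize(values, use_na_sentinel=False)
print(list(codes2))
```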
2. Categorical
It can be used to create a categorical variable.
A = pd.Categorical(['a','c','b','a','c'])
The categories attribute is used to access the categories:
A.categories
Index(['a', 'b', 'c'], dtype='object')
We can only assign values that are among the existing categories. Otherwise, we will get a ValueError.
A[0] = 'd'
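One way around this (assuming you actually want 'd' as a valid value) is to register the new category first with the add_categories method:

```python
import pandas as pd

A = pd.Categorical(['a', 'c', 'b', 'a', 'c'])

# Register 'd' as a known category; the assignment then succeeds
A = A.add_categories('d')
A[0] = 'd'
print(A)
```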

We can also specify the data type using the dtype parameter. The default is CategoricalDtype, which is actually the best one to use because of its low memory consumption.
Let’s do an example to compare memory usage. The memory_usage method shows the memory usage in bytes for each column. After checking the original object column, we convert it to a categorical:
countries = pd.Categorical(df['Geography'])
df['Geography'] = countries
Checking again, the converted column uses 8 times less memory than the original feature. The amount of memory saved increases further on larger datasets, especially when we have very few categories.
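As an illustration with synthetic data (the churn dataset is not reproduced here), we can compare memory_usage before and after the conversion:

```python
import pandas as pd

# A string column with only three distinct values, repeated many times
s = pd.Series(['France', 'Germany', 'Spain'] * 10_000)

obj_bytes = s.memory_usage(deep=True)                     # object dtype
cat_bytes = s.astype('category').memory_usage(deep=True)  # categorical dtype

# The categorical column stores 1-byte codes plus the three category strings,
# so it is far smaller than the column of repeated strings
print(obj_bytes, cat_bytes)
```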
3. Interval
It returns an immutable object representing an interval.
iv = pd.Interval(left=1, right=5, closed='both')
3 in iv
True
5 in iv
True
The closed parameter indicates if the bounds are included. The values it takes are "both", "left", "right", and "neither". The default value is "right".
iv = pd.Interval(left=1, right=5, closed='neither')
5 in iv
False
Intervals come in handy when we are working with datetime data. We can easily check if a date falls within a specified interval.
date_iv = pd.Interval(left = pd.Timestamp('2019-10-02'),
right = pd.Timestamp('2019-11-08'))
date = pd.Timestamp('2019-10-10')
date in date_iv
True
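Interval objects also expose a few handy attributes and methods, such as left, right, length, and overlaps:

```python
import pandas as pd

iv1 = pd.Interval(left=1, right=5, closed='both')
iv2 = pd.Interval(left=4, right=8, closed='both')

print(iv1.length)         # right - left
print(iv1.overlaps(iv2))  # the two intervals share the point 4
```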
4. Wide_to_long
Wide_to_long converts wide dataframes to long ones. This task can also be done with the melt function, but wide_to_long offers a less flexible yet more user-friendly way.
Consider the following sample dataframe.
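The original dataframe was shown as a screenshot; a small dataframe with the same structure (the names and score values here are assumptions for illustration) might look like:

```python
import pandas as pd

# Hypothetical wide-format data: one row per person, with score types
# A1..A3 and B1..B3 stored as separate columns
df = pd.DataFrame({
    'names': ['John', 'Jane'],
    'A1': [1, 4], 'A2': [2, 5], 'A3': [3, 6],
    'B1': [7, 10], 'B2': [8, 11], 'B3': [9, 12],
})
print(df)
```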

It contains different scores for some people. We want to reshape this dataframe so that the score types are represented in rows rather than as separate columns. For instance, there are 3 score types under A (A1, A2, A3). After we convert the dataframe, there will be only one column (A), and the types (1, 2, 3) will be represented as row values.
pd.wide_to_long(df, stubnames=['A','B'], i='names', j='score_type')

The stubnames parameter indicates the names of the new columns that will contain the values; the column names in the wide format need to start with these stubnames. The i parameter is the column to be used as the id variable, and the j parameter is the name of the column that contains the subcategories.
The returned dataframe has a multi-level index but we can convert it to a normal index by applying the reset_index function.
pd.wide_to_long(df, stubnames=['A','B'], i='names', j='score_type').reset_index()

Pandas owes its success and predominance in data science and machine learning to the variety and flexibility of its functions and methods. Some perform basic tasks, while others are detailed and more specific.
There are usually multiple ways to accomplish a task with Pandas, which makes it easy to fit a solution to a specific need.
Thank you for reading. Please let me know if you have any feedback.