A comprehensive practical guide

Pandas is a highly popular data analysis and manipulation library. It provides numerous functions to transform raw data into a more useful or appropriate format for data analysis and machine learning pipelines.
Real-life data is almost always messy and requires a lot of preprocessing to be converted into a nice and clean format. Thanks to its versatile and powerful functions, Pandas expedites the data wrangling process.
In this article, we will cover 7 operations that we are likely to encounter in a typical data wrangling process.
We will use the Melbourne housing dataset available on Kaggle for the examples. We first read the csv file using the read_csv function.
import numpy as np
import pandas as pd
melb = pd.read_csv("/content/melb_data.csv")
print(melb.shape)
(13580, 21)
melb.columns
Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')
The dataset contains 21 features about 13580 houses in Melbourne.
1. Handling dates
Dates are usually stored as objects or strings. The Date column in our dataset is stored as object.
melb.Date.dtypes
dtype('O')
In order to use the datetime-specific functions of Pandas, we need to convert the dates to an appropriate format. One option is to use the to_datetime function.
# Before converting
melb.Date[:2]
0 3/12/2016
1 4/02/2016
Name: Date, dtype: object
melb['Date'] = pd.to_datetime(melb['Date'])
# After converting
melb.Date[:2]
0 2016-03-12
1 2016-04-02
Name: Date, dtype: datetime64[ns]
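One caveat worth noting: for ambiguous day/month strings, to_datetime assumes month-first by default. If the source data is day-first (common outside the US), passing dayfirst=True, or an explicit format, avoids silently swapped dates. A minimal sketch on a synthetic series:

```python
import pandas as pd

# Ambiguous day/month strings in a synthetic series
dates = pd.Series(["3/12/2016", "4/02/2016"])

# Default parsing treats the first number as the month
default = pd.to_datetime(dates)

# dayfirst=True treats the first number as the day instead
day_first = pd.to_datetime(dates, dayfirst=True)

print(default[0])    # 2016-03-12 (March 12)
print(day_first[0])  # 2016-12-03 (December 3)
```

It is worth double-checking which convention your data uses before converting, since both parses succeed without an error.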
2. Changing data types
In addition to dates, we may need to do some other data type conversions as well. A typical case that needs conversion is integers stored as floats. For instance, the Propertycount column in our dataset is stored as float but it should be an integer.
The astype function can be used to do data type conversions.
# Before converting
melb['Propertycount'][:2]
0 4019.0
1 4019.0
Name: Propertycount, dtype: float64
melb['Propertycount'] = melb['Propertycount'].astype('int')
# After converting
melb['Propertycount'][:2]
0 4019
1 4019
Name: Propertycount, dtype: int64
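One gotcha to be aware of: astype('int') raises an error if the column contains missing values, because NaN cannot be represented by the plain int64 type. A small sketch with a synthetic series, using Pandas' nullable Int64 dtype (a general Pandas feature, not specific to this dataset):

```python
import numpy as np
import pandas as pd

s = pd.Series([4019.0, np.nan, 3280.0])

# s.astype('int') would raise: cannot convert float NaN to integer.
# The nullable Int64 dtype keeps the missing value as <NA> instead:
converted = s.astype('Int64')

print(converted.dtype)     # Int64
print(converted.isna()[1]) # True
```

With a column like Propertycount that has no missing values, plain 'int' works fine; 'Int64' is the safer choice when you are not sure.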
3. Replacing values
Another common operation is to replace values. The Type column contains 3 distinct values, which are 'h', 'u', and 't'. We can make these values more informative by replacing them with what they represent.
The replace function can be used to accomplish this task.
# Before converting
melb.Type.unique()
array(['h', 'u', 't'], dtype=object)
melb['Type'] = melb['Type'].replace({
    'h': 'house', 'u': 'unit', 't': 'town_house'
})
# After converting
melb.Type.unique()
array(['house', 'unit', 'town_house'], dtype=object)
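A closely related option is the map function, which also accepts a dictionary but turns any value missing from it into NaN, whereas replace leaves unmatched values untouched. A quick synthetic comparison:

```python
import pandas as pd

# 'x' is deliberately absent from the mapping
s = pd.Series(['h', 'u', 't', 'x'])
mapping = {'h': 'house', 'u': 'unit', 't': 'town_house'}

print(s.replace(mapping).tolist())  # ['house', 'unit', 'town_house', 'x']
print(s.map(mapping).tolist())      # ['house', 'unit', 'town_house', nan]
```

Use replace when you only want to rename some values, and map when every value should come from the dictionary.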
4. Category data type
A typical dataset contains both numerical and categorical columns. The categorical columns are usually stored with the object data type. If the number of distinct categories is very small compared to the number of rows, we can save a substantial amount of memory by using the category data type.
Our dataset contains 13580 rows. The number of categories in the type column is 3. Let’s first check the memory consumption of this column.
melb.Type.memory_usage()
108768 # in bytes
We will convert it to the category data type and check the memory consumption again.
melb['Type'] = melb['Type'].astype('category')
melb['Type'].memory_usage()
13812
It went down from 108768 bytes to 13812 bytes which is a significant decrease.
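The saving comes from storing each row as a small integer code plus a one-off table of the category labels. A rough illustration on synthetic data (exact byte counts depend on the Pandas version and the index, so only the relative sizes matter here):

```python
import pandas as pd

# 30,000 rows but only 3 distinct values
s = pd.Series(['h', 'u', 't'] * 10_000)
cat = s.astype('category')

# deep=True also counts the memory of the Python string objects
print(s.memory_usage(deep=True))
print(cat.memory_usage(deep=True))

# The underlying integer codes that replace the strings:
print(cat.cat.codes[:3].tolist())
```

The more rows per distinct value, the bigger the saving; with high-cardinality columns the category dtype can even cost more than object.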
5. Extracting information from dates
In some cases, we may need to extract a particular part from dates such as weekday, month, year, and so on. We can use the functions under the dt accessor to extract pretty much any piece of information about a date.
Let’s do a couple of examples.
# Extract month
melb['Month'] = melb['Date'].dt.month
melb['Month'][:5]
0 3
1 4
2 4
3 4
4 4
Name: Month, dtype: int64
# Extract weekday
melb['Date'].dt.weekday[:5]
0 5
1 5
2 0
3 0
4 2
Name: Date, dtype: int64
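Beyond month and weekday, the dt accessor exposes attributes such as year, day, quarter, and the day_name method. A brief sketch with synthetic dates:

```python
import pandas as pd

d = pd.Series(pd.to_datetime(["2016-03-12", "2016-04-02"]))

print(d.dt.year.tolist())        # [2016, 2016]
print(d.dt.quarter.tolist())     # [1, 2]
print(d.dt.day_name().tolist())  # ['Saturday', 'Saturday']
```

The weekday values 5 and 0 in the output above correspond to Saturday and Monday, since weekday counts from Monday as 0.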
6. Extracting information from text
Textual data usually contains multiple pieces of information. Just like we have done with dates, we may need to extract a piece of information from a text. The str accessor of Pandas provides numerous functions to perform such operations efficiently.
Let’s take a look at the address column.
melb.Address[:5]
0 85 Turner St
1 25 Bloomburg St
2 5 Charles St
3 40 Federation La
4 55a Park St
Name: Address, dtype: object
The last characters represent the type of location. For instance, "st" stands for street and "dr" stands for drive. It can be a useful piece of information for grouping the addresses.
We can extract the last part of the address by splitting the strings at space character and taking the last split. Here is how we do this operation with the str accessor.
melb['Address'].str.split(' ').str[-1]
0 St
1 St
2 St
3 La
4 St
Name: Address, dtype: object
The split function, as the name suggests, splits a string at the specified character, which is a space in our case. The next str is used for accessing the pieces after splitting, and the index -1 selects the last one.
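An alternative to splitting is str.extract with a regular expression, which pulls out a matching group directly. Here is a small sketch using a pattern that captures the final whitespace-separated token:

```python
import pandas as pd

addr = pd.Series(['85 Turner St', '40 Federation La'])

# (\S+)$ captures the last run of non-space characters
last_token = addr.str.extract(r'(\S+)$')[0]

print(last_token.tolist())  # ['St', 'La']
```

extract returns a DataFrame with one column per capture group, which is convenient when a single pattern pulls out several fields at once.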
7. Standardizing the textual data
In many cases, we do comparisons based on textual data. A typical problem with such comparisons is the lack of a standard format for strings. For instance, two identical words may not be detected as equal if one starts with a capital letter and the other does not.
To overcome this issue, we should standardize the strings. We can make them all upper case or all lower case with the upper and lower functions of the str accessor, respectively.
melb.Address.str.upper()[:5]
0 85 TURNER ST
1 25 BLOOMBURG ST
2 5 CHARLES ST
3 40 FEDERATION LA
4 55A PARK ST
Name: Address, dtype: object
Another option is to capitalize the strings, which makes the first letter upper case and the rest lower case.
melb.Suburb.str.capitalize()[:5]
0 Abbotsford
1 Abbotsford
2 Abbotsford
3 Abbotsford
4 Abbotsford
Name: Suburb, dtype: object
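Lower-casing both sides before comparing is a simple way to make matches case-insensitive. A minimal sketch on a synthetic series:

```python
import pandas as pd

s = pd.Series(['Abbotsford', 'ABBOTSFORD', 'richmond'])

# A direct comparison misses differently-cased duplicates
print((s == 'abbotsford').sum())              # 0

# Standardizing first catches both spellings
print((s.str.lower() == 'abbotsford').sum())  # 2
```

The same idea applies before groupby or value_counts, where 'Abbotsford' and 'ABBOTSFORD' would otherwise be counted as two different suburbs.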
Conclusion
We have covered 7 typical operations that are likely to be encountered in a data wrangling process. Pandas provides efficient and versatile solutions for all of them.
There are, of course, many other issues we might face during data cleaning and preprocessing. It is not possible to cover all of them in one article. However, I’m pretty sure Pandas has a solution for most of the tasks you need to handle.
Thank you for reading. Please let me know if you have any feedback.