
Table of Contents
- Why use Pandas?
- Getting started
- Installing and importing Pandas
- Pandas DataFrame
- Reading files and creating a DataFrame
- Examining the DataFrame
- Manipulating data with Pandas
- DataFrame Index
- Rows and Columns
- Selecting data with loc and iloc
- Handling missing values
- Detecting missing values
- Removing missing values
- Filling missing values
- Visualizing data
- Saving to file
- Conclusion
Whether you are building complex Machine Learning models or you just want to organize your monthly budget in an Excel spreadsheet, you must know how to manipulate and analyze your data.
While many tools can get the job done, today we’re going to talk about one of the most used and beginner-friendly of them all, Pandas.
Why use Pandas?
Pandas is an open-source Python library designed to deal with Data Analysis and data manipulation. Citing the official website,
"pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language."
It is built on top of NumPy (a Python library for scientific computing) and it has several functions for cleaning, analyzing, and manipulating data, which can help you extract valuable insights about your data set. Pandas is great for working with tabular data, as in SQL tables or Excel spreadsheets.
The main data structure in Pandas is a 2-dimensional table called DataFrame. To create a DataFrame, you can import data in several formats, such as CSV, XLSX, JSON, SQL, to name a few. With some lines of code, you can add, delete, or edit data in your rows/columns, check your set’s statistics, identify and handle missing entries, etc.
Besides, as stated above, Pandas is widely used and friendly for beginners, which means you’ll find a lot of content online about it and it shouldn’t be hard to find answers to your questions.
Getting started
First of all, we need to install Pandas and there are several different environments where you can run it. If you want to run it directly in your machine, you should take a look at Anaconda, a distribution aimed at scientific computing that comes with hundreds of pre-installed packages. Anaconda can be installed in Windows, macOS, and Linux.
However, there is an easier way to get started with Pandas through your browser, using a Jupyter Notebook in the cloud. For instance, you could use IBM Watson Studio or Google Colab. Both can be used for free and come with several Python packages pre-installed.
In this article, I am using Google Colab because it is really easy to use out of the box and doesn’t require any previous setups.
Installing and importing Pandas
To install Pandas, run one of the following commands in your environment, depending on your package manager.
pip install pandas
or
conda install pandas
Note that in Google Colab we don’t need to run the commands above, since Pandas comes pre-installed.
Now, we need to import Pandas, so we can use it in our Jupyter Notebook.
import pandas as pd
We commonly import it "as pd" as a shortcut, so we don’t need to write the full word every time we need to call a Pandas function.
Pandas DataFrame
After installing and importing Pandas, let’s see how we can read a file and create a Pandas DataFrame. In this article, the data set we’ll be working on is a simplified version of a set provided in a Kaggle competition about housing prices.
The file that contains this data set is a .csv (comma-separated values) file. If you want to play with it yourself, you can find it here, in my GitHub repository.
Reading files and creating a DataFrame
To read the file into a DataFrame, we just need to input the file path as an argument in the function below:
PATH = 'https://raw.githubusercontent.com/rmpbastos/data_sets/main/kaggle_housing/house_df.csv'
df = pd.read_csv(PATH)
Notice that I used the function read_csv because we are working with a CSV file. As mentioned above, Pandas can handle several file extensions, as you can check here.
The function above read the csv file and automatically created a DataFrame from it. But if you want to create a DataFrame from a Python dict, list, NumPy array, or even another DataFrame, you can use the function below.
df = pd.DataFrame(mydict)
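As a minimal sketch (the dictionary below is made up purely for illustration):
my_dict = {'LotArea': [8450, 9600], 'SalePrice': [208500, 181500]}
df_from_dict = pd.DataFrame(my_dict)  # column names are taken from the dict keys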
Back to our housing data: let’s check the type of the DataFrame we created with read_csv.
type(df)

Examining the DataFrame
For the rest of this article, we’ll be working with the housing data set mentioned above. The next thing we should do is take a look at our DataFrame. We can check the first n entries with the function head. If n is not provided, the first 5 rows are shown by default.
df.head()

At first glance, everything looks fine. We can also check the last entries of the set with the function tail.
df.tail()

Next, let’s check the dimensions of our data using the shape attribute.
df.shape

It returns a tuple with the number of rows and columns. Our DataFrame has 1,460 rows and 16 columns, or features.
Moving on, we can view a summary of the data set with the function info.
df.info()

It shows us useful information about the DataFrame, such as column names, non-null values, dtypes, and memory usage. From this summary, we can observe that some columns have missing values, a topic we’ll see later.
The following function will give us some descriptive statistics about the dataset.
df.describe()

This function displays the count, mean, median, standard deviation, upper and lower quartiles, and minimum and maximum values for each feature. Notice that it only shows data about the numeric features (columns where the data type is int or float).
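By the way, describe accepts an include argument if you also want a summary of the non-numeric columns; for example:
df.describe(include='object')  # count, unique, top, and freq for the string columns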
Next, let me show you one more function, value_counts, before moving on to the next section.
df['Neighborhood'].value_counts()

This function returns a Series containing the count of each unique value in a column. It can also be applied to the whole DataFrame, but in the example above we applied it only to the column "Neighborhood".
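value_counts can also report proportions instead of raw counts through its normalize argument:
df['Neighborhood'].value_counts(normalize=True)  # share of houses per neighborhood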
Before we move on, let me summarize each feature of the dataset, for a better understanding.
- Id – Unique identification for each row (we’ll use it as our index).
- LotArea – Lot size in square feet
- Street – Type of road access
- Neighborhood – Physical location of the house
- HouseStyle – Style of residence
- YearBuilt – Construction date
- CentralAir – Central air conditioning
- BedroomAbvGr – Number of bedrooms above basement level
- Fireplaces – Number of fireplaces
- GarageType – Garage location
- GarageYrBlt – Year garage was built
- GarageArea – Size of garage in square feet
- PoolArea – Pool area in square feet
- PoolQC – Pool quality
- Fence – Fence quality
- SalePrice – House price
Our dataset contains data of different types, such as numerical, categorical, and boolean, but we won’t dive into these concepts, as they are out of scope here.
Now, let’s start manipulating our DataFrame.
Manipulating data with Pandas
Pandas provides us with several tools for handling data. In this section, we’ll see how to manipulate rows and columns as well as locate and edit values in our table. Let’s start by setting an index for our DataFrame.
DataFrame Index
After checking our data, we noticed that the first column (Id) has a unique value for each row. We can take advantage of this and use that column as our index, in place of the default integer index created when we set up the DataFrame.
df.set_index('Id', inplace=True)
df.index

When the argument inplace is set to True, the DataFrame is updated in place. Otherwise, with inplace=False (the default), the method returns a modified copy of the DataFrame.
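As a quick sketch of the default behavior (assuming a freshly loaded DataFrame, where Id is still a regular column):
df_fresh = pd.read_csv(PATH)
df_indexed = df_fresh.set_index('Id')  # returns a new, indexed DataFrame
df_fresh.index  # still the default RangeIndex; df_fresh was not modified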
If you know beforehand that you are going to use a column in your data set as the index, you can set it up when reading the file, as below.
df = pd.read_csv(PATH, index_col='Id')
Let’s check out how the set looks after setting the index.

The data set looks a bit cleaner now! Moving on, let’s talk about rows and columns.
Rows and Columns
As you have noticed, DataFrames are tabular data structures containing rows and columns. In Pandas, a single column is called a Series. We can easily list the columns and access them with the code below.
df.columns

df['LotArea'].head()

type(df['LotArea'])

Notice that a column in our DataFrame is of type Series.
Renaming columns
It is really simple to rename your columns with Pandas. For instance, let’s take our feature BedroomAbvGr, and rename it to Bedroom.
df.rename(columns={'BedroomAbvGr': 'Bedroom'}, inplace=True)
You can rename several columns at once. Just add all the old/new names as key/value pairs in the columns dictionary passed to rename, as sketched below.
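A sketch of a multi-column rename (the new names below are made up for illustration, and we work on a copy so the tutorial’s DataFrame stays unchanged):
df_renamed = df.rename(columns={'GarageYrBlt': 'GarageYear',
                                'PoolQC': 'PoolQuality'})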
Adding columns
You might want to add a column to your DataFrame. Let’s see how you can do that.
I’ll take the opportunity to show you how we can create a copy of our DataFrame. Let’s add a column to our copy so we don’t change the original DataFrame.
df_copy = df.copy()
df_copy['Sold'] = 'N'
This is the easiest way to create a new column. Notice that I assigned a value "N" for all entries in this column.
Check the image below with the new column added.

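A new column can also be computed from existing ones. As a quick sketch (done on a throwaway copy, with a made-up column name, so the tutorial’s DataFrame stays as-is):
demo = df.copy()
demo['PricePerSqFt'] = demo['SalePrice'] / demo['LotArea']  # element-wise division of two columns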
Adding rows
Now, suppose you have another DataFrame (df_to_append) containing 2 rows that you want to add to df_copy. One way to append rows to the end of a DataFrame is with pd.concat. (Older tutorials use the DataFrame.append method, but it was deprecated in pandas 1.4 and removed in pandas 2.0.)
data_to_append = {'LotArea': [9500, 15000],
'Street': ['Pave', 'Gravel'],
'Neighborhood': ['Downtown', 'Downtown'],
'HouseStyle': ['2Story', '1Story'],
'YearBuilt': [2021, 2019],
'CentralAir': ['Y', 'N'],
'Bedroom': [5, 4],
'Fireplaces': [1, 0],
'GarageType': ['Attchd', 'Attchd'],
'GarageYrBlt': [2021, 2019],
'GarageArea': [300, 250],
'PoolArea': [0, 0],
'PoolQC': ['G', 'G'],
'Fence': ['G', 'G'],
'SalePrice': [250000, 195000],
'Sold': ['Y', 'Y']}
df_to_append = pd.DataFrame(data_to_append)
df_to_append

Now, let’s append the 2-row DataFrame above to df_copy.
df_copy = pd.concat([df_copy, df_to_append], ignore_index=True)
Checking the last entries of df_copy, we have:
df_copy.tail(3)

Removing rows and columns
To eliminate rows and columns of a DataFrame, we can use the function drop. Let’s assume we want to delete the last row and the column ‘Fence’. Check out the code below.
df_copy.drop(labels=1461, axis=0, inplace=True)
The code above removed the last row (the one whose index label is 1461; note that ignore_index reset the index to a plain integer range, so these labels no longer match the original Id values). You can also drop several rows at once by passing a list of labels as an argument.
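For instance, a sketch of a multi-row drop (on a copy, so df_copy is preserved, and assuming the integer labels exist after the index reset):
df_demo = df_copy.copy()
df_demo.drop(labels=[0, 1, 2], axis=0, inplace=True)  # drops the first three rows by label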
In drop, axis=0 (the default) means you are removing rows. To remove columns, we need to specify axis=1, as below.
df_copy.drop(labels='Fence', axis=1, inplace=True)
Selecting data with loc and iloc
One of the easiest ways to select data is with the methods loc and iloc.
loc is used to access rows and columns by label/index or based on a boolean array. Imagine we want to access the row with index = 1000.
df.loc[1000]

The method above selected the row with index = 1000 and displayed all data contained in this row. We can also select which columns we want to visualize.
df.loc[1000, ['LotArea', 'SalePrice']]

Now, let’s see how we can apply a condition to loc. Imagine we want to select all houses that have a sale price of at least $600,000.
df.loc[df['SalePrice'] >= 600000]

With a single line of code, we found that only 4 houses are priced at $600,000 or more.
iloc is used to select data based on their integer location or a boolean array. For instance, if we want to select the data contained in the first row and the first column, we have the following:
df.iloc[0,0]

The value displayed is the LotArea of the row where Id is 1. Remember that integer location is zero-based.
We can also select an entire row; in this case, the row at position 10.
df.iloc[10,:]

We can select an entire column, for instance, the last column.
df.iloc[:,-1]

And we can also select multiple rows and columns, as below.
df.iloc[8:12, 2:5]

Handling missing values
A lot could be said about dealing with missing values. Please bear in mind that the goal here is not to go deep into the subject, but to show you the tools Pandas provides for handling them.
Detecting missing values
The first step here is to find the values that are missing in your data set with the function isnull. It returns an object of the same shape as the original DataFrame, containing a boolean for each element in the set: entries such as None and numpy.NaN are marked as True. You can find them with the following line of code.
df.isnull()

Notice that the output above can be cumbersome to work with. For a really small data set it would be fine, but with thousands of rows and several columns, as in our case, it is better to sum the number of missing values per column, as follows.
df.isnull().sum()

Much better! Now we can easily see the number of missing values for each column, and we can also see that most columns are complete, which is great. The sum function adds up the values returned by isnull because True is equivalent to 1 and False to 0.
We can also check the proportion of missing values for each column:
df.isnull().sum() / df.shape[0]

Let’s take a step further and use Python to get only the columns with missing values and display the percentage of values that are missing.
for column in df.columns:
    if df[column].isnull().sum() > 0:
        print(column, ': {:.2%}'.format(df[column].isnull().sum() / df[column].shape[0]))

Removing missing values
After detecting the missing values, we need to decide what to do with them. Here, I’ll show you how to eliminate missing values.
We should be very cautious before removing a whole column or row, because taking data out of the data set can harm the analysis.
First, let’s think about the feature PoolQC. Since over 99% of the values are missing in this column, we are going to remove it. As we already saw in an earlier section, we can drop a column with the function drop.
Here, I’ll use a copy of the original DataFrame.
df_toremove = df.copy()
df_toremove.drop(labels='PoolQC', axis=1, inplace=True)
Now, let’s take a look at GarageType. Since it only has about 5% of values missing, we can simply remove the rows where we have missing values for this feature, using the function dropna.
df_toremove.dropna(subset=['GarageType'], axis=0, inplace=True)
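dropna has a few other useful arguments worth knowing about (each call below returns a new DataFrame rather than modifying df):
df.dropna(how='any')            # drop rows with at least one missing value (the default)
df.dropna(how='all')            # drop rows only if every value is missing
df.dropna(axis=1, thresh=1000)  # keep only columns with at least 1000 non-null values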
Filling missing values
When dealing with missing values, besides removing them we can fill the gaps with some non-null value. There are several techniques to help you decide which values to insert, including machine learning approaches, and I really recommend you read up on the topic; here, I’m only showing the tools Pandas provides for the job.
First, let’s take a look at the feature Fence. Notice that it has 80% of missing values. Suppose it happened because these houses just don’t have fences! So, we are filling these missing data with the string NoFence.
Once again, I’ll use a copy of the original DataFrame.
df_tofill = df.copy()
df_tofill['Fence'] = df_tofill['Fence'].fillna('NoFence')
Assigning the result back is preferred over calling fillna with inplace=True on a single column, which raises warnings and may not update the parent DataFrame in recent pandas versions.
Now, let’s check the feature GarageYrBlt. In this example, I’ll show you how to fill the entries with the median value for the column.
garage_median = df_tofill['GarageYrBlt'].median()
df_tofill.fillna({'GarageYrBlt': garage_median}, inplace=True)
Let me remind you that these examples are just for educational purposes. You might have noticed that GarageType and GarageYrBlt each had 81 missing values, probably because those houses don’t have garages at all. Removing the rows with missing GarageType and filling GarageYrBlt with some value might not be the smartest thing to do in a real-life analysis.
In fact, if I weren’t using copies of the original DataFrame, you would see that the rows we dropped for missing GarageType are the same 81 rows where GarageYrBlt is missing. This shows the importance of interpreting and knowing your data before modifying it.
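If you want to verify that overlap yourself, a one-line check on the original DataFrame does it:
(df['GarageType'].isnull() & df['GarageYrBlt'].isnull()).sum()  # rows missing both at once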
Visualizing data
In this section, I’ll talk about how to do some simple plotting with Pandas. If you want to build more elaborate charts, I recommend you take a look at 2 other Python libraries: Matplotlib and Seaborn.
Here, we’ll see two types of charts: histogram and scatter plot.
Histograms
Histograms are great to display the distribution of data. Below is a histogram of the feature SalePrice, where the x-axis contains bins that divide the values into intervals, and the y-axis is for the frequency.
df['SalePrice'].plot(kind='hist')

Scatter plots
With a scatter plot you can visualize the relationship between two variables. The chart uses Cartesian coordinates to display the values as a collection of points, where the value of one variable determines the position on the x-axis and the value of the other determines the position on the y-axis.
Let’s build a scatter plot for the variables SalePrice and YearBuilt to check if there is any relationship between them.
df.plot(x='SalePrice', y='YearBuilt', kind='scatter')

Well, we can see that there is a small positive relationship between the sale price and the year of construction.
The plot function also supports many other kinds of charts, such as line, bar, area, pie, etc.
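As a quick sketch of one of those alternatives:
df['Neighborhood'].value_counts().head(10).plot(kind='bar')  # bar chart of the ten most common neighborhoods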
Saving to file
In the last section of this article, let’s see how to save our DataFrame as a file.
Just like when we read the csv file to create our DataFrame, we can save our DataFrame in various formats. If you are using Google Colab, you can simply write the following line of code to save a csv file.
df.to_csv('My_DataFrame.csv')
The file saved can be found in the Files folder, on the upper-left side of Google Colab, like in the image below.

You can also specify a path on your machine. On Windows, for instance:
df.to_csv('C:/Users/username/Documents/My_DataFrame.csv')
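One last detail: to_csv writes the index (our Id column) as the first column by default. If you don’t want that, pass index=False:
df.to_csv('My_DataFrame_no_index.csv', index=False)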

Conclusion
I hope this article helped you get a grasp of what you can do with Pandas. After some practice, it gets quite effortless to manipulate your data.
Pandas is widely used in Data Science. Many data scientists make use of it to manipulate data before building machine learning models, but you can benefit from it even in simpler tasks you would do in Excel.
Nowadays, at work, I do many of the activities I used to do in Excel with Python and Pandas instead. The learning curve may be a little steeper, but the potential increase in your productivity pays off, not to mention that Python is an excellent tool to have under your belt!
For the full code, please refer to the notebook.