
A Comprehensive Guide to Using Pandas in Python

Learn how to work with one of the most famous data manipulation libraries in Python

When you start to work with Python in the context of Data Analysis, Engineering, or Science, pandas is (likely) one of the first libraries that you will have to learn about. This incredible library enables you to manipulate two very important objects in the Python language – the one-dimensional Series and the two-dimensional DataFrame. These objects are part of a lot of data pipelines, and mastering them is crucial to start your Python career.

DataFrames are widely used throughout data science and analytics, as they enable the creation of multidimensional and multi-type objects. The goal of this post is to provide a complete guide on how to use some of the most famous pandas functions and how to work with the most important features of the library. Hopefully, after reading this guide, you will be ready to work with them on your own. It is also very common to be migrating from a SQL background, so I’ll leave a comparison with SQL code alongside some instructions throughout the post, making it easier to compare the two frameworks. But keep in mind that knowing SQL is definitely not a requirement to learn pandas!

Throughout this post, we’ll use a variety of data to learn about pandas, namely:

  • We’ll build our own pandas Series and DataFrames using object creation commands.
  • We’ll work with three datasets containing information about stock prices, available here (https://www.kaggle.com/datasets/rprkh15/sp500-stock-prices) – namely, we’ll use Ford, Apple and Abbvie stock price data.

In this post we’ll cover the most famous pandas features, namely:

  • Creating dataframes
  • Selecting rows
  • Selecting columns
  • Combining dataframes
  • Plotting data
  • Grouping data
  • Chaining functions

Without further ado, let’s start!

Importing the Library

Before we start, we need to import the pandas library into our Python environment to make sure that we can work with all the functions we will see throughout the post.

If you are new to Python, this is standard practice in the language. We must import external libraries before using them in our code, as they are not included when we install Python (on Windows) or in the base version (available on Mac or Ubuntu). This concept of a library is pretty common in open-source languages, as it is the main way the community extends the base features of the language.

Importing pandas is easy, we just need to run the following code in our Python Notebook or executable:

import pandas as pd

And now you may be asking yourself… why pandas as pd? Mostly, this is done so that we can use the library’s functions through an alias. As we need to refer to the library name every time we call one of its functions, it is easier to use a shorter alias like pd. Pandas users commonly follow this standard, and you’ll see it in a lot of code scattered around the internet.

The Pandas Series

No, Pandas do not have a TV Show!

A pandas Series is a uni-dimensional object that can store a single data type at a time. For example, let’s create a simple pandas Series with different integers using the pd.Series function:

pd.Series([10,20,30,40,50])

Let me pause here for a moment. pd.Series means that we are calling the Series method from the pd (alias for pandas!) library we have loaded in our environment. This is a standard syntax in the Python language: library.method

Our series contains 5 integers: 10, 20, 30, 40, 50. In the output of our command, we also see dtype: int64. By default, pandas stores integers as 64-bit integers, allowing for a large range of values when defining our numbers.
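For reference, printing this series should produce an output roughly like the one below – the left column is the automatic numeric index, the right column holds our values:

0    10
1    20
2    30
3    40
4    50
dtype: int64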

Apart from integers, we can also store floats in a pd.Series by passing numbers with decimal places:

pd.Series([10.1,20.2,30.4,40.6,50.2])

Notice that our dtype is now float64. Any calculation done on top of these values will keep the numeric precision – we’ll see that in a bit!

Naturally, we can also store strings in a pd.Series object:

pd.Series(['Portugal', 'Spain', 'France'])

Don’t get too confused, but in pandas, strings are stored with the object data type. In fact, most data (except time-based data types, categoricals, or booleans) that is neither integer nor float will be stored as the object data type – for example, defining a Series with nested lists:

pd.Series([[1,2], [2,3], [1,2]])

If you haven’t mastered lists in Python, don’t worry too much. This is a pretty uncommon operation, and I’ve just done this experiment to show you that the object data type is not exclusive to text data.

A caveat – watch out for situations where you define a mix of strings and numbers in a pd.Series, as this turns the entire object into strings (object):

pd.Series([1,2,'ABC'])
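Printing this series, the dtype should now show up as object, roughly like this:

0      1
1      2
2    ABC
dtype: object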

In Python, we can save our objects using the assignment operator =. Let’s save this last series in an object named series_1, so that we can conveniently call pandas methods on it later:

series_1 = pd.Series([1,2,'ABC'])

One property we can access on our pd.Series object is the dtype property. Accessing the dtype will output the data type stored in our Series:

series_1.dtype

This will print the type of data our series contains – in the series_1 example, the data type stored in it is 'O', for object.

In this first section of the pandas guide, we’ve seen the pd.Series, an object type that is, in a nutshell:

  • Unidimensional – it stores data along a single axis.
  • Able to hold only a single data type at a time.

But… we still have a lot to cover in our guide! For example, what if we want to retrieve specific elements from our pd.Series? Can we do that in pandas?

Yes, we can! And to do it, we need to rely on the concept of an index, something we will see next!

Indexing Series

One of the neat tricks of pandas is that we can attach indexes to our data and use a value to retrieve elements from our objects.

For example, let me create a pandas series with some countries and capitals:

countries_series = pd.Series(data=['Lisbon','Madrid','Paris'], index = ['Portugal','Spain','France'])

Notice that we have something new on our series – the index argument! This argument unlocks the ability to give custom indexes to our values:
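Roughly, the printed series should look like this:

Portugal    Lisbon
Spain       Madrid
France       Paris
dtype: object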

On the left side we have the index values. On the right side, we have the values stored in the pandas Series – text values with the capital of each country.

Now, imagine that we would like to retrieve the capital of Portugal from the countries_series – before using the named index, one way to do that is to provide the numeric index, in the following way:

countries_series[0]

Even if we don’t provide any custom index to the series, numeric indexing will work! (Note that recent pandas versions deprecate positional access with [] on a Series that has a custom index – countries_series.iloc[0] is the future-proof way to do it.)

But, of course, this is not very practical. What if we have 200 countries in our table? How can we retrieve the data without having to memorize the 200 indexes?

Luckily, we can also use named indexes to retrieve our data by providing the custom index inside the square brackets:

countries_series['Portugal']

This is a more meaningful way to extract data from our pandas Series. Using indexes, we can extract specific data points from our object. Another common technique in pandas is using lists to extract multiple elements at the same time:

countries_series[[0,1]]

In the example above, we are extracting two data points from our countries_series – the elements at positions 0 and 1 (the first and second elements). Remember that Python indexes start at 0.

With named indexes, we can also retrieve multiple elements by passing a list inside the square brackets:

countries_series[['Portugal','Spain']]

Getting to know the Series object is an essential step towards understanding the staple object of pandas – the DataFrame! We’ve seen that Series objects can only store data in one dimension and with a single type at a time. This is a major limitation that would prevent us from being productive in our data pipelines, and that’s why we need to study an object that is a bit more complex! Let’s see that next.

Creating DataFrames

So far, we’ve only dealt with unidimensional and single type objects. This is not very practical, of course, particularly when we want to work with more complex data.

Creating dataframes is relatively easy – we can use the pd.DataFrame function to create one:

df_example = pd.DataFrame([24, 23, 22])

My df_example contains one column with three integer rows:
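Printed, it should look roughly like the following – 0 is the automatic column name, and 0 to 2 the automatic row index:

    0
0  24
1  23
2  22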

A good thing about dataframes is that they support two dimensional objects (rows and columns). Let’s add another column to our data, something that Series do not support:

df_example = pd.DataFrame([['John','Joe','Anne'],
                          [24, 23, 22]])

Our df_example now looks like the following:
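Roughly, the printed output should be:

      0    1     2
0  John  Joe  Anne
1    24   23    22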

Uh-oh! Our pd.DataFrame command adds data row-by-row. Notice that we passed a list of lists with the data: [['John','Joe','Anne'], [24,23,22]] .

We have two ways to correct this behavior: either we transpose our DataFrame, or we change the data structure we feed to pd.DataFrame – let’s start by transposing the DataFrame:

df_example = pd.DataFrame([['John','Joe','Anne'],
                          [24, 23, 22]]).T

This operation results in the following:
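Roughly, the printed output should now be:

      0   1
0  John  24
1   Joe  23
2  Anne  22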

Calling the .T property of a DataFrame will transpose the object, switching the rows with the columns. A more meaningful way to pass data into pd.DataFrame in this format is to use the power of Python dictionaries:

df_example = pd.DataFrame({'students_name':['John','Joe','Anne'],
                          'age':[24, 23, 22]})

Here’s an explanation of what this DataFrame contains:

  • students_name will contain three names: John , Joe and Anne .
  • age will contain three ages: 24, 23 and 22.

With this object, we immediately get column names: students_name and age. The column names take the values of the keys of the dictionary we are passing.

We know that, in this table, we only have one row per student – it would be interesting to have the name of the student as our index, so that we can benefit from the index properties available in pandas objects. We can do that by declaring the student name as an index instead:

df_example = pd.DataFrame({'age':[24, 23, 22]},
                         index=['John','Joe','Anne'])

Our df_example has a new look:

Notice that students_name is no longer a column but is being used as the index of the object. This means that we can now use all the indexing properties we’ve learned before – can you guess what df_example['John'] will yield?

An error! Why? Because in a DataFrame, the square brackets select columns, not rows – we can’t index rows directly like we did with pd.Series. Instead, we can rely on the loc method:

df_example.loc['John']

We can also retrieve multiple values from our DataFrame by passing a list into loc :

df_example.loc[['John','Anne']]

In the practical example, we’ll see how we can expand selecting and indexing values to columns.

Lastly, let’s see how to index our pandas DataFrame by numeric position. For that, we need to rely on iloc:

df_example.iloc[0]

This command will index the first row of our DataFrame and it acts similarly to the named version of indexing df_example.loc['John'] .

We’ve had our introduction to Series and DataFrame – let’s now dive deeper into a more practical example using the stock price data. This will help you understand more about pandas and how to use this great library.

Reading CSVs into DataFrames

As I’ve detailed in the introduction of the blog post, we’re going to use a subset of a Kaggle dataset available here.

Let’s start by reading two CSV files into DataFrame objects. This can be achieved by accessing the pd.read_csv function:

apple = pd.read_csv('AAPL.csv')
ford = pd.read_csv('F.csv')

We’ve created two different DataFrames called apple and ford that will hold the information for each company.

Next, let’s do a small pipeline where we’ll explore the content and structure of our newly created objects!

Exploring our Newly Available DataFrames

In this chapter, we’ll look into the apple DataFrame. The first thing that most people do as soon as they have a dataframe object is to call the .head() and .tail() methods. Let’s see the output of both below:

apple.head(10)

The head command extracts the top n rows from a DataFrame and shows them in the output – n is the integer we pass inside the method (if we pass nothing, it defaults to 5).

tail will, as you may have guessed, return the last n elements of the DataFrame:

apple.tail(10)

These two commands are analogous to sorting a table in SQL and calling LIMIT. For instance, the query SELECT * FROM apple LIMIT 10 would achieve a similar result (assuming the rows are sorted by date).

When we have a DataFrame, it’s extremely common to inspect its structure using some neat pandas properties. Let’s detail some of them below:

  • .shape is a property that gives us the number of rows and columns, in the format (number of rows, number of columns). For example, apple.shape will tell us that our DataFrame contains 10,483 rows and 8 columns:

  • .columns will give us a list of our column names:

  • .index will give us the index (row labels) of the DataFrame:

We haven’t provided an index to our DataFrame (it’s still using the automatic numeric index created by the read_csv function). We’ll learn how to assign meaningful named indexes in a bit.

  • .describe() gives us a cool overview of all numeric columns:

The describe method outputs important statistics about our numeric columns, such as the mean, standard deviation, minimum, quantiles, and maximum of each distribution.
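As a quick preview of the index point above – and keeping our original apple DataFrame untouched – one way to assign a meaningful index is to promote an existing column to it (apple_by_date is just an illustrative name; the rest of this post keeps the default numeric index):

apple_by_date = pd.read_csv('AAPL.csv', index_col='Date')

# equivalently, on an already-loaded DataFrame:
apple_by_date = apple.set_index('Date')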

If you want to take a peek at the values of a certain column, you can select the column directly – there are two main ways to do this in pandas:

  • Using indexing, we pass the name of the column inside the square brackets (similarly to how we select rows in pd.Series):
apple['Volume']

We can also use the dot notation:

apple.Volume

Python programmers tend to prefer the indexing method for two main reasons: it’s hard to pass the column name as a variable with the dot notation, and the dot notation does not support column names with blank spaces.

Oh, and want to retrieve an average or a standard deviation from a specific pandas column really fast? Just call the method on top of the object! For example, to calculate the mean Volume:

apple.Volume.mean()
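The same pattern works for other common statistics – for instance:

apple.Volume.std()     # standard deviation of the traded volume
apple['Close'].max()   # highest closing price in the dataset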

Now that we have a complex pandas DataFrame with multiple columns and rows, let’s learn how we can subset information, expanding our knowledge of iloc and loc.

Subsetting Information

Rather than explaining every nuance of the indexing methods in pandas, in this section we will answer questions about our data with code! This way, you will practice the translation between natural language and the indexing methods available in the library.

Let’s start by trying to select the rows that will answer some questions:

  • Can we select the days where Apple’s closing stock price was higher than $20?

We can answer this question with apple.loc[apple['Close'] > 20]. This code will output all days that respect this condition:

This is a really neat feature of .loc. Notice that inside the loc we now pass our condition apple['Close'] > 20, and the output will respect that condition.

But… what if I want to subset a range of values? For example: can we select the days where Apple’s closing stock price was between $20 and $30?

We can achieve that by wrapping our conditions in parentheses inside the .loc:

apple.loc[(apple['Close'] > 20) & (apple['Close'] < 30)]

The ampersand (&) connects our conditions with an AND, while the pipe symbol (|) is used for OR conditions.
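For instance, an OR version of the question above could select the days outside that price range:

apple.loc[(apple['Close'] < 20) | (apple['Close'] > 30)]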

Let me translate the conditions above into SQL-like code:

  • On apple.loc[apple['Close'] > 20] , we are doing something similar to this query: SELECT * FROM 'apple' WHERE Close > 20 .
  • On apple.loc[(apple['Close'] > 20) & (apple['Close'] < 30)] , we are doing something similar to this query: SELECT * FROM 'apple' WHERE Close > 20 AND Close < 30.

We can also retrieve specific columns from our multidimensional object using a magic trick. Just by adding a comma in the .loc, we can subset columns (which are considered the second dimension of the object):

apple.loc[apple['Close'] > 20, 'Volume']

Above, we are answering the question: What is the Volume of the days where Apple’s closing stock price was higher than $20?

Translating this into SQL: SELECT Volume FROM 'apple' WHERE Close > 20 .

But, what if we want multiple columns? Can you guess how to change the pandas code above?

Answer: List indexing!

apple.loc[apple['Close'] > 20, ['Volume','Close']]

Another common wrangling operation consists of grouping data. Let’s see how we can do that in the next section!

Grouping Information

Right now, my data is too granular – I would like to extract the average price per year of the apple and ford stocks. To do that, we need to do two things:

  • Create a new column that will contain the year extracted from the Date column.
  • Extract the mean of the Close price grouped by that Year column.

We have an issue with the Date column – why? Because, right now, it is being treated as an object column, as we can see by calling the .dtypes property:

apple.dtypes

Although we could perform a substring operation to extract the year, it is better to use datetime properties to extract this information from the Date. Let me introduce another data type available in pandas – the datetime!

To convert an object column into datetime, we can call the to_datetime function:

pd.to_datetime(apple['Date'])

In the argument of the to_datetime function, we pass the column(s) that we want to convert to dates. Notice that we can convert this column because it is in a format that pandas expects for dates (more information about formats is available in the official documentation of the library).
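If the column were in a less standard format, we could pass an explicit format string to to_datetime – the example below assumes hypothetical day/month/year values, so the format string must be adapted to the actual data:

pd.to_datetime(apple['Date'], format='%d/%m/%Y')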

Now, the dtype of the returned object has changed:

Having this datetime object, we can access dt properties and conveniently extract the year using .dt.year. Let’s try that below:

pd.to_datetime(apple['Date']).dt.year

But, how can we add this column to our existing DataFrame? That’s super easy – we just assign it to something that does not exist (yet!):

apple['Year'] = pd.to_datetime(apple['Date']).dt.year

This will create a new column in our apple DataFrame named Year, containing the year extracted from the Date column:

The first part of our process is complete! Now, let’s go for the second part, where we will produce an average of the Close price by Year – that sounds difficult, right?

But it is not! As you probably guessed, things in pandas are super simple once you get the hang of how the library works. First, we need to call the groupby method that enables us to… well… group by a column!

apple.groupby(['Year'])

The groupby method does not output a table by itself – it returns a GroupBy object that just prepares pandas to aggregate by the column in the argument. After generating the groupby object, we can pass the column that we want to aggregate and the metric:

apple.groupby(['Year'])['Close'].mean()

We can also do other calculations with other functions – for example, extracting the max per group:

apple.groupby(['Year'])['Close'].max()

This last instruction can be translated into the following SQL query: SELECT Year, MAX(Close) AS max_close FROM apple GROUP BY Year.

We can also perform aggregations on multiple columns at the same time, for example:
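As a sketch, we can select several columns before aggregating and pass multiple metrics to the agg method (the columns and metrics below are just illustrative choices):

apple.groupby(['Year'])[['Close','Volume']].agg(['mean','max'])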

Before we move on to the Combining Objects section, let’s create two aggregated tables with information about the dividends per share paid by each company from the year 2000 onwards:

ford['Year'] = pd.to_datetime(ford['Date']).dt.year
apple_dividends = apple.loc[apple.Year >= 2000].groupby(['Year'])['Dividends'].sum()
ford_dividends = ford.loc[ford.Year >= 2000].groupby(['Year'])['Dividends'].sum()

Ok, let’s take this last instruction a bit slowly, as we are building on a couple of things we’ve learned throughout the blog post:

  • First, we are filtering our data for every row from the year 2000 onwards: apple.loc[apple.Year >= 2000]
  • This returns a DataFrame on which we can then apply groupby and sum, using groupby(['Year'])['Dividends'].sum()

Let’s visualize our apple_dividends DataFrame (or is it one?):

When we perform groupby operations that return a single column, we get a pd.Series, not a DataFrame. We can explicitly convert the object into a DataFrame by wrapping it in the pd.DataFrame function:

apple_dividends = pd.DataFrame(apple_dividends)
ford_dividends = pd.DataFrame(ford_dividends)

Now, in DataFrame format – looking at the apple_dividends object:

It would be interesting to combine these DataFrames into a single object. That’s something we will take care of in the next section!

Combining Objects

In this section, I’m going to show you how to combine DataFrames in multiple ways:

  • Vertically, by stacking or appending them.
  • Horizontally, by using a join.

First, let’s learn how to stack tables vertically (similar to a UNION ALL operator in SQL) – we can do that using pd.concat:

pd.concat([apple_dividends,ford_dividends])

This will create a DataFrame with the dividend data for both companies. In the current format, it will be hard to understand which company each row refers to. An alternative is to create a new column in the original DataFrames that states the name of the company:

apple_dividends['company'] = 'apple'
ford_dividends['company'] = 'ford'

combined_dividends = pd.concat([apple_dividends,ford_dividends])

The indexes repeat themselves, which may look weird, although it can come in handy when indexing data from the same year:

combined_dividends.loc[2015]

We might also prefer to have the two time series side by side. To do that, we have to change our original data a bit and think about the process in the context of a table join. If we want to keep the year as the index, we can use the merge function to combine the data:

apple_dividends.merge(ford_dividends, left_index=True, right_index=True)

The example above is similar to doing the query:

SELECT a.Dividends AS Dividends_x, a.company AS company_x, b.Dividends AS Dividends_y, b.company AS company_y
FROM apple_dividends AS a
INNER JOIN ford_dividends AS b
ON a.Year = b.Year

As pandas does not support two columns with the same name in a merge result, it automatically adds an _x or _y suffix to differentiate the columns based on which table they came from.
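If those default suffixes feel too cryptic, merge also accepts a suffixes argument to control them – for example:

apple_dividends.merge(ford_dividends, left_index=True, right_index=True,
                      suffixes=('_apple','_ford'))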

Alternatively, we can also use the convenient on argument, which supports joins by any column. To have our Year as a column, we can reset the index of both tables:

apple_dividends.reset_index(inplace=True)
ford_dividends.reset_index(inplace=True)

Pay attention to the inplace=True in this instruction. This is a very important concept in Python! Some operations may be done in place, meaning that the object is changed directly, without the need to reassign it. In the example above, we just reset the index of both tables, turning the Year into a column:
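The same result can be achieved without inplace by reassigning the result – a style many pandas users prefer, as it keeps the data flow explicit:

apple_dividends = apple_dividends.reset_index()
ford_dividends = ford_dividends.reset_index()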

Now, we can use the Year in the merge function, explicitly:

joined_dividends = apple_dividends.merge(ford_dividends, on='Year')

This may be useful if you want to combine data by a specific column that does not belong to the index of the DataFrame.

With our joined table saved, let’s finish this blog post by looking at some simple pandas plotting capabilities!

Plotting Capabilities

pandas contains an awesome integration with Python’s matplotlib. A cool way to plot data is to just call the .plot() method:

joined_dividends.Dividends_x.plot()

And if I call a plot on both pandas series, they will show up in the same plot at the same time, enabling us to compare the dividends of both companies:

joined_dividends.Dividends_x.plot()
joined_dividends.Dividends_y.plot()

This plot is still a bit incomplete – we don’t have a title, axis labels, or any indication of which company each line refers to. Can we improve it using the .plot() function?

For sure! Instead of calling the .plot method on a column, we can call the method on the entire DataFrame and control the x and y axes from there:

import matplotlib.pyplot as plt

joined_dividends.plot(x='Year', 
                      y=['Dividends_x','Dividends_y'], 
                      xlabel='Year', 
                      ylabel='Dividends Value',
                      title='Annual Dividends by Company')

plt.legend(['Apple','Ford'])

There are some properties that we can’t change on the .plot() method. For example, adding a custom legend to the lines must be done using the matplotlib library, by calling plt.legend() after creating our plot.

We can also change the type of plot by using the kind argument – for example, let’s view the dividends in a bar plot:

joined_dividends.plot(x='Year', 
                      y=['Dividends_x','Dividends_y'], 
                      xlabel='Year', 
                      ylabel='Dividends Value',
                      title='Annual Dividends by Company',
                      kind='bar')

plt.legend(['Apple','Ford'])

Although limited, pandas plotting capabilities add an extra flavor to the library’s features and are very handy, particularly for quick exploratory analysis of the data. The downside of the pandas plotting API is that multiple matplotlib features aren’t available through it, so it is normal to import both libraries, particularly when we want to create more complex plots.


And that’s it! Thank you for taking the time to read this post and I hope you’ve enjoyed learning about pandas.

This library is a must for every professional or student who wants to work with DataFrames in Python. Getting to know this object will also be very important for working with other frameworks (such as Spark) or languages (such as R). Using pandas has become a crucial skill for data scientists, engineers, and analysts, and I believe pandas will continue to be used throughout the next decade, as it is already ingrained in a lot of data science pipelines.

On the downside, there are a couple of limitations to pandas, namely:

  • Memory usage: pandas may perform poorly on large and diverse DataFrames, as data is generally held in memory. Other frameworks, like PySpark, may be better suited for these operations.
  • Mutability: for people who have never worked with Python, the mutability property may be confusing at first. Handling mutability incorrectly is one of the most common mistakes people make when working with the library.
  • Unstructured data: although pandas may support lists or dictionaries as columns, storing unstructured data inside it is a massive headache and an error-prone process.

As a summary, let’s detail some of the topics that we’ve covered during the blog post:

  • Dealing with the unidimensional pandas object, the Series, an object that works similarly to R vectors.
  • Creating, indexing and doing all sorts of operations with the main pandas object, the DataFrame.
  • Reading external data, such as CSV files, into DataFrame objects.
  • Combining different DataFrames, both vertically (append) and horizontally (joining data).
  • Plotting data using the matplotlib API in pandas.

After reading this post, I recommend that you read the pandas official documentation and experiment with a few functions that we haven’t covered in this blog post, such as windowing operations or DataFrame reshaping.

Additionally, if you are on the path towards becoming a data scientist, you may find it relevant to learn about NumPy, Matplotlib, scikit-learn, or PySpark.

Do you think I’m missing something in this Comprehensive Guide? Let me know in the comments below so I can add it!

If you would like to drop by my Python courses, feel free to join my free course here (Python For Busy People – Python Introduction in 2 Hours) or a longer 16 hour version (The Complete Python Bootcamp for Beginners) where you have a 30-day refund policy if it’s not a fit for you. My Python courses are suitable for beginners/mid-level developers and I would love to have you on my class!

The dataset used on this post is licensed under a Creative Commons Attribution CC0 Public Domain.
