
Python for Beginners – Pandas

The friendly panda helping us deal with spreadsheet data

In the last article we saw how Numpy can make our lives a lot easier when dealing with large amounts of numerical data. But sometimes other types of data are involved too, and then Numpy is not always the best solution. In such situations, Pandas comes in handy. It has many possible applications, but my preferred use case is reading files that are not purely numerical, i.e., .csv or .xlsx files that contain some kind of observations (numerical data) but also textual information or metadata about those observations.

This article is structured as follows:

  • Introduction
  • Pandas Series and Pandas Dataframe
  • Reading data
  • Inspecting data
  • Selecting data
  • Simple plots
  • Conclusion

Enjoy the reading! 🙂

Photo by billow926 on Unsplash

Introduction

What is Pandas? Pandas is an extremely popular library, built upon Numpy, for handling tabular data and for data manipulation and analysis. Probably the best thing about Pandas is that it stores data as a Python object with rows and columns, very similar to data stored in Excel files. This also lets us easily visualize our data, making our job a lot easier than handling data in the form of lists or dictionaries. Another advantage over Numpy is that Pandas handles multiple data types (e.g., strings), not only numerical data, although I need to mention the downside: this makes it slower in comparison to Numpy.

How to install Pandas?

Well, if you are using Anaconda, Pandas comes preinstalled in the base environment. However, more often than not, it’s good practice to create a new environment for each new project. To install Pandas in a new environment, we activate the environment and type:

conda install pandas

If pip is being used, Pandas can be installed by typing:

pip install pandas

How to import Pandas?

When importing certain libraries, including Pandas, we follow a convention: we use well-established abbreviations for library names. In the case of Pandas, we use "pd".

import pandas as pd


Pandas Series and Pandas Dataframe

Before we start dealing with some of Pandas’ tools, we need to mention the two data structures Pandas uses to store data: the Pandas Series and the Pandas DataFrame. Think of a Pandas Series as a one-column Excel spreadsheet with an additional index column, or even better, if you are familiar with Numpy, as a one-dimensional array.

Imagine daily meteorological observations at a point location, for example wind speed over a period of two weeks. Let’s see an example.

Image by author

We use the pd.Series() command and provide a list, in this case a list of floats with a length of 14. When printed, we also get the index of each element in the Series. We can also create a Series from a one-dimensional Numpy array. For the sake of efficiency, I shall use the same list of wind speeds.

Try printing out the newly created Series, and compare it with the previous result.
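A minimal sketch of both approaches (the wind speed values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# 14 daily wind speed observations (m/s), example values
wind_speed = [4.2, 3.1, 2.8, 5.6, 7.1, 6.3, 4.4,
              3.9, 2.5, 3.3, 5.0, 6.8, 4.7, 3.6]

# A Series from a plain Python list
s_from_list = pd.Series(wind_speed)

# The same Series built from a one-dimensional Numpy array
s_from_array = pd.Series(np.array(wind_speed))

print(s_from_list.head())
```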

You may now think: what if we have multiple Series, what if we also have, for example, precipitation measurements? Well, then Pandas stores our data in a DataFrame. I’m pretty sure you get the point: a DataFrame is just a container for multiple Series. Think of it as an Excel spreadsheet with multiple columns and one index column, or as a two-dimensional Numpy array. Let’s see an example.

To create a DataFrame we use the pd.DataFrame() command and provide a dictionary, where the keys represent our column names and the values are the Numpy arrays created before (our data).
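A small sketch, with made-up values and assumed column names:

```python
import numpy as np
import pandas as pd

wind_speed = np.array([4.2, 3.1, 2.8, 5.6, 7.1, 6.3, 4.4])
precipitation = np.array([0.0, 1.2, 5.4, 0.0, 0.3, 2.1, 0.0])

# Dictionary keys become column names, values become the column data
df = pd.DataFrame({"wind_speed": wind_speed,
                   "precipitation": precipitation})
print(df)
```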

Image by author

Looks kind of similar to a Numpy two-dimensional array. Each Pandas DataFrame consists of the following components: index, columns and data (values). The index holds the labels ("names") of the rows, while each column also has its label ("name"). By default, the index runs from zero (0) up to n-1, where n is the number of rows.

Image by author

This DataFrame contains only floats (numbers), but any data type is allowed (integers, strings, Booleans…), although care has to be taken when combining multiple data types inside a column.

To avoid unexpected behaviour, we can also specify the column data type, as follows:
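The author’s original snippet is not reproduced here, so this is a sketch of one common way to do it, using .astype() with assumed column names:

```python
import pandas as pd

df = pd.DataFrame({"station": ["A", "B", "C"],
                   "wind_speed": [4.2, 3.1, 2.8]})

# Cast a single column explicitly to avoid surprises
df["wind_speed"] = df["wind_speed"].astype("float64")

# Or pass one dtype per column as a dictionary
df = df.astype({"station": "string", "wind_speed": "float64"})
```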

You can also use the pd.Series() command with a dictionary, if you want to name the created Series.
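For example (the day labels and values are made up): a dictionary turns the keys into index labels, and the name argument names the Series itself.

```python
import pandas as pd

s = pd.Series({"Mon": 4.2, "Tue": 3.1, "Wed": 2.8}, name="wind_speed")
print(s)
```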

There are many more options and possibilities for creating Series and DataFrames; you can find them on the official Pandas site. But when working on Data Science or Data Analysis tasks, we often have to deal with large datasets stored as comma-separated values (.csv) or Excel spreadsheets (.xlsx). Let’s see how to read such files into a Pandas DataFrame.


Reading data

To read a .csv file, we use the pd.read_csv() command. When files are properly formatted, Pandas figures out which separator, column names, data types, etc. are present in our data. All of the above can also be provided explicitly, and many more arguments can be passed to pd.read_csv() to widen the possibilities of reading differently formatted files. Also, providing the dtype argument for each (or some) column(s) significantly improves the speed of reading the file and lowers the memory usage by quite a margin; find out more in this great article.

To see how pd.read_csv() works, I’ve created a dataset with some meteorological observations for the entire year 2018, which we are going to read in as a Pandas DataFrame and work with. To inspect the newly created DataFrame, Pandas offers some handy methods, df.head() and df.tail() (df stands for the name of our dataframe, in our case weather_data).

Here, a few things are important: first, we need to make sure that our test_data file is in the same folder as our Jupyter notebook (or .py file), and second, that we spell the name of the file correctly and provide the extension .csv.
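A self-contained sketch: the author’s actual test_data.csv is not available here, so we first write a tiny stand-in file with assumed column names (date plus four observation columns) and then read it back.

```python
import pandas as pd

# Tiny stand-in for the author's test_data.csv; column names are assumptions
csv_text = (
    "date,wind_speed,precipitation,max_temp,min_temp\n"
    "1/1/2018,4.2,0.0,3.1,-2.4\n"
    "1/2/2018,3.1,1.2,4.0,-1.0\n"
)
with open("test_data.csv", "w") as f:
    f.write(csv_text)

weather_data = pd.read_csv("test_data.csv")
print(weather_data.head())   # first rows (up to 5 by default)
print(weather_data.tail(1))  # the last row
```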

Image by author

The .head() method prints out the first 5 rows by default (we can change this by passing any integer in the brackets), while .tail() prints out the last 5 rows (also by default).

In the same manner as for .csv files, we can also read Excel files.

weather_excel = pd.read_excel("test_data.xlsx")

Sometimes an error can occur ("XLRDError: Excel xlsx file; not supported"); in that case, make sure to install the openpyxl library in your environment of choice and use the command:

weather_excel = pd.read_excel("test_data.xlsx", engine="openpyxl")

I’ve used basic examples of reading comma-separated value files and Excel files, but make sure to check the official documentation of Pandas (for .csv here, and for .xlsx here) to find out all the arguments we can use with those two functions. Also, you can check one of my previous articles on opening and processing not-so-nicely formatted files with Pandas.


Inspecting data

After we have successfully read the data, we need to take a closer look at our DataFrame. We have already printed the first 5 and last 5 rows, but what about the rest? It’s always good practice to check the shape of your DataFrame, similarly to what we did for Numpy ndarrays. Therefore, we use the df.shape attribute, which returns the number of rows and columns of the DataFrame.

Image by author

The DataFrame has 365 rows and 5 columns, since the year 2018 had 365 days and we have observations for each day: a date column and 4 different observations.

We can get even more information about our DataFrame by using the df.info() method, which yields the length, the number and data types of the columns, and the memory usage.
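A small sketch of both checks, using a few made-up rows in place of the full dataset:

```python
import pandas as pd

weather_data = pd.DataFrame({
    "date": ["1/1/2018", "1/2/2018", "1/3/2018"],
    "wind_speed": [4.2, 3.1, 2.8],
    "precipitation": [0.0, 1.2, 5.4],
})

print(weather_data.shape)  # (rows, columns)
weather_data.info()        # column dtypes, non-null counts, memory usage
```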

Image by author

In hydrological (and meteorological) tasks, we are often interested in the statistical parameters of our dataset, and Pandas has a solution for this problem too. It provides the most commonly used statistical parameters as methods: mean, median, minimum, maximum, skewness, etc. Let’s see a few examples.
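For instance, on a short made-up Series of wind speeds:

```python
import pandas as pd

wind = pd.Series([4.2, 3.1, 2.8, 5.6, 7.1])

print(wind.mean())      # average
print(wind.median())
print(wind.min(), wind.max())
print(wind.describe())  # several statistics at once
```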

Image by author
Image by author
Image by author

Interestingly, df.max() also evaluates the object-type column with the dates. Since we have not converted the dates to the datetime format, the maximum value is not 12/31/2018 but rather the maximum of a string, and that’s 9, so 9/9/2018. In order to convert the dates to the datetime format (which Python then evaluates as real dates), we use the pd.to_datetime() function, as follows:
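A minimal sketch of the conversion, with three sample dates:

```python
import pandas as pd

weather_data = pd.DataFrame({
    "date": ["1/1/2018", "9/9/2018", "12/31/2018"],
    "wind_speed": [4.2, 3.1, 2.8],
})

# As plain strings, the lexicographic maximum wins
string_max = weather_data["date"].max()   # '9/9/2018'

# After conversion, real calendar order applies
weather_data["date"] = pd.to_datetime(weather_data["date"])
date_max = weather_data["date"].max()     # 2018-12-31
```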

Now try to print out the first 5 rows of the DataFrame and compare them to the previous example, before we converted the dates to the datetime format. We can also see a change in the data type of the column.

Image by author

Let’s now go a step further, and learn to select certain parts of the data from our weather_data dataframe.


Selecting data

To select data from a DataFrame, we have several options, e.g. using square brackets [], .loc[] or .iloc[]. There are certainly more possibilities, but those three should cover most of our needs.

Square brackets

Using square brackets, we can select either one column (which returns a Pandas Series) or multiple columns (which returns a Pandas DataFrame). Let’s say we are only interested in the precipitation data.

Image by author

Or, as mentioned, we can provide a list of column names inside the square brackets, which then returns a new DataFrame.
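Both cases in one sketch (values and column names are made up):

```python
import pandas as pd

weather_data = pd.DataFrame({
    "precipitation": [0.0, 1.2, 5.4],
    "wind_speed": [4.2, 3.1, 2.8],
})

precip = weather_data["precipitation"]                  # one column -> Series
subset = weather_data[["precipitation", "wind_speed"]]  # list of columns -> DataFrame
```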

Image by author

It is also possible to select rows by providing indices to the square brackets, but this method is not used very often.

.loc[]

The .loc method works in a slightly different way. It is meant to select data by index label, so the labels’ data type depends on the index data type of the DataFrame we are working on. We use it as follows:
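For example, with a default integer index (values are made up):

```python
import pandas as pd

weather_data = pd.DataFrame({"wind_speed": [4.2, 3.1, 2.8, 5.6]})

row = weather_data.loc[2]   # one index label -> Series
print(row)
```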

Image by author

It also works with DataFrames with other types of indices (string, datetimes, etc.).

Selecting only one index label returns a Series. Selecting two labels, or a subset, returns a new DataFrame.

Image by author

Please notice that when providing multiple labels, we need to store them in a list (hence the double square brackets). Also, .loc includes the last value, as opposed to indexing a list, for example.

With .loc we can also select multiple rows and columns!
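A sketch of both patterns, using assumed column names:

```python
import pandas as pd

weather_data = pd.DataFrame({
    "wind_speed": [4.2, 3.1, 2.8, 5.6],
    "precipitation": [0.0, 1.2, 5.4, 0.0],
    "max_temp": [3.1, 4.0, 2.2, 1.5],
})

# A list of labels (note the double brackets) -> DataFrame
two_rows = weather_data.loc[[0, 2]]

# Label slices are inclusive on BOTH ends, unlike list slicing:
# rows 1..3 and two named columns
subset = weather_data.loc[1:3, ["wind_speed", "precipitation"]]
print(subset)
```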

Image by author

So, .loc can only be used with labels (integers, strings, datetimes…); it can select rows but also columns, and we can provide the selection as a single label, a subset, or a list.

.iloc[]

The .iloc method works very similarly to the .loc method, but it only supports integer locations (whatever the data type of the DataFrame’s index, .iloc selects data by integer position). Using only one position will return a Series, while using multiple positions or a subset will return a DataFrame.

Image by author

.iloc can also be used to select columns simultaneously, but in the same manner as for rows, by their integer location. Let’s take a look.
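A sketch with made-up data, selecting a single row and then a block of rows and columns:

```python
import pandas as pd

weather_data = pd.DataFrame({
    "wind_speed": [4.2, 3.1, 2.8, 5.6, 7.1, 6.3, 4.4, 3.9, 2.5, 3.3],
    "precipitation": [0.0, 1.2, 5.4, 0.0, 0.3, 2.1, 0.0, 0.8, 1.1, 0.0],
    "max_temp": [3.1, 4.0, 2.2, 1.5, 0.9, 2.7, 3.3, 4.1, 5.0, 4.4],
})

one_row = weather_data.iloc[6]  # a single position -> Series

# Rows at positions 6..9 and the first three columns
# (end positions are EXCLUSIVE here, as with list slicing)
block = weather_data.iloc[6:10, 0:3]
print(block)
```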

Image by author

So, in the above example we have selected the 7th, 8th, 9th and 10th rows, and the 1st to 3rd columns (the counting for both rows and columns starts from zero).


Simple plots

It’s good to get the statistics for the data and to know how to select certain parts of the DataFrame, but it’s certainly nice to see a graphical interpretation of the dataset too. So, as a final step of data exploration, let’s take a look at simple plots with Pandas. For this example, let’s plot the wind speed for January 2018.
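A sketch with a synthetic year of data standing in for the author’s file (requires matplotlib; the wind speed values are generated, not real):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs headless
import pandas as pd

dates = pd.date_range("2018-01-01", "2018-12-31", freq="D")
weather_data = pd.DataFrame({
    "date": dates,
    "wind_speed": [3.0 + (i % 7) * 0.5 for i in range(365)],
})

january = weather_data.loc[0:30]        # .loc is inclusive: rows 0..30
january.plot(x="date", y="wind_speed")  # line plot of January wind speed
```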

Image by author

We can also plot multiple columns; let’s now see how warm (or cold) the summer of 2018 was.
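Passing a list of column names to y plots several columns at once. Again a sketch with generated values and assumed column names:

```python
import matplotlib
matplotlib.use("Agg")
import pandas as pd

dates = pd.date_range("2018-01-01", "2018-12-31", freq="D")
weather_data = pd.DataFrame({
    "date": dates,
    "max_temp": [10 + (i % 30) * 0.5 for i in range(365)],
    "min_temp": [2 + (i % 30) * 0.5 for i in range(365)],
})

# June 1 is row 151 and August 31 is row 242 (zero-based day of year)
summer = weather_data.loc[151:242]
summer.plot(x="date", y=["max_temp", "min_temp"])
```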

Image by author

It is worth mentioning that those plots can be produced with other ways of selecting data. Can you think of any? Try it out by yourself.

Precipitation is mostly plotted as a bar plot. Since the fall is pretty rainy at this location in most years, let’s see how rainy it was in 2018 by plotting the precipitation in September and October.
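One way to sketch this with generated data: the drawstyle="steps" keyword is passed through to matplotlib and gives the step-like look the author mentions.

```python
import matplotlib
matplotlib.use("Agg")
import pandas as pd

dates = pd.date_range("2018-01-01", "2018-12-31", freq="D")
weather_data = pd.DataFrame({
    "date": dates,
    "precipitation": [(i % 5) * 1.3 for i in range(365)],
})

# September 1 is row 243 and October 31 is row 303 (zero-based day of year)
fall = weather_data.loc[243:303]
fall.plot(x="date", y="precipitation", drawstyle="steps")
```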

Image by author

We are not actually plotting a bar plot, but the "steps" draw style is a nice workaround to get bar-plot-like plots without messing around with the labels. 🙂


Conclusion

Today we’ve looked into some basic Pandas functions that are necessary when working with spreadsheet-like data. I cannot emphasize enough that there are dozens more functions and methods, which could not all be covered and presented here, but you can find them in the official Pandas documentation. Also, as you work with your own data, you will encounter many of them on your own.

But as a first step to familiarize yourself with our friend Pandas, you will certainly find the above tips useful.


For any questions or suggestions regarding this article or my other articles on Medium, feel free to contact me via LinkedIn.

Thank you for taking the time, Cheers! 🙂

