
8 Useful Pandas Features for Data-Set Handling

The FIFA World Cup and Pandas - Joga bonito

Image Courtesy of Fauzan Saari via Unsplash

Introduction

This article presents 8 simple but useful Pandas operations that showcase how Python’s Pandas library can be used for data-set exploration. The data-set I will use for this tutorial is entitled ‘International football results from 1872 to 2019’ and can be sourced on Kaggle by searching for that title, in case you would like to test any of the code snippets presented.

1. The parse_dates parameter of the Pandas read_csv method

Oftentimes when I first import a data-set, I read the first 5 rows using the head() method. The example shown informs me that I have a date column. Ideally, for date-time based operations I would like this column to have a datetime datatype.

Before assigning the data-set to a variable name, I quickly view the columns to get a sense of the data-set.

To do this when I first import my data-set, I can simply make use of the optional parse_dates parameter of the Pandas read_csv method. This parameter takes a list of the column name(s), as strings, that I would like to have as a datetime datatype.

If multiple columns are to be parsed as a datetime datatype, simply pass each column name into the list, separated by commas. This avoids having to re-assign existing columns using the pd.to_datetime method. As an example, if I read in the data without the parse_dates parameter, I would have to make use of pd.to_datetime to achieve the same end result.

To confirm the column has been successfully converted into a datetime datatype series, simply inspect the dtypes attribute of the DataFrame, which I have assigned the variable name football here.

(N.B: I have created a new column called match_goals, which will be used in later steps.)
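A minimal sketch of this import step (assuming the CSV is saved locally as results.csv and, as in the Kaggle data-set, contains home_score and away_score columns):

```python
import pandas as pd

# Peek at the columns before committing to a variable name
print(pd.read_csv('results.csv').columns)

# Parse the 'date' column at read time; to parse several columns,
# list each name in the parse_dates list, separated by commas
football = pd.read_csv('results.csv', parse_dates=['date'])

# Equivalent two-step alternative, without parse_dates:
# football = pd.read_csv('results.csv')
# football['date'] = pd.to_datetime(football['date'])

# New column used in later steps: total goals scored per match
football['match_goals'] = football['home_score'] + football['away_score']

# Confirm 'date' now shows as datetime64[ns]
print(football.dtypes)
```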

The parse_dates parameter simplifies and cuts down on the amount of code that needs to be written when parsing columns that require a datetime datatype; pd.to_datetime is an equivalent way of executing this conversion.

2. The between method on a Pandas Series

Now that the date column has a datetime datatype, I can exploit this by using the between method on my data-set.

As my focus is on the FIFA World Cup, I filter the football DataFrame by creating a Boolean series where True values are returned if the value in the tournament column matches ‘FIFA World Cup’. I then assign this Boolean series to a variable named world_cup_matches, and filter the football DataFrame using it. This gives me a new DataFrame which I call world_cup for simplicity.

I can now use the between method on the date column in my world_cup DataFrame, to find out the number of matches that took place between two specified dates.

The first argument to the between method is a lower bound string date and the second argument is an upper bound string date. The between method returns a Boolean series, which I assign to the variable games_range. I then filter the world_cup DataFrame with this Boolean series.

For clarity, I have chosen dates between the 1st of January 2010 and the 1st of January 2016, as I know there were two World Cups between these dates: one in South Africa in 2010, and another in Brazil in 2014. In the modern format, each of these World Cups consists of 64 games. Filtering my world_cup DataFrame with the games_range variable should therefore return 128 rows, and indeed it does, as shown in the code snippet presented.

N.B: by default, both the lower and upper bounds are inclusive for the between method; the inclusive parameter can be used to change this behaviour.
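A short sketch of this filtering step, building on the football DataFrame from step 1:

```python
# Boolean series: True where the tournament is the FIFA World Cup
world_cup_matches = football['tournament'] == 'FIFA World Cup'
world_cup = football[world_cup_matches]

# Boolean series: True for matches between the two specified dates
games_range = world_cup['date'].between('2010-01-01', '2016-01-01')
recent_wc_games = world_cup[games_range]

# Two 64-game World Cups fall in this window, so expect 128 rows
print(recent_wc_games.shape)
```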

2.b Filtering by date

Additionally, comparison operators can be used when working with datetime datatypes. Here, I create a Boolean series which returns True values if the ‘date’ is greater than, i.e. later than, the 1st of January 2018. I filter my world_cup DataFrame with this Boolean series and quickly check its shape using the shape attribute. This informs me that 64 rows in the world_cup DataFrame correspond to a date later than the 1st of January 2018. This makes sense on the basis that the 2018 World Cup in Russia took place in the Summer of that year!
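The comparison-operator version might look like this:

```python
# True where the match date is later than the 1st of January 2018
after_2018 = world_cup['date'] > '2018-01-01'

# 64 rows: the 2018 World Cup in Russia
print(world_cup[after_2018].shape)
```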

3. The pivot_table method

The pivot_table method is ideal for aggregating values to look at them as a whole. The pivot_table method is called directly on the DataFrame, here recent_wc_games (from the previous section, based on the two World Cups; see the code snippet below). In the example shown, I aggregate on the ‘match_goals’ column, which corresponds to the number of goals scored per match.

Those values can now be grouped by country and city. These columns are passed to the index parameter of the pivot_table method as a list of strings, creating a multi-level row index. I then select the aggregation functions I would like to run on the grouped data: here, I pass a list to the aggfunc parameter, which gives a count and mean of the grouped data.
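A sketch of the pivot, using the recent_wc_games DataFrame created in section 2:

```python
# Count and mean of match goals, grouped by host country and city
pivot = recent_wc_games.pivot_table(
    values='match_goals',
    index=['country', 'city'],  # two-level row index
    aggfunc=['count', 'mean'],
)
print(pivot)
```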

This data output requires some sorting, but shows how simply the data can be summarised.

4. Sorting

- specifying sort_values with a tuple when a multi-index column is present.

Now that the data has been grouped by country and city for the mean number of match goals and the number of matches, I can sort it to make it more interpretable. In addition, I can use a Pandas one-liner to visualize the data.

I use the sort_values method, but importantly, to sort by the mean match goals scored I need to pass in a tuple. I set the ascending parameter to False to sort the data from the highest mean match goals to the lowest. Here, it seems Porto Alegre in Brazil may have played host to the most entertaining matches at the FIFA World Cup 2014 in Brazil.
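In code, the sort might look like this (the pivot table from section 3 has a column MultiIndex, with the aggregation function on the outer level):

```python
# Sort by the ('mean', 'match_goals') column, highest first
sorted_pivot = pivot.sort_values(('mean', 'match_goals'), ascending=False)
print(sorted_pivot.head())
```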

Porto Alegre, the best place to have watched a FIFA World Cup match?

5. Plotting

- remember to import matplotlib.

To plot these sorted values by mean match goals scored, I simply specify a tuple for the y parameter of the plot method. The first element of the tuple is the outermost column-index value, followed by the inner column-index value, as shown. To produce a vertical bar chart, I pass ‘bar’ as the argument to the kind parameter.
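A sketch of the plotting one-liner, following on from the sorted pivot above:

```python
import matplotlib.pyplot as plt

# %matplotlib inline  # uncomment if your Jupyter version needs it

# y takes a tuple: outer column-index level first, then the inner level
sorted_pivot.plot(y=('mean', 'match_goals'), kind='bar')
plt.show()
```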

Certain versions of the Jupyter Notebook may require the %matplotlib inline cell magic command, so I have included it in this example.

6. The get_group method

To utilize the get_group method, I first create a groupby object on the tournament column of the original football DataFrame that I first introduced. I assign this groupby object the variable name tournaments. The Python len function and the nunique Series method tell me that there are over 40,000 rows, but only 109 unique tournaments, in the original football DataFrame.

I now use the get_group method on the tournaments variable, which points to the groupby object, to return a DataFrame in which only ‘FIFA World Cup’ is listed as the tournament type. Out of the 40,839 rows in the original DataFrame, 900 correspond to FIFA World Cup data.

I can now use this new DataFrame (which I have called fifa) and perform any operation on it. For example, I can group by country to determine the mean number of match goals scored, as shown in the code snippet below.
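A sketch of these steps (the row counts in the comments are those quoted in this article, and may differ in newer versions of the data-set):

```python
# Group the original DataFrame by tournament type
tournaments = football.groupby('tournament')

print(len(football))                     # 40839 rows in the version used here
print(football['tournament'].nunique())  # 109 unique tournaments

# New DataFrame containing only the FIFA World Cup rows
fifa = tournaments.get_group('FIFA World Cup')
print(fifa.shape)  # 900 rows of FIFA World Cup data

# Mean match goals per host country, as a method chain
# ending in head() and plotted as horizontal bars
(fifa.groupby('country')['match_goals']
     .mean()
     .sort_values(ascending=False)
     .head()
     .plot(kind='barh'))
plt.show()
```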

The get_group method works very nicely for acquiring new DataFrames. I could equally write tournaments.get_group(‘Copa America’) to get all the rows corresponding to that tournament from the original football DataFrame.

To produce this output I have method-chained, ending with the head method, which explains why only 5 horizontal bars are shown.

7. Multiple aggregation functions on a single column

Often it is necessary to perform more than one aggregation function on a single column from a DataFrame. In the example presented, I would like to determine the number of records, and the mean, minimum and maximum match goals scored, for each country which hosted a World Cup match.

In order to do this, I create a groupby object and pass the agg method a dictionary, where the key represents the column and the value is a list of the aggregation functions I would like to perform. If I want to sort this output, it is necessary to pass a tuple to the sort_values method, where the first element is the column and the second is the aggregation function that I would like to sort by.
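A sketch of the aggregation and sort described above:

```python
# Several aggregation functions on the single 'match_goals' column
stats = fifa.groupby('country').agg(
    {'match_goals': ['count', 'mean', 'min', 'max']}
)

# The columns now form a (column, aggregation) MultiIndex,
# so sorting by the mean again requires a tuple
print(stats.sort_values(('match_goals', 'mean')).tail())
```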

Interestingly, it seems a World Cup match in Switzerland played host to a 12-goal thriller! Switzerland also had the highest mean match goals, averaging over 5 per match! For interested parties, I did fact-check these results against the 1954 FIFA World Cup in Switzerland, as they even surprised me!

The output is sorted by the ‘mean’ aggregation function and the tail is shown, which defaults to the last 5 rows.

8. pd.cut to create categories

Finally, the pd.cut function can be conveniently used to categorize data into user-defined bins. For this example, I have created a sequence of 4 bins: -1 to 1 goals (the lower edge of each bin is exclusive, hence I cannot start the first bin at 0), 1–3 goals, 3–5 goals, and 5–13 goals for those outlier games!

I then give each bin an appropriate corresponding label name, and create a new series called ‘goal_category’. Random sampling for internal verification confirms that the categories are correctly assigned. One final check I like to run is for any missing data (null/NaN values) in my new column. I do this by calling the isna() method directly on the new series I have created, and this confirms there is no missing data.
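A sketch of the binning step; the label names here are my own choice and can be anything descriptive:

```python
# Copy to avoid a SettingWithCopyWarning when adding a column
fifa = fifa.copy()

# Bin edges give the intervals (-1, 1], (1, 3], (3, 5] and (5, 13];
# lower edges are exclusive, hence starting at -1 rather than 0
bins = [-1, 1, 3, 5, 13]
labels = ['0-1 goals', '2-3 goals', '4-5 goals', '6-13 goals']
fifa['goal_category'] = pd.cut(fifa['match_goals'], bins=bins, labels=labels)

# Random sampling to spot-check the assigned categories
print(fifa.sample(5)[['match_goals', 'goal_category']])

# Confirm no values fell outside the bins (they would appear as NaN)
print(fifa['goal_category'].isna().sum())
```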

Random sampling for internal verification confirms that the column categories are correctly assigned.

Quick Summary

This article has demonstrated a variety of useful methods, and their parameters, that can be intuitively used to investigate a data-set. There are multiple ways of performing each of the steps presented, but I often stick to the methods shown to aid readability.

