The world’s leading publication for data science, AI, and ML professionals.

Data analysis beginners: Python makes it easy

In this series, 'Spreadsheets to Python,' I explore the many joys and benefits of Python data analysis and encourage readers to try it…

SPREADSHEETS TO PYTHON

Photo by Hitesh Choudhary on Unsplash
Photo by Hitesh Choudhary on Unsplash

It’s a great time to be a data analyst. With so much data available for analysis at the click of a button, the opportunities for driving knowledge and having impact are limitless.

The technology we use to manage and dissect this data can make the task daunting or delightful.

One of the big advantages of using code over spreadsheets is how easy it is to reliably extract information from large datasets. Here, we will produce code that repeatably performs calculations on different sub-sets of data to produce interesting comparisons.

To do this in a spreadsheet would require manual manipulations that might introduce errors.

And because the code is immediately visible, it’s easy for another user to see what is going on and to check that it is correct; in a spreadsheet this would have to be done cell-by-cell, tab-by-tab for all the outputs.

I’ve been a lifelong fan of cricket, and cricket fans love statistics. I’m also a convert to the joys of Python, so I have decided to combine the two for this exercise.

I’ll show you how to extract information from a dataset using Python – and find out a bit more about one of England’s finest batmen: Joe Root.

As well as looking at his career statistics, we’ll see how Root’s batting average depends on factors such as where he plays, different opposition, when during the match he plays, etc.

Although the data used has only 200 rows and 11 columns, the methods used will easily scale.

And because much of the work is repetitive, we’ll write a short function to reliably extract the data we want, which makes everything much neater.

Let’s load some data

First we’ll need to open a Python environment. I use Filament, which allows me to combine the text and code for the blog, and also makes it easy to share with colleagues for feedback.

I have compiled a csv file containing Root’s batting data up until his most recent test match, against India, in early September. This replicates a pretty common situation: we have a csv file, and we need to choose how to view and analyse the data it contains. Let’s see how this works using Pandas in Python.

The first step is to load the modules we will require later on.

Now I’m going to read the csv file,

which gives me:

Image by author
Image by author

For clarity, I haven’t included all of the lines in the image above.

Next, as is nearly always the case, we need to do a little bit of data manipulation. Joe Root plays principally as a batsman, and we are interested in his batting average. This is calculated by dividing the total number of runs scored by the number of times he has been out. However, sometimes he finishes an innings ‘not out’. This is indicated in the Dismissal column, and by the * in Runs. These scores count towards the runs, but not towards the outs.

What I need is an easy way of counting up the number of outs. I will do this by creating a new column, Out, with a 1 if he was out, and a 0 if not. I will also create a new column, Runs 2, containing the numerical data only, i.e. without the *.

Let’s see this in action.

Image by author
Image by author

It’s useful to check some overall stats to make sure that the manipulations were done correctly.

This gives: Total Innings = 200; Total Runs = 9278; Not Out = 15; High Score = 254; Career Average = 50.15; Total Centuries = 23.

A quick look on the internet confirms these values.

Before moving on to more detailed analysis, let’s visualise some key data: runs, career average and recent average.

Image by author
Image by author

It’s really easy to produce attractive graphs in Python. Here we see Root’s excellent record in the first 120 innings, followed by a decline over the next 50 or so, and a recent return to form.

Let’s write some code to explore the data

Now, I want to do a deeper dive into the data. Let’s take a look at Root’s average against different opponents. I’ve commented the code slightly to show what is going on.

Image by author
Image by author

Leaving aside Bangladesh and Ireland, against who Root has only played 6 innings in total, we see that his averages against India and Sri Lanka are particularly high, possibly because of his well-known strength against spin bowling, whilst those against Australia and New Zealand are much lower than the others.

That’s cool. Now I want to try looking at his batting record in different innings of the match (for non-cricket-lovers, the match has four innings, and in normal circumstances each team bats in either 1 and 3 or 2 and 4).

It looks like we’re going to be using the above code quite a bit. Let’s define a function which allows us to choose the variable we want to compare against and outputs the relevant averages. This is really just a case of turning the previous code into a function, but I have also shortened the variable names.

It’s probably a good idea to check that this function gives the same results as above. We call the function as follows:

Image by author
Image by author

👍

Let’s use our new code

It’s now really simple to calculate Root’s average in different innings of the match. This is pretty interesting, as it shows that his average is much higher when batting in the second innings of a match than in the first or third, and is very low in the fourth. Perhaps Root (and England) don’t like chasing a total.

Image by author
Image by author

A cricket team has 11 players, each of whom takes a turn at batting. The recognised batsmen usually play in positions 1–6, and Root has occupied all of them, preferring to play at 4 even though, as we can see below, his best position is actually at number 5.

Image by author
Image by author

Now that we have a function, it’s tempting to run it on all the different column headings in the original table, but I’m not going to do that here (I have done it though, and the results are in the Appendix below).

Instead, let’s have a look at how easy it now is to compare records against different countries. Root has played almost the same number of innings against India and Australia, but with very different averages.

We can quickly produce Data Frames containing just the information from matches against these two countries.

Now, our function can be used on each of these new Data Frames to extract more detailed comparisons of Root’s performance against them. To demonstrate, here is a comparison of his average in different innings of the match:

Joe Root average against India - image by author
Joe Root average against India – image by author
Joe Root average against Australia - image by author
Joe Root average against Australia – image by author

And a comparison of numbers of centuries:

Centuries v India = 8; Centuries v Australia = 3.

The convenience of doing this analysis in Python is that we can use these functions to quickly explore any other aspects of the data that we want.

Conclusion

I may not have won you over as far as cricket goes, but I hope I have illustrated how simple and useful data analytics in Python can be.

Using code allows us to extract information from the original csv file and to perform calculations from this information. It was easy to define a function to do what we wanted, and then to use that function to explore the effect of different factors, or variables, on the outcome. To do this in a spreadsheet would require time-consuming manual work.

Finally, because we’re not working in the same file as the original data, we don’t have to worry about accidentally changing something – all the steps in the analysis are repeatable.

All of this code could easily be used on much larger amounts of data, which would be more cumbersome in a spreadsheet, if the file could be opened at all. There is even a library, Dask, for handling really enormous data.

I hope this post has shown some how easy it is to use Python to explore data, and inspired you to try it out for yourself. For other examples to follow along with, check out my most recent ‘Spreadsheets to Python’ article.


The text, code and images for this post were created using Filament, the all-in-one workspace for data, analysis and reporting. To find out more please visit our website. Filament is currently running a closed Beta program; the first 100 people to request access using referral code TDSFILAMENT can skip the waitlist and get early access.

Thanks to Bonnie Sellers Wielewski and Euan Wielewski


Stay connected


Appendix – a few more interesting records

Home or Overseas

Image by author
Image by author

How Dismissed

Image by author
Image by author

Ground

Image by author
Image by author

Related Articles