Matplotlib — Making data visualization interesting
Data visualization is a key step in understanding a dataset and drawing inferences from it. While one can always inspect the data closely, row by row and cell by cell, it’s often a tedious task and does not highlight the big picture. Visuals, on the other hand, present data in a form that is easy to understand at a glance and keeps the audience engaged.
Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. — matplotlib.org
Matplotlib is a basic library that provides options for various plots along with extensive customizations in the form of labels, title, font size etc. I watched numerous videos and read articles online on visualization. To understand Matplotlib better, I took the population density dataset from Kaggle and started creating my own visualizations. This article highlights the plots I drew, including the customizations and the inferences that I drew from the data.
The complete work is available in the GitHub repository Visualization using Matplotlib for quick reference. Let’s begin!
Import libraries
As always, we first need to import all necessary libraries. We import the NumPy and Pandas libraries for data handling. We then import matplotlib and use its module pyplot for plotting our data and cm for the colour palette. The statement %matplotlib inline ensures that all plots appear inline in the notebook.
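The imports described above can be sketched as follows. The Agg backend line is my own assumption so that the snippet also runs outside a notebook; inside Jupyter you would use the magic command shown in the comment instead.

```python
import numpy as np
import pandas as pd
import matplotlib

# Assumption: a non-interactive backend, so the snippet runs as a plain script.
matplotlib.use("Agg")

import matplotlib.pyplot as plt
from matplotlib import cm

# In a Jupyter notebook, run this magic in a cell instead of matplotlib.use():
# %matplotlib inline
```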
Import dataset
The first four rows of the file dataset.csv are not required, so we import the data into the variable dataset while skipping the first four rows. We then use head(5) to check the data.
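A minimal sketch of this step is shown below. To keep it runnable without the real file, it reads from a small in-memory stand-in whose contents are entirely hypothetical; with the actual file you would pass "dataset.csv" to read_csv instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for dataset.csv: four preamble lines, then the header.
# The values are placeholders, not real densities.
raw = io.StringIO(
    "preamble line 1\n"
    "preamble line 2\n"
    "preamble line 3\n"
    "preamble line 4\n"
    "Country Name,Country Code,1961,1962\n"
    "Aruba,ABW,1.0,2.0\n"
    "Andorra,AND,3.0,4.0\n"
)

# With the real file: dataset = pd.read_csv("dataset.csv", skiprows=4)
dataset = pd.read_csv(raw, skiprows=4)

# Look at the first five rows to check what was loaded.
print(dataset.head(5))
```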
We immediately see some columns that are not relevant to us. Firstly, we can keep the Country Name but can drop the Country Code. As we know we are dealing with population densities, we can drop the columns Indicator Name and Indicator Code. Next, the columns for the years 1960 and 2016 have NaN values. NaN stands for Not a Number, and we should drop these columns as they provide no information. Lastly, there is an unnamed column which also has NaN values, so we can drop Unnamed: 61 as well.
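Dropping these columns could look like the sketch below. The tiny frame only mimics the column names mentioned above; its values are placeholders I made up for illustration.

```python
import numpy as np
import pandas as pd

# Placeholder frame with the same column names as described in the article.
dataset = pd.DataFrame({
    "Country Name": ["Aruba", "Andorra"],
    "Country Code": ["ABW", "AND"],
    "Indicator Name": ["placeholder", "placeholder"],
    "Indicator Code": ["placeholder", "placeholder"],
    "1960": [np.nan, np.nan],
    "1961": [1.0, 2.0],
    "2016": [np.nan, np.nan],
    "Unnamed: 61": [np.nan, np.nan],
})

# Drop the identifier columns and the all-NaN columns in one call.
dataset = dataset.drop(columns=[
    "Country Code", "Indicator Name", "Indicator Code",
    "1960", "2016", "Unnamed: 61",
])
print(dataset.columns.tolist())
```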
We drop all rows that contain blank or null values using the dropna method and confirm that no column has null values left using dataset.isnull().sum(). We see that every column shows 0 null values.
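Here is a small sketch of the null handling, again on a made-up frame so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Placeholder data: country "B" has a missing value in 1961.
dataset = pd.DataFrame({
    "Country Name": ["A", "B", "C"],
    "1961": [1.0, np.nan, 3.0],
    "1962": [1.5, 2.5, 3.5],
})

# Remove every row that contains at least one null value.
dataset = dataset.dropna()

# Count the remaining nulls per column; each count should now be 0.
null_counts = dataset.isnull().sum()
print(null_counts)
```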
We are now ready to visualize our data.
Visualisation
Now, we will use Matplotlib to create our plots and use visualizations to draw meaningful conclusions.
Line Plot
We first analyse Aruba's population density over the years using a line plot. We take the years along the x-axis and population density along the y-axis.
We select the x values as dataset.columns[1:], which selects all columns except the first, as we need only the years and not the Country Name column. Next, we select the y values as dataset.iloc[0][1:], which selects the first row, that is the country Aruba, and all columns except the first. The country name can be derived using dataset.iloc[0, 0]. To plot the graph, we simply use the plot function and pass x and y as the values for the x-axis and y-axis respectively.
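The selection and the bare plot can be sketched like this. The density values are placeholders, and the cast to float is my own addition, since a row that also holds the country name comes back with object dtype.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: run as a script, not in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values, not real densities.
dataset = pd.DataFrame({
    "Country Name": ["Aruba", "Andorra"],
    "1961": [10.0, 1.0],
    "1962": [11.0, 2.0],
    "1963": [12.5, 3.0],
})

x = dataset.columns[1:]                    # the year columns
y = dataset.iloc[0].iloc[1:].astype(float) # first row, without the name
country = dataset.iloc[0, 0]               # "Aruba"

plt.figure()
lines = plt.plot(x, y)
```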
Congratulations!! Our first plot is ready!!
Yeah, it does the job, but it’s difficult to read. There are no labels and the axis values overlap. This is where the customization power of matplotlib comes in handy. Let’s make it a bit more interpretable.
The rcParams allow us to modify the figure size, font size and a lot more. We then add a title, xlabel, and ylabel. xticks allows us to define the angle of rotation of our tick labels, which we set to 90 degrees. We then plot again and set the width of the line to 4.
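A sketch of these customizations follows. The figure size, font size and label texts are assumptions of mine; only the 90 degree rotation and the line width of 4 come from the text. I call xticks after plot here so that the rotation applies to the final tick labels.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: run as a script, not in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values, not real densities.
dataset = pd.DataFrame({
    "Country Name": ["Aruba"],
    "1961": [10.0], "1962": [11.0], "1963": [12.5],
})
x = dataset.columns[1:]
y = dataset.iloc[0].iloc[1:].astype(float)
country = dataset.iloc[0, 0]

# Assumed values for figure size and font size.
plt.rcParams["figure.figsize"] = (20, 10)
plt.rcParams["font.size"] = 20

plt.figure()
plt.title("Population density of " + country)
plt.xlabel("Year")
plt.ylabel("Population density")
lines = plt.plot(x, y, linewidth=4)
plt.xticks(rotation=90)  # rotate the year labels so they do not overlap
```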
We can now see that the plot is much more descriptive with just a few modifications.
From the plot above, we can see that there was a steady rise in the population density till the 1980s. From the 1990s, the density shot up drastically and kept growing at that pace till it levelled off in the mid 2000s.
We can also use line graphs to see the trend amongst various countries. We take the first 5 countries and compare their population density growth. As we are now dealing with multiple lines on a single graph, we should use different colors for each country and also define a legend that assigns a unique color to each country.
Here, we loop through the first 5 rows of the dataset and plot the values as line graphs. We select the colors using the cm module of matplotlib and its rainbow colormap. In the plot method, we must specify the label parameter, as it ensures that the country names are shown when the legend is enabled. The legend is enabled using the method legend(), where I have specified one property, namely a size of 24.
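The loop could be sketched as below. The numbers are synthetic placeholders generated with arange, and passing the size through prop={"size": 24} is my reading of "one property, namely size of 24".

```python
import matplotlib
matplotlib.use("Agg")  # assumption: run as a script, not in a notebook
import matplotlib.pyplot as plt
from matplotlib import cm
import numpy as np
import pandas as pd

countries = ["Aruba", "Andorra", "Afghanistan", "Angola", "Albania"]
years = ["1961", "1962", "1963"]

# Synthetic placeholder values, not real densities.
dataset = pd.DataFrame(np.arange(15, dtype=float).reshape(5, 3), columns=years)
dataset.insert(0, "Country Name", countries)

# Five evenly spaced colors from the rainbow colormap.
colors = cm.rainbow(np.linspace(0, 1, 5))

plt.figure()
for i in range(5):
    row = dataset.iloc[i]
    plt.plot(years, row.iloc[1:].astype(float),
             color=colors[i], label=row.iloc[0], linewidth=4)

# The label of each line becomes its legend entry.
legend = plt.legend(prop={"size": 24})
```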
We can see that for all 5 countries, Aruba, Andorra, Afghanistan, Angola and Albania, there has been a rise in the population density.
Now that we have plotted all line graphs on a single plot, it’s very easy to see that Aruba has always had a higher population density than the other 4 countries.
Bar Plot
A bar plot can be easily created using the method bar() with the relevant arguments. We begin by plotting the population density of all countries for the year 2015.
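A minimal version of this first bar plot might look like the following, with a handful of placeholder rows standing in for the full list of countries.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: run as a script, not in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values, not real densities.
dataset = pd.DataFrame({
    "Country Name": ["Country A", "Country B", "Country C"],
    "2015": [12.0, 340.0, 56.0],
})

plt.figure()
# One bar per country, with the 2015 density as the bar height.
bars = plt.bar(dataset["Country Name"], dataset["2015"])
```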
While the plot tries to provide a lot of information at one time, it is for that very reason that it fails to provide any useful information. There is so much overlap in the x-axis labels that it renders the whole plot useless. We need not only to visualize our dataset but also to pick out and visualize its important parts. It’s better if we select the 10 countries with the highest density and take a look at them.
We first sort the countries using the sort_values method, based on the population density of the year 2015 and in descending order. We then select the top 10 using the head(10) method and draw our bar plot with this new data.
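The sort and selection can be sketched as follows. The twelve countries and their values are hypothetical, and coloring the bars with the rainbow colormap is an assumption based on how colors were chosen for the line plots.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: run as a script, not in a notebook
import matplotlib.pyplot as plt
from matplotlib import cm
import numpy as np
import pandas as pd

# Hypothetical countries with placeholder densities.
dataset = pd.DataFrame({
    "Country Name": [f"Country {i}" for i in range(12)],
    "2015": [5.0, 120.0, 33.0, 7.5, 900.0, 64.0,
             210.0, 18.0, 2.0, 450.0, 75.0, 310.0],
})

# Sort by the 2015 density, highest first, and keep the top 10 rows.
top10 = dataset.sort_values(by="2015", ascending=False).head(10)

colors = cm.rainbow(np.linspace(0, 1, len(top10)))

plt.figure()
bars = plt.bar(top10["Country Name"], top10["2015"], color=colors)
plt.xticks(rotation=90)
```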
The information is much clearer now. The bars are equally spaced and distinctly represented using different colors.
Macao SAR and Monaco have the highest population density of all countries available in the dataset for the year 2015.
Scatter Plot
A scatter plot displays the data as individual points. Scatter plots are really useful as we can specify the size of each data point based on a certain value, so each point can itself show how it differs from the others.
We now analyze the countries where the average population density across all the years is at most 10 people per square km of land area.
Firstly, I sum all data across each row using dataset.sum(axis = 1) and then use a lambda function to divide each total by the number of columns to get the average. I then check whether this value is less than or equal to 10 and use the result as an index for the dataset. The result is all countries whose average population density is less than or equal to 10. I now calculate this average again and this time store it in the variable consolidated_data. The method to plot a scatter plot is similar to other plots with one minor change. We can now define the size of each data point using the parameter s. I have kept the size proportional to the population density. As the values were small, I multiplied each size by 20 so the difference is more prominent.
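A sketch of the filtering and the scatter plot is given below. I restrict the sum to the year columns and divide by their count, which is an assumption about what "the number of columns" means; the data is made up.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: run as a script, not in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values, not real densities.
dataset = pd.DataFrame({
    "Country Name": ["A", "B", "C"],
    "1961": [2.0, 50.0, 8.0],
    "1962": [4.0, 70.0, 12.0],
})
year_cols = dataset.columns[1:]

# Average density per country: row total divided by the number of years.
avg = dataset[year_cols].sum(axis=1).apply(lambda total: total / len(year_cols))

# Keep only the countries whose average is at most 10.
low = dataset[avg <= 10]

# Recompute the average for the filtered rows.
consolidated_data = low[year_cols].sum(axis=1).apply(
    lambda total: total / len(year_cols))

plt.figure()
# Marker size grows with the average density, scaled by 20.
points = plt.scatter(low["Country Name"], consolidated_data,
                     s=consolidated_data * 20)
```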
You can see that each data point is represented on the graph with its own size based on its density value.
Greenland appears to have the lowest average population density amongst all countries in the dataset.
In-depth Analysis
Now, we’ll take a deeper dive into the dataset and see if we can draw more conclusions.
Descriptive Analysis
We can see if there has been any change in the maximum and minimum density values across the world. Thus, we need to calculate the range of values.
We use the min() and max() methods to calculate the minimum and maximum across each column except Country Name. We then store the range in the variable diff. We then calculate the minimum of the maximums of each column and save it to minOfMax. When we plot the bar graph, we subtract this value from all ranges, so that all ranges are compared with respect to the year with the least range. This is done by using the apply(lambda x: x-minOfMax) method inside the bar plot.
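The range calculation can be sketched as follows, using placeholder values. It follows the description literally, so the baseline is the smallest yearly maximum; note that with this baseline the shortest bar can dip slightly below zero.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: run as a script, not in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values, not real densities.
dataset = pd.DataFrame({
    "Country Name": ["A", "B", "C"],
    "2000": [1.0, 100.0, 50.0],
    "2001": [1.0, 150.0, 60.0],
    "2002": [1.0, 120.0, 55.0],
})
years = dataset.columns[1:]

min_values = dataset[years].min()   # lowest density per year
max_values = dataset[years].max()   # highest density per year
diff = max_values - min_values      # range per year
minOfMax = max_values.min()         # smallest of the yearly maximums

plt.figure()
# Shift every range by the same baseline so the differences stand out.
bars = plt.bar(years, diff.apply(lambda x: x - minOfMax))
```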
We see that for the year 2001 there was the maximum gap between the most densely populated country and the least densely populated country and then there was a sharp fall in the year 2002.
Population vs Population Density
It’ll be amazing to explore whether the Population Density is indeed a good measure and whether it also reflects the population of a country. As this dataset does not have the population or area, we need to fetch it from some other source. Here, we’ll use BeautifulSoup to extract the land areas from a Wikipedia page and use them to calculate the population. We’ll then compare the population and population density. You can refer to my article on Web Scraping for a quick kickstart.
The final set has a list of 170 countries for which we have complete information available, including the land area in square km. But as it’ll be difficult to understand and interpret so many countries at one time, we’ll take a look at the first 20.
We now compare the Population and Population Density of 20 countries by plotting bar graphs for them side by side. We find the Population by multiplying the area by the population density for the year 2015. We can divide the canvas into multiple subplots. The method subplot(rows, columns, index) is used to define a subplot. Here, we create a canvas of 4 plots laid out in 2 rows and 2 columns. The third argument tells which plot we are referring to. We plot the data for the 20 countries in two graphs at index 1 and 2.
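The subplot layout can be sketched like this. The column name Area, the figure size and all the numbers are hypothetical; only the 2 by 2 grid and the indexes 1 and 2 come from the text.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: run as a script, not in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values; "Area" is a hypothetical column name for the land area.
data = pd.DataFrame({
    "Country Name": ["X", "Y"],
    "Area": [10.0, 2.0],
    "2015": [5.0, 100.0],
})

# Population = land area * population density.
data["Population"] = data["Area"] * data["2015"]

fig = plt.figure(figsize=(20, 10))

# Slot 1 of the 2 x 2 grid: population.
plt.subplot(2, 2, 1)
plt.bar(data["Country Name"], data["Population"])
plt.title("Population")

# Slot 2 of the 2 x 2 grid: population density.
plt.subplot(2, 2, 2)
plt.bar(data["Country Name"], data["2015"])
plt.title("Population density")
```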
We can see that Population Density is not always the correct measure to describe the Population of a country, for example the country Bahrain.
Even though its population density is very high, the population of Bangladesh is much higher than that of Bahrain.
Conclusion
Here, we used the matplotlib package to design and create plots which helped us understand our dataset better.
Hope you liked my work. Please feel free to share your thoughts and suggestions.