Matplotlib — Making data visualization interesting

Published in Towards Data Science, Nov 27, 2018

Data visualization is a key step in understanding a dataset and drawing inferences from it. While one can always inspect the data row by row and cell by cell, doing so is tedious and does not reveal the big picture. Visuals, on the other hand, present data in a form that is easy to understand at a glance and keep the audience engaged.

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. — matplotlib.org

Matplotlib is a basic library that provides a variety of plots along with extensive customizations such as labels, titles and font sizes. I watched numerous videos and read articles online about visualization. To understand Matplotlib better, I took the population density dataset from Kaggle and started creating my own visualizations. This article walks through the plots I created, the customizations I applied and the inferences I drew from the data.

The complete work is available for quick reference in the GitHub repository Visualization using Matplotlib. Let’s begin!

Import libraries

As always, we first need to import all necessary libraries. We import the NumPy and pandas libraries for data handling. We then import matplotlib and use its pyplot module for plotting our data and its cm module for colour palettes. The statement %matplotlib inline ensures that all plots appear inline in the notebook.
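The imports described above might look like the following. This is a minimal sketch: the aliases np, pd and plt are the usual community conventions and are assumed here, since the original notebook cell is not reproduced in the text.

```python
import numpy as np                 # numerical helpers
import pandas as pd                # tabular data handling
import matplotlib                  # the plotting library itself
import matplotlib.pyplot as plt    # pyplot: the plotting interface used throughout
from matplotlib import cm          # cm: colormaps, used here as colour palettes

# In a Jupyter notebook, also run the magic below so plots render inline.
# It is left as a comment because it is not valid syntax in a plain script.
# %matplotlib inline
```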

Import dataset

The first four rows of the file dataset.csv are not required so we can import the data into the variable dataset while skipping the first four rows. We then use head(5) to check the data.
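A sketch of the import step, assuming pandas' read_csv with its skiprows parameter is what does the skipping. To keep the example self-contained, a small in-memory stand-in replaces dataset.csv; its metadata lines and numbers are placeholders, not real figures.

```python
import io
import pandas as pd

# Stand-in for dataset.csv: four leading metadata lines, then the real header.
# All values below are illustrative placeholders.
raw = io.StringIO(
    "meta line 1,x\n"
    "meta line 2,x\n"
    "meta line 3,x\n"
    "meta line 4,x\n"
    "Country Name,Country Code,1961,1962\n"
    "Aruba,ABW,300.0,310.0\n"
    "Andorra,AND,30.0,32.0\n"
)

# skiprows=4 discards the first four lines, so the fifth becomes the header.
# With the real file this would be pd.read_csv("dataset.csv", skiprows=4).
dataset = pd.read_csv(raw, skiprows=4)

print(dataset.head(5))
```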

We immediately see some columns that are not relevant to us. Firstly, we can keep the Country Name but can drop the Country Code. As we know we are dealing with population densities, we can drop columns Indicator Name and Indicator Code. Next, the columns for year 1960 and 2016 have NaN values. NaN stands for Not a Number and we should drop these columns as they provide no information. Lastly, there is an unnamed column which also has NaN values so we can drop Unnamed: 61 as well.
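Dropping these columns could be done with DataFrame.drop, as sketched below. The frame is a tiny placeholder with the same column names as described above. Only a few year columns are included, and the values are illustrative.

```python
import numpy as np
import pandas as pd

# Placeholder frame mimicking the layout described in the text.
dataset = pd.DataFrame({
    "Country Name": ["Aruba", "Andorra"],
    "Country Code": ["ABW", "AND"],
    "Indicator Name": ["Population density"] * 2,
    "Indicator Code": ["EN.POP.DNST"] * 2,
    "1960": [np.nan, np.nan],
    "1961": [300.0, 30.0],
    "2015": [570.0, 160.0],
    "2016": [np.nan, np.nan],
    "Unnamed: 61": [np.nan, np.nan],
})

# Remove the identifier columns we do not need and the all-NaN columns.
dataset = dataset.drop(columns=[
    "Country Code", "Indicator Name", "Indicator Code",
    "1960", "2016", "Unnamed: 61",
])

print(dataset.columns.tolist())
```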

We drop all rows where there may be blank or null values using the dropna method and check if all columns have no null values using dataset.isnull().sum(). We see that all show 0 null values.
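As a small illustration of this cleaning step, the sketch below uses a placeholder frame in which one hypothetical country has missing values.

```python
import numpy as np
import pandas as pd

# Placeholder data: the last row is a made-up country with missing values.
dataset = pd.DataFrame({
    "Country Name": ["Aruba", "Andorra", "Hypothetical Country"],
    "1961": [300.0, 30.0, np.nan],
    "2015": [570.0, 160.0, np.nan],
})

# Drop every row that contains at least one null value.
dataset = dataset.dropna()

# Count the remaining nulls per column; every count should now be 0.
null_counts = dataset.isnull().sum()
print(null_counts)
```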

We are now ready to visualize our data.

Visualisation

Now, we will use Matplotlib to create our plots and use visualizations to draw meaningful conclusions.

Line Plot

We first analyse Aruba's population density over the years using a line plot. We take the years along the x-axis and population density along the y-axis.

We select the x values as dataset.columns[1:], which selects all columns except the first, since we need only the years and not the Country Name column. Next, we select the y values as dataset.iloc[0][1:], which selects the first row (the country Aruba) and all of its columns except the first. The country name can be obtained using dataset.iloc[0][0]. To plot the graph, we simply call the plot function with x and y as the values for the x-axis and y-axis respectively.
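A minimal sketch of this first, uncustomized plot follows. The data frame is a small placeholder with illustrative values, and the cast to float is an addition of this sketch so that the mixed-type row is numeric before plotting.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder data with the same shape as the cleaned dataset.
dataset = pd.DataFrame({
    "Country Name": ["Aruba", "Andorra"],
    "1961": [300.0, 30.0],
    "1962": [310.0, 32.0],
    "1963": [320.0, 34.0],
})

x = dataset.columns[1:]                  # year labels, skipping Country Name
y = dataset.iloc[0, 1:].astype(float)    # first row (Aruba), year columns only
country = dataset.iloc[0, 0]             # "Aruba"

(line,) = plt.plot(x, y)
# plt.show()  # uncomment when running as a script outside a notebook
```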

Congratulations!! Our first plot is ready!!

Yeah, it does the job, but it’s difficult to understand. There are no labels and the axis values overlap. This is where the customization power of matplotlib comes in handy. Let’s make it a bit more interpretable.

The rcParams settings allow us to modify the figure size, font size and a lot more. We then add a title, xlabel and ylabel. xticks allows us to define the angle of rotation of our tick labels, which we set to 90 degrees. We then plot again and set the line width to 4.
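These customizations could be sketched as follows. The figure size, font size and label texts are assumptions chosen for illustration, and the data values are placeholders. The tick rotation is applied after plotting here so that it acts on the final tick labels.

```python
import matplotlib.pyplot as plt

# Placeholder data standing in for Aruba's row.
years = ["1961", "1962", "1963", "1964"]
density = [300.0, 310.0, 320.0, 335.0]

# Global settings: figure size in inches and base font size (assumed values).
plt.rcParams["figure.figsize"] = (20, 10)
plt.rcParams["font.size"] = 20

plt.title("Population density of Aruba over the years")
plt.xlabel("Year")
plt.ylabel("Population density (people per sq. km of land area)")

(line,) = plt.plot(years, density, linewidth=4)   # thicker line
plt.xticks(rotation=90)                           # vertical tick labels
# plt.show()  # uncomment when running as a script outside a notebook
```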

We can now see that the plot is much more descriptive with just a few modifications.

From the plot above, we can see that there was a steady rise in the population density until the 1980s. From the 1990s, the density shot up drastically and kept growing at that pace until it levelled off in the mid-2000s.

We can also use line graphs to see the trend amongst various countries. We take the first 5 countries and compare their population density growth. As we are now dealing with multiple lines on a single graph, we should use different colors for each country and also define a legend that assigns a unique color to each country.

Here, we loop through the first 5 rows of the dataset and plot the values as line graphs. We select the colors using the cm module of matplotlib and its rainbow colormap. In the plot method, we must specify the label parameter, as it ensures that the country names are shown when the legend is enabled. The legend is enabled using the legend() method, where I have specified one property, namely a size of 24.
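A sketch of the loop described above, under the assumption that the legend size is passed through the prop argument of legend(). The five rows use the country names from the article, but the numbers are placeholders.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm

# Placeholder values; only the country names come from the article.
dataset = pd.DataFrame({
    "Country Name": ["Aruba", "Andorra", "Afghanistan", "Angola", "Albania"],
    "1961": [300.0, 30.0, 14.0, 4.0, 60.0],
    "1962": [310.0, 32.0, 15.0, 5.0, 62.0],
    "1963": [320.0, 34.0, 16.0, 6.0, 64.0],
})

# One evenly spaced colour per country, sampled from the rainbow colormap.
colors = cm.rainbow(np.linspace(0, 1, 5))

for i in range(5):
    plt.plot(
        dataset.columns[1:],
        dataset.iloc[i, 1:].astype(float),
        color=colors[i],
        label=dataset.iloc[i, 0],   # the label is what the legend displays
        linewidth=4,
    )

legend = plt.legend(prop={"size": 24})
# plt.show()  # uncomment when running as a script outside a notebook
```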

Comparing 5 countries based on their Population Density

We can see that for all 5 countries (Aruba, Andorra, Afghanistan, Angola and Albania), there has been a rise in the population density.

Now that we have plotted all line graphs on a single plot, it’s very easy to see that Aruba has always had a higher population density than the other 4 countries.

Bar Plot

A bar plot can be created easily using the bar() method with the relevant arguments. We begin by plotting the population density of all countries for the year 2015.

The plot tries to show a lot of information at once, and for that very reason it fails to convey anything useful. There is so much overlap in the x-axis labels that the whole plot is rendered useless. We need not only to visualize our dataset but also to choose cleverly which parts of it to visualize. It’s better to pick the 10 countries with the highest density and take a look at them.

We first sort the countries in descending order of their population density for the year 2015 using the sort_values method. We then select the top 10 using the head(10) method and draw our bar plot with this new data.
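The sort-then-plot step might look like this. The country names and densities are hypothetical placeholders, and colouring the bars with the rainbow colormap is an assumption based on the colours mentioned for the earlier plots.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm

# Twelve hypothetical countries with placeholder 2015 densities.
dataset = pd.DataFrame({
    "Country Name": [f"Country {c}" for c in "ABCDEFGHIJKL"],
    "2015": [12.0, 950.0, 40.0, 7.0, 300.0, 88.0,
             1500.0, 61.0, 23.0, 410.0, 5.0, 130.0],
})

# Sort by the 2015 column, highest first, and keep the first 10 rows.
top10 = dataset.sort_values("2015", ascending=False).head(10)

colors = cm.rainbow(np.linspace(0, 1, len(top10)))
bars = plt.bar(top10["Country Name"], top10["2015"], color=colors)
plt.xticks(rotation=90)
# plt.show()  # uncomment when running as a script outside a notebook
```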

Top 10 most densely populated countries for 2015

The information is much clearer now. The bars are equally spaced and distinctly represented using different colors.

Macao SAR and Monaco have the highest population densities among all countries available in the dataset for the year 2015.

Scatter Plot

A scatter plot displays the data as individual points. Scatter plots are really useful because we can set the size of each data point based on a certain value, so the data itself shows how each point differs from the others.

We now analyze the countries where the average population density across all the years is at most 10 people per square km of land area.

Firstly, I sum all data across each row using dataset.sum(axis = 1) and then use a lambda function to divide each total by the number of columns to get the average. I then check whether this value is less than or equal to 10 and use the result as an index for the dataset. The result is all countries whose average population density is less than or equal to 10. I now calculate the average again for these countries and this time store it in the variable consolidated_data. The method to plot a scatter plot is similar to other plots with one minor change: we can define the size of each data point using the parameter s. I have kept the size proportional to the population density. As the values were small, I multiplied each size by 20 so that the difference is more prominent.
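The filtering and scatter steps can be sketched as below. The countries and values are hypothetical placeholders. This sketch sums only the year columns, because summing a frame that still contains the text column Country Name can raise an error in recent pandas versions. The helper names years, average and sparse are introduced here for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical countries with placeholder densities.
dataset = pd.DataFrame({
    "Country Name": ["Country A", "Country B", "Country C", "Country D"],
    "1961": [120.0, 0.5, 45.0, 4.0],
    "1962": [130.0, 0.5, 55.0, 6.0],
})

years = dataset.columns[1:]

# Row-wise total divided by the number of year columns gives the average.
average = dataset[years].sum(axis=1).apply(lambda total: total / len(years))

# Boolean index: keep only countries whose average is at most 10.
sparse = dataset[average <= 10]

# Recompute the average for the remaining countries.
consolidated_data = sparse[years].sum(axis=1).apply(
    lambda total: total / len(years)
)

# s sets the marker size; scaling by 20 makes small differences visible.
points = plt.scatter(
    sparse["Country Name"], consolidated_data, s=consolidated_data * 20
)
# plt.show()  # uncomment when running as a script outside a notebook
```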

Countries with average Population Density less than or equal to 10

You can see that each data point is represented on the graph with its own size based on its density value.

Greenland appears to have the lowest average population density amongst all countries in the dataset.

In-depth Analysis

Now, we’ll take a deeper dive into the dataset and see if we can draw more conclusions.

Descriptive Analysis

We can see if there has been any change in the maximum and minimum density values across the world. Thus, we would need to calculate the range of values.

We use the min() and max() methods to calculate the minimum and maximum of each column except Country Name. We then store the range, that is the maximum minus the minimum, in the variable diff. Next, we calculate the minimum of the per-column maximums and save it to minOfMax. When we plot the bar graph, we subtract this value from all ranges so that every year is compared against the same baseline. This is done by using the apply(lambda x: x-minOfMax) method inside the bar plot call.
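A sketch that follows the description above literally. The variable names diff and minOfMax come from the text, while minimum, maximum and shifted are names assumed for this example. The data is a placeholder. Note that with this baseline, the year with the smallest maximum ends up with a bar at or below zero.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical countries with placeholder densities.
dataset = pd.DataFrame({
    "Country Name": ["Country A", "Country B", "Country C"],
    "1961": [1.0, 50.0, 100.0],
    "1962": [2.0, 60.0, 120.0],
    "1963": [1.0, 55.0, 110.0],
})

years = dataset.columns[1:]

minimum = dataset[years].min()   # per-year minimum across countries
maximum = dataset[years].max()   # per-year maximum across countries
diff = maximum - minimum         # per-year range
minOfMax = maximum.min()         # smallest of the yearly maximums

# Shift every range by the same baseline before plotting.
shifted = diff.apply(lambda x: x - minOfMax)
bars = plt.bar(years, shifted)
# plt.show()  # uncomment when running as a script outside a notebook
```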

We see that the gap between the most densely populated country and the least densely populated country was largest in the year 2001, followed by a sharp fall in the year 2002.

Population vs Population Density

It would be interesting to explore whether Population Density is indeed a good measure and whether it also reflects the population of a country. As this dataset does not have the population or area, we need to fetch it from some other source. Here, we’ll use BeautifulSoup to extract the land areas from a Wikipedia page and use them to calculate the population. We’ll then compare the population and population density. You can refer to my article on Web Scraping for a quick kickstart.

The final set has a list of 170 countries for which we have complete information available, including the land area in square km. But as it would be difficult to interpret so many countries at one time, we’ll take a look at the first 20.

We now compare the Population and Population Density of 20 countries by plotting bar graphs for them side by side. We find the Population by multiplying the area by population density for year 2015. We can divide the canvas into multiple subplots. The method subplot(rows, columns, index) is used to define a subplot. Here, we create a canvas of 4 plots identified by 2 rows and 2 columns. The third argument tells which graph we are referring to here. We plot the data for 20 countries in two graphs at index 1 and 2.
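The side-by-side comparison could be sketched as follows. The column name Area, the figure size and the titles are assumptions for illustration, and the three hypothetical countries with placeholder values stand in for the 20 used in the article.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder data: 2015 density (people per sq. km) and land area (sq. km).
data = pd.DataFrame({
    "Country Name": ["Country A", "Country B", "Country C"],
    "2015": [10.0, 200.0, 50.0],
    "Area": [1000.0, 50.0, 400.0],
})

# Population = land area x population density.
data["Population"] = data["Area"] * data["2015"]

plt.figure(figsize=(20, 10))

# 2 x 2 grid, first cell: population.
plt.subplot(2, 2, 1)
plt.title("Population")
plt.bar(data["Country Name"], data["Population"])
plt.xticks(rotation=90)

# 2 x 2 grid, second cell: population density.
plt.subplot(2, 2, 2)
plt.title("Population Density")
plt.bar(data["Country Name"], data["2015"])
plt.xticks(rotation=90)
# plt.show()  # uncomment when running as a script outside a notebook
```

Only two of the four grid cells are filled, which leaves the bottom row of the canvas empty. A 1 x 2 grid would use the space more fully.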

We can see that Population Density does not always reflect the Population of a country. Bahrain is a good example.

Even though its population density is very high, its population is much lower than that of Bangladesh.

Conclusion

Here, we used the matplotlib package to design and create plots which helped us understand our dataset better.

Hope you liked my work. Please feel free to share your thoughts and suggestions.