Visualize World Trends using Seaborn in Python

Sambit Mahapatra
Towards Data Science

Exploring and analyzing the data is the first step of any data science project. You need a general sense of the nature and distribution of the data to plan your workflow accordingly. This is where visualization comes in; as the saying goes, “a picture is worth a thousand words”. With informative plots, it is easier to gain insights from the data and to convey those insights to others.

In this post, we will see how to gain insights about world trends data using rich visualization. The dataset contains country names and country codes, along with stats on internet users, birth rates, and life expectancy in 1960 and 2013. The dataset and code are available on GitHub.

Before analyzing the data, we need to import the required dependencies. For rich visualization we will use seaborn, which is built on top of matplotlib and produces more polished and informative plots.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Here I am using Python 3.5, numpy 1.13, pandas 0.18, matplotlib 1.5, seaborn 0.8 and statsmodels 0.8.0. I often run into a flood of deprecation warnings caused by package updates or version incompatibilities, which gets frustrating while coding. Let’s ignore all warnings so that they are not displayed on the screen.

import warnings
warnings.filterwarnings("ignore")
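
Ignoring every warning can also hide messages you would want to see. As a narrower alternative (a sketch, not what the rest of this post uses), the standard-library warnings module can silence only chosen categories, or silence warnings only inside a block. The function noisy_function below is a hypothetical stand-in for a library call that emits a deprecation warning.

```python
import warnings


def noisy_function():
    """Hypothetical stand-in for a library call that emits a deprecation warning."""
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42


# Option 1: silence only selected categories instead of everything.
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

# Option 2: silence warnings only inside a block; the previous filters
# are restored when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy_function()
```

With either option, warnings of other kinds (for example a RuntimeWarning from a division by zero) remain visible.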

Now load the data into a data frame using the read_csv function of pandas.

df = pd.read_csv("world_trend_survey.csv")
df.head() #to show top 5 rows

output-
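
Beyond head(), a quick look at column types and missing values helps before plotting. The sketch below runs on a tiny made-up frame so that it is self-contained; the rows are invented for illustration, and only the column names are taken from the code in this post.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only; the real data comes from the CSV above.
sample = pd.DataFrame({
    "Income Group": ["High income", "Low income", "Upper middle income"],
    "Country Region": ["Europe", "Africa", "Asia"],
    "Internet users": [80.0, 5.0, np.nan],
    "Birth rate": [10.0, 40.0, 18.0],
})

print(sample.dtypes)            # data type of each column
print(sample.describe())        # summary statistics of the numeric columns
missing = sample.isna().sum()   # number of missing values per column
print(missing)
```

The same three calls can be applied to df once the CSV is loaded.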

We can look at the distributions of Internet users, Birth rate, and average Life Expectancy in 1960 and in 2013 using Seaborn's distribution plot. To show the four plots in a 2×2 grid, we create a figure with four subplots and pass one axis to each plot.

f, ax = plt.subplots(2,2,figsize=(8,4))
vis1 = sns.distplot(df["Internet users"],bins=10, ax= ax[0][0])
vis2 = sns.distplot(df["Birth rate"],bins=10, ax=ax[0][1])
vis3 = sns.distplot(df["LifeExp1960"],bins=10, ax=ax[1][0])
vis4 = sns.distplot(df["LifeExp2013"],bins=10, ax=ax[1][1])

The plot looks like-

The life expectancy distributions are interesting. In 1960 the distribution is close to uniform, whereas in 2013 it looks more like a normal distribution. We can also draw box plots of Internet users for each Income Group to see how the two are related. The plot can be saved to a local file by first getting its figure object.

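# Box plot of internet usage for each income group
vis5 = sns.boxplot(data=df, x="Income Group", y="Internet users")
fig = vis5.get_figure()  # the Axes returned by seaborn knows its parent figure
fig.savefig("fig1.png")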

The plot looks like-

Income Group vs Internet Users

The plot makes it quite clear that internet usage increases with income group. Similar insights can be derived by plotting Income Group vs Birth rate, Country Region vs Internet users, and Country Region vs Birth rate.

Country Region vs Internet Users
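
The four box plots mentioned above can be drawn in one figure with a loop over the column pairs. This is a sketch on randomly generated stand-in data, so the boxes themselves carry no meaning; the group and region labels are illustrative placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random stand-in data; category labels are placeholders, not the real ones.
rng = np.random.default_rng(1)
n = 120
demo = pd.DataFrame({
    "Income Group": rng.choice(["Low", "Middle", "High"], n),
    "Country Region": rng.choice(["Africa", "Asia", "Europe"], n),
    "Internet users": rng.uniform(0, 100, n),
    "Birth rate": rng.uniform(8, 50, n),
})

# (categorical column, numeric column) for each of the four plots
pairs = [
    ("Income Group", "Internet users"),
    ("Income Group", "Birth rate"),
    ("Country Region", "Internet users"),
    ("Country Region", "Birth rate"),
]

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, (category, value) in zip(axes.flat, pairs):
    sns.boxplot(data=demo, x=category, y=value, ax=ax)
    ax.set_title(f"{category} vs {value}")
fig.tight_layout()
```

Replacing demo with df should give the real versions of the four plots in a single figure.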

After drawing all four plots, there seems to be a relation between the number of internet users and the birth rate. Let’s draw a joint plot to look at that relation.

vis6 = sns.jointplot(data = df, x = "Internet users", y = "Birth rate", kind='kde') #here kde means kernel density plots

Here, the Pearson coefficient of -0.82 indicates a strong negative linear relationship. The p-value < 0.001 means the correlation is statistically highly significant.

Internet Users vs Birth Rate
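
Newer seaborn releases no longer print the correlation on the joint plot, so the coefficient and p-value may need to be computed separately. One way, assuming SciPy is installed, is scipy.stats.pearsonr. The sketch uses synthetic numbers with a built-in negative trend, so it will not reproduce the -0.82 reported above.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-in data with a negative linear trend plus noise.
rng = np.random.default_rng(2)
internet_users = rng.uniform(0, 100, 200)
birth_rate = 45 - 0.3 * internet_users + rng.normal(0, 5, 200)

# pearsonr returns the correlation coefficient and a two-sided p-value.
r, p_value = pearsonr(internet_users, birth_rate)
print(f"Pearson r = {r:.2f}, p-value = {p_value:.3g}")
```

On the real data frame, rows with missing values in either column would need to be dropped first, for example with df[["Internet users", "Birth rate"]].dropna().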

From the plot it seems that when people use the internet more, they perhaps don’t have much time left to have kids :-D. We also saw that the High income group uses the internet more. So the High income group has a lower birth rate, which is quite intuitive as people there would have more awareness. Other similar insights can be drawn from these plots. To strengthen these conclusions, let’s draw 2D scatter plots with lmplot:

# size=6 works on seaborn 0.8; from seaborn 0.9 onward the argument is called height
vis7 = sns.lmplot(data=df, x="Internet users", y="Birth rate", fit_reg=False,
                  hue="Income Group", size=6, aspect=1.5, scatter_kws={'s': 200})

Here, ‘hue’ colors the markers differently for each category of the given column, ‘size’ is the height of the plot in inches, ‘aspect’ is the aspect ratio (width divided by height), and ‘scatter_kws’ holds keyword arguments that are passed on to matplotlib’s scatter function. The ‘s’ entry sets the marker size to 200 so that the points are easier to see.

Internet Users vs Birth Rate
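
How these arguments fit together can be tried out on stand-in data. The sketch below uses the height keyword (the name that size took from seaborn 0.9 onward) and randomly generated values, so the scatter itself is meaningless; the income group labels are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Random stand-in data; income group labels are placeholders.
rng = np.random.default_rng(3)
n = 90
demo = pd.DataFrame({
    "Internet users": rng.uniform(0, 100, n),
    "Birth rate": rng.uniform(8, 50, n),
    "Income Group": rng.choice(["Low", "Middle", "High"], n),
})

g = sns.lmplot(
    data=demo,
    x="Internet users",
    y="Birth rate",
    hue="Income Group",      # one color per category
    fit_reg=False,           # scatter only, no regression line
    height=6,                # facet height in inches (called size before 0.9)
    aspect=1.5,              # facet width = height * aspect = 9 inches
    scatter_kws={"s": 200},  # passed through to matplotlib's scatter
)
```

lmplot returns a FacetGrid rather than a single Axes, so the figure is saved with g.savefig("name.png") instead of get_figure().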

As is very clear from this plot, the low income group has a high birth rate and low internet usage, and the opposite holds for the high income group. For the lower middle and upper middle income groups, however, the results are quite spread out. Perhaps the Country Region also plays a role here. Let’s plot the same stats against the Country Region.

# size=6 is the seaborn 0.8 spelling; seaborn 0.9+ calls this argument height
vis8 = sns.lmplot(data=df, x="Internet users", y="Birth rate", fit_reg=False,
                  hue="Country Region",
                  size=6, aspect=1.5, scatter_kws={'s': 200})
Internet Users vs Birth Rate

As can be seen here, country region plays an important role in birth rate. In the Europe region, internet usage varies but the birth rate is much the same everywhere. In the African region, most countries have a very high birth rate and very low internet usage. The most interesting results are in the Asia region, where the values are as widely spread as they were for the middle income groups. Perhaps most countries in the Asian region belong to the middle income groups. So, let’s draw a count plot to see the income group distribution across country regions.

sns.countplot(y="Income Group", hue="Country Region", data=df);
Income Group

From the above plot, it can be seen that Africa contributes almost nothing to the High Income Group while its contribution to the Low Income Group is the largest, which is exactly the opposite of Europe’s case. The most interesting scenario is the Asia region, where the income groups seem to be nearly uniformly distributed.
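
The counts behind such a plot can also be read as a table with pandas.crosstab, which makes near-zero cells easy to spot. The rows below are invented for illustration; only the two column names come from the code in this post.

```python
import pandas as pd

# Invented rows for illustration only.
demo = pd.DataFrame({
    "Country Region": ["Africa", "Africa", "Africa", "Europe", "Europe",
                       "Asia", "Asia", "Asia"],
    "Income Group": ["Low income", "Low income", "Lower middle income",
                     "High income", "High income",
                     "Low income", "Upper middle income", "High income"],
})

# Number of countries for every (income group, region) combination.
counts = pd.crosstab(demo["Income Group"], demo["Country Region"])
print(counts)

# Share of each income group within a region (each column sums to 1).
shares = pd.crosstab(demo["Income Group"], demo["Country Region"],
                     normalize="columns")
print(shares.round(2))
```

The normalized version is useful when regions contain very different numbers of countries, since raw counts then exaggerate the larger regions.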

Now, to see how Life Expectancy varies across regions and income groups, let’s use swarm plots.

vis9 = sns.swarmplot(x="LifeExp1960", y="Country Region", hue="Income Group", data=df)
vis9.legend_.remove()
plt.legend(loc="upper left", bbox_to_anchor=(1,1))
plt.show()
Life Expectancy at 1960 vs Country Region

From the above graph, it is visible that in 1960 income was a big factor in life expectancy: the higher the income, the higher the life expectancy. But country region also plays a big role. As can be seen for the Middle East and Oceania, even the high income group has a lower life expectancy there. Let’s see how the trend changed by 2013.

vis10 = sns.swarmplot(x="LifeExp2013", y="Country Region", hue="Income Group", data=df)
vis10.legend_.remove()
plt.legend(loc="upper left", bbox_to_anchor=(1,1))
plt.show()
Life Expectancy at 2013 vs Country Region

In 2013, the situation looks a little different from 1960. Life expectancy has improved in every region, especially for the high income and upper middle income groups. But the African region still seems to lag behind.
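
The improvement can also be quantified by subtracting the two columns and averaging per region. Again, this is a sketch with invented numbers; only the names LifeExp1960, LifeExp2013 and Country Region come from the code above, and LifeExpGain is a hypothetical helper column.

```python
import pandas as pd

# Invented rows for illustration only.
demo = pd.DataFrame({
    "Country Region": ["Africa", "Africa", "Europe", "Europe", "Asia"],
    "LifeExp1960": [40.0, 42.0, 68.0, 70.0, 50.0],
    "LifeExp2013": [58.0, 60.0, 80.0, 82.0, 72.0],
})

# Years of life expectancy gained between 1960 and 2013, per country.
demo["LifeExpGain"] = demo["LifeExp2013"] - demo["LifeExp1960"]

# Average gain per region, largest first.
gain_by_region = (
    demo.groupby("Country Region")["LifeExpGain"]
    .mean()
    .sort_values(ascending=False)
)
print(gain_by_region)
```

Grouping by Income Group instead of Country Region would give the matching view for the income claim.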

Putting ML to Customer Support at CSAT.AI | Natural Language Processing | Full Stack Data Scientist (sambit9238@gmail.com)