Data Visualization using Matplotlib

Badreesh Shetty
Towards Data Science
11 min read · Nov 12, 2018

Data visualization is an important part of business activities, as organizations now collect huge amounts of data. Sensors all over the world collect climate data, user data through clicks, car data for steering prediction, and so on. All of this data holds key insights for businesses, and visualizations make these insights easy to interpret.

Data is only as good as it’s presented.

Why are visualizations important?

Visualizations are the easiest way to analyze and absorb information. Visuals make complex problems easier to understand. They help in identifying patterns, relationships, and outliers in data, and in understanding business problems better and more quickly. They help build a compelling story, and the insights gathered from them help in building strategies for businesses. Visualization is also a precursor to much higher-level data analysis, such as Exploratory Data Analysis (EDA) and Machine Learning (ML).

Human beings are visual creatures. Countless studies show that our brains are wired for the visual and process information faster when it arrives through the eye.

“Even if your role does not directly involve the nuts and bolts of data science, it is useful to know what data visualization can do and how it is realized in the real world.”

- Ramie Jacobson

Data visualization in Python can be done via many packages. We’ll be discussing the matplotlib package. It can be used in Python scripts, Jupyter notebooks, and web application servers.

Matplotlib

Matplotlib is a 2-D plotting library that helps in visualizing figures. Matplotlib emulates MATLAB-like graphs and visualizations. MATLAB is not free, is difficult to scale, and is tedious as a programming language. So matplotlib in Python is used instead, as it is a robust, free, and easy-to-use library for data visualization.

Anatomy of Matplotlib Figure

Anatomy of Matplotlib

The Figure is the overall window where plotting happens; contained within the Figure are the Axes, where the actual graphs are plotted. Every Axes has an x-axis and a y-axis for plotting, and contained within the Axes are the titles, ticks, and labels associated with each axis. An essential feature of matplotlib is that we can have more than one Axes in a figure, which helps in building multiple plots. In matplotlib, pyplot is used to create figures and change their characteristics.
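
To make the Figure and Axes relationship concrete, here is a minimal sketch with one Figure holding two Axes. The data and titles are made up purely for illustration.

```python
import matplotlib.pyplot as plt

# One Figure (the overall window) holding two Axes (the actual plots).
fig, (ax_left, ax_right) = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

# Each Axes has its own x-axis, y-axis, title, ticks, and labels.
ax_left.plot([1, 2, 3, 4], [1, 4, 9, 16])
ax_left.set_title("Squares")
ax_left.set_xlabel("x")
ax_left.set_ylabel("x squared")

ax_right.plot([1, 2, 3, 4], [1, 8, 27, 64])
ax_right.set_title("Cubes")
ax_right.set_xlabel("x")
ax_right.set_ylabel("x cubed")

fig.tight_layout()  # keeps the labels of neighbouring Axes from overlapping
# Call plt.show() to display the figure when running outside a notebook.
```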

Installing Matplotlib

Type !pip install matplotlib in a Jupyter Notebook cell or, if that doesn’t work, type conda install -c conda-forge matplotlib in cmd. This should work in most cases.

Things to follow

Plotting with matplotlib is quite easy. Generally, every plot follows the same steps. Matplotlib has a module called pyplot which aids in plotting figures. The Jupyter notebook is used for running the plots. We import matplotlib.pyplot as plt so that the module can be called by that short name.

  • Importing required libraries and the dataset to plot using Pandas pd.read_csv()
  • Extracting important parts for plots using conditions on Pandas DataFrames.
  • plt.plot() for plotting a line chart; similarly, other functions are used in place of plot for other chart types. All plotting functions require data, which is provided to the function through parameters.
  • plt.xlabel , plt.ylabel for labeling the x and y-axis respectively.
  • plt.xticks , plt.yticks for labeling the x and y-axis observation tick points respectively.
  • plt.legend() for signifying the observation variables.
  • plt.title() for setting the title of the plot.
  • plt.show() for displaying the plot.
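
The steps above can be sketched end to end as follows. This is only an illustration: the inline CSV, its column names (year, park, visitors), and the values are hypothetical stand-ins for a real file passed to pd.read_csv().

```python
import io

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical inline CSV standing in for a real file on disk.
csv_text = """year,park,visitors
2015,A,10
2016,A,12
2017,A,15
2015,B,7
2016,B,9
2017,B,8
"""

# Step 1: import the dataset with pandas.
df = pd.read_csv(io.StringIO(csv_text))

# Step 2: extract the part to plot using a condition on the DataFrame.
park_a = df[df["park"] == "A"]

# Step 3: plot the data as a line chart.
fig, ax = plt.subplots()
plt.plot(park_a["year"], park_a["visitors"], label="Park A")

# Step 4: label the x and y-axis.
plt.xlabel("Year")
plt.ylabel("Visitors")

# Step 5: put a tick at every observed year.
plt.xticks(list(park_a["year"]))

# Step 6: add the legend and the title.
plt.legend()
plt.title("Visitors per year")

# Step 7: call plt.show() to display the plot when running outside a notebook.
```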

Histogram

A histogram takes in a series of data and divides the data into a number of bins. It then plots the frequency of data points in each bin (i.e. the interval of points). It is useful in understanding how many data points fall into each range.

When to use: We should use a histogram when we need the count of a variable in a plot.

eg: Number of particular games sold in a store.

This histogram shows GrandCanyon visitors over the years. plt.hist() takes numeric data as its first argument, plotted on the horizontal axis, i.e. the GrandCanyon visitors. bins=10 is used to create 10 bins between the values of visitors in GrandCanyon.

plt.hist() also returns the components that make up a histogram: n is the count of values in each bin of the histogram, i.e. 5, 9, and so on, alongside the bin edges and the drawn patches.

The cumulative property makes each bin show the running total of all bins up to it, which helps us understand the increase in value at each bin.

The range parameter helps us understand the distribution of values between specified lower and upper bounds.

Multiple histograms are useful in understanding the distribution between 2 entity variables. We can see that GrandCanyon has comparably more visitors than BryceCanyon.
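
A minimal sketch of these histogram variants is shown below. The visitor counts are randomly generated for illustration and are not the real park figures; the variable names are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical yearly visitor counts (in thousands), generated for illustration.
rng = np.random.default_rng(seed=0)
grand_canyon = rng.normal(loc=4500, scale=600, size=50)
bryce_canyon = rng.normal(loc=1500, scale=400, size=50)

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))

# Basic histogram: n holds the count per bin, bins holds the 11 edges of the 10 bins.
n, bins, patches = ax1.hist(grand_canyon, bins=10)

# Cumulative histogram, restricted to values between 3000 and 6000.
n_cum, _, _ = ax2.hist(grand_canyon, bins=10, cumulative=True, range=(3000, 6000))

# Two histograms on the same Axes to compare the two parks.
ax3.hist(grand_canyon, bins=10, alpha=0.6, label="GrandCanyon")
ax3.hist(bryce_canyon, bins=10, alpha=0.6, label="BryceCanyon")
ax3.legend()
```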

Implementation: Histogram

Pie Chart

It is a circular plot which is divided into slices to illustrate numerical proportion. Each slice of a pie chart shows the proportion of a part out of the whole.

When to use: A pie chart should be used seldom, as it is difficult to compare sections of the chart. A bar plot is used instead, as comparing sections is easy.

eg: Market share in Films.

Note: A pie chart is not a good chart for illustrating information.

plt.pie() takes the numeric data as its 1st argument, i.e. Percentage, and the labels to display through the labels parameter, i.e. Sector. Ultimately, it shows the distribution of the data in proportion to the pie.

We can also see the components that make up a pie chart: plt.pie() returns the wedge objects, the label texts, and so on.

A pie chart can be easily customized, for example by formatting the colors and the label values.

explode is used to separate out slices from the pie, similar to a pizza piece being cut.
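
As a rough sketch of these options together, the example below uses hypothetical market-share numbers and studio names; only the plt.pie() parameters themselves (labels, explode, colors, autopct) come from matplotlib.

```python
import matplotlib.pyplot as plt

# Hypothetical market-share data, made up for illustration.
sectors = ["Studio A", "Studio B", "Studio C", "Other"]
percentage = [40, 30, 20, 10]

fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(
    percentage,
    labels=sectors,                 # text shown next to each slice
    explode=(0.1, 0, 0, 0),         # pull the first slice out of the pie
    colors=["tab:blue", "tab:orange", "tab:green", "tab:gray"],
    autopct="%1.1f%%",              # format of the value printed inside each slice
)
ax.axis("equal")  # equal aspect ratio so the pie is drawn as a circle
```

With autopct set, the call returns three lists: the wedges, the label texts, and the formatted value texts.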

Implementation: Pie Chart

Time Series by line plot

A time series is drawn as a line plot, which basically connects data points with straight lines. It is useful in understanding the trend over time. The trend can also hint at the correlation between the variable and time: an upward trend means a positive correlation and a downward trend means a negative correlation. It is mostly used in forecasting and monitoring models.

When to use: Time Series should be used when single or multiple variables are to be plotted over time.

eg: Stock Market Analysis of Companies, Weather Forecasting.

First, convert Date to a pandas DateTime for easier plotting of the data.

Here, fig.add_axes is used to place axes on the figure canvas. See “What are the differences between add_axes and add_subplot?” to understand axes and subplots. plt.plot() takes the x-axis data as its 1st argument, i.e. Date, and the numeric stock data as its 2nd argument. AAPL stock is plotted on ax1, which is the outer axes, and IBM stock is plotted on ax2, which is the inset.
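
A minimal sketch of the date conversion and the inset layout follows. The prices are made-up numbers rather than real quotes, and the inset position is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up prices for illustration only; these are not real quotes.
stocks = pd.DataFrame({
    "Date": ["2018-01-01", "2018-02-01", "2018-03-01", "2018-04-01"],
    "AAPL": [100.0, 104.0, 103.0, 110.0],
    "IBM": [80.0, 82.0, 81.0, 85.0],
})

# Convert the Date strings to pandas datetimes for easier plotting.
stocks["Date"] = pd.to_datetime(stocks["Date"])

fig = plt.figure(figsize=(8, 5))

# add_axes takes [left, bottom, width, height] as fractions of the figure.
ax1 = fig.add_axes([0.1, 0.1, 0.85, 0.8])   # outer axes
ax2 = fig.add_axes([0.15, 0.6, 0.3, 0.25])  # inset axes inside the outer one

ax1.plot(stocks["Date"], stocks["AAPL"], color="green")
ax1.set_title("AAPL")

ax2.plot(stocks["Date"], stocks["IBM"], color="blue")
ax2.set_title("IBM")
```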

In the earlier figure, add_axes is used to add an axes to a figure, whereas here add_subplot adds multiple subplots to a figure. With a 2 x 3 grid, fig.add_subplot(237) cannot be done as there are only 6 subplots possible.
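
The grid behaviour can be checked with a short sketch like this one, where the titles are placeholders:

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(9, 5))

# A 2 x 3 grid has six slots; the last number selects slot 1 to 6.
for position in range(1, 7):
    ax = fig.add_subplot(2, 3, position)  # same as fig.add_subplot(230 + position)
    ax.set_title(f"Subplot {position}")

# Asking for a seventh slot in a 2 x 3 grid is rejected by matplotlib.
try:
    fig.add_subplot(237)
    seventh_allowed = True
except ValueError:
    seventh_allowed = False
```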

We can see that the tech company stocks are following an upward trend showing positive results for traders to invest in stocks.

Implementation: Time Series

Boxplot and Violinplot

Boxplot

Boxplot gives a nice summary of the data. It helps in understanding our distribution better.

When to use: It should be used when we need the overall statistical information on the distribution of the data. It can be used to detect outliers in the data.

eg: Credit Score of Customer. We can get the max, min and much more information about the score.

Understanding Boxplot

Source: How to Read and Use a Box-and-Whisker Plot

In the diagram, the line that divides the box into 2 parts represents the median of the data. The end of the box shows the upper quartile (75%) and the start of the box represents the lower quartile (25%). The upper quartile is also called the 3rd quartile and, similarly, the lower quartile is also called the 1st quartile. The region between the lower quartile and the upper quartile is called the Inter Quartile Range (IQR), and it is used to approximate the spread of the middle 50% of the data (75 - 25 = 50%). The maximum is the highest value in the data and, similarly, the minimum is the lowest value in the data; the short lines marking them are called caps. The lines that extend from the box out to the minimum and the maximum are called whiskers; they show the range of values in the data. The extreme points beyond them are outliers. A commonly used rule is that a value is an outlier if it is less than the lower quartile - 1.5 * IQR or higher than the upper quartile + 1.5 * IQR.
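
The 1.5 * IQR rule can be worked through numerically. The sketch below uses a small set of made-up scores and NumPy's default (linear interpolation) percentile method; other quartile conventions give slightly different fences.

```python
import numpy as np

# Hypothetical scores, with one obviously extreme value.
scores = np.array([52, 55, 58, 60, 61, 63, 65, 66, 70, 98])

# Lower quartile, median, and upper quartile.
q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1

# Values beyond these fences are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

print(q1, median, q3, iqr)       # 58.5 62.0 65.75 7.25
print(lower_fence, upper_fence)  # 47.625 76.625
print(outliers)                  # [98]
```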

bp contains the boxplot components like boxes, whiskers, medians, and caps. Seaborn, another plotting library, makes it easier to build custom plots than matplotlib. patch_artist makes the customization possible. notch makes the median look more prominent.

A caveat of using a boxplot is that it does not show the number of observations at each value. A jitter plot in Seaborn can overcome this caveat, or a violin plot is also useful.
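
A minimal sketch of a customized boxplot is given below. The student scores are randomly generated, and the subject names and colors are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical student scores, generated for illustration.
rng = np.random.default_rng(seed=1)
math_scores = rng.normal(65, 10, size=100)
reading_scores = rng.normal(70, 8, size=100)

fig, ax = plt.subplots()

# patch_artist=True draws the boxes as fillable patches; notch=True notches the median.
bp = ax.boxplot([math_scores, reading_scores], patch_artist=True, notch=True)
ax.set_xticks([1, 2])
ax.set_xticklabels(["Math", "Reading"])

# bp is a dict of the drawn components, which can be styled one by one.
for box, color in zip(bp["boxes"], ["lightblue", "lightgreen"]):
    box.set_facecolor(color)
for median_line in bp["medians"]:
    median_line.set_color("red")
```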

Violin plot

A violin plot is a better chart than a boxplot, as it gives a much broader understanding of the distribution. It resembles a violin, and its wider areas indicate where more of the data is concentrated, which is otherwise hidden by box plots.

When to use: It is an extension of the boxplot. It should be used when we require a better intuitive understanding of the data.

The density of points is highest in the middle, as students tend to score around the average in most subjects.
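
For comparison with the boxplot, here is a small violin plot sketch on randomly generated scores (hypothetical data, not the article's dataset):

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical student scores, generated for illustration.
rng = np.random.default_rng(seed=1)
math_scores = rng.normal(65, 10, size=100)
reading_scores = rng.normal(70, 8, size=100)

fig, ax = plt.subplots()

# The width of each violin follows the estimated density of the scores.
parts = ax.violinplot([math_scores, reading_scores], showmedians=True)
ax.set_xticks([1, 2])
ax.set_xticklabels(["Math", "Reading"])

# The returned dict exposes the drawn pieces, e.g. the violin bodies.
for body in parts["bodies"]:
    body.set_alpha(0.6)
```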

Implementation: Boxplot & Violinplot

TwinAxis

TwinAxis helps in visualizing 2 plots with different y-axes on the same x-axis.

When to use: It should be used when we require 2 plots or grouped data in the same direction.

Eg: Population, GDP data in the same x-axis (Date).

Plotting 2 Plots w.r.t the y-axis and same x-axis

Extract the important details, i.e. Date for the x-axis, and TempAvgF and WindAvgMPH for the two different y-axes.

As there is only 1 axes to begin with, twinx() is used to create a twin that shares the x-axis: the left y-axis is used for Temp and the right y-axis is used for WindMPH.
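
A minimal twinx() sketch is shown below. The column names follow the text (Date, TempAvgF, WindAvgMPH), but the values are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical weather readings, made up for illustration.
weather = pd.DataFrame({
    "Date": pd.date_range("2018-01-01", periods=5, freq="D"),
    "TempAvgF": [45, 50, 48, 55, 60],
    "WindAvgMPH": [5, 7, 4, 6, 3],
})

fig, ax_temp = plt.subplots()

# Left y-axis: temperature.
ax_temp.plot(weather["Date"], weather["TempAvgF"], color="tab:red")
ax_temp.set_xlabel("Date")
ax_temp.set_ylabel("TempAvgF", color="tab:red")

# twinx() creates a second Axes that shares the x-axis and has its y-axis on the right.
ax_wind = ax_temp.twinx()
ax_wind.plot(weather["Date"], weather["WindAvgMPH"], color="tab:blue")
ax_wind.set_ylabel("WindAvgMPH", color="tab:blue")
```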

Plotting the same data in different units and the same x-axis

A function is defined for converting the data to a different unit, i.e. from Fahrenheit to Celsius.

We can see that Temp in Fahrenheit is plotted on the left y-axis and Temp in Celsius is plotted on the right y-axis.
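
One way to sketch this, assuming a helper named fahrenheit_to_celsius (a hypothetical name) and made-up temperatures, is to plot once in Fahrenheit and set the limits of the twin axis to the converted limits so both scales line up:

```python
import matplotlib.pyplot as plt


def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from Fahrenheit to Celsius."""
    return (temp_f - 32) * 5.0 / 9.0


# Hypothetical daily temperatures in Fahrenheit.
temps_f = [45, 50, 48, 55, 60]

fig, ax_f = plt.subplots()
ax_f.plot(range(len(temps_f)), temps_f)
ax_f.set_ylabel("Temp (F)")

# The twin axis shows the same data range, expressed in Celsius.
ax_c = ax_f.twinx()
low_f, high_f = ax_f.get_ylim()
ax_c.set_ylim(fahrenheit_to_celsius(low_f), fahrenheit_to_celsius(high_f))
ax_c.set_ylabel("Temp (C)")
```

Nothing is plotted on the second axis here; it only relabels the same line in another unit.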

Implementation: TwinAxis

Stack Plot and Stem Plot

Stack Plot

Stack plot visualizes data in stacks and shows the distribution of data over time.

When to use: It is used for checking multiple variable area plots in a single plot.

Eg: It is useful in understanding the change of distribution in multiple variables over an interval.

As a stack plot requires stacking, the data is stacked using np.vstack().

plt.stackplot() takes the numeric x data as its 1st argument, i.e. the year, and the vertically stacked data as its 2nd argument, i.e. the national parks.
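
A small sketch of stacking and plotting follows. The visitor numbers are hypothetical, and Badlands is included only as a third example series.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical visitor counts (in thousands) per park and year.
years = np.array([2014, 2015, 2016, 2017])
grand_canyon = np.array([4700, 5500, 5900, 6200])
bryce_canyon = np.array([1400, 1700, 2300, 2500])
badlands = np.array([860, 980, 990, 1050])

# np.vstack stacks the series row by row: one row per park, one column per year.
stacked = np.vstack([grand_canyon, bryce_canyon, badlands])

fig, ax = plt.subplots()
ax.stackplot(years, stacked, labels=["GrandCanyon", "BryceCanyon", "Badlands"])
ax.set_xticks(years)
ax.legend(loc="upper left")
```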

Percentage Stacked plot

Similar to a stack plot, but each value is converted into the percentage of the total it holds.

data_prec holds each value as a percentage of its row total. s = np_data.sum(axis=1) computes the total of each row (summing across the columns), and np_data.divide(s, axis=0) divides each row by its total.
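
Here is a hedged sketch of that calculation. It assumes np_data is a pandas DataFrame with one row per year and one column per park (the divide(..., axis=0) call suggests a DataFrame), and that the result is multiplied by 100; the numbers are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical visitor counts: one row per year, one column per park.
np_data = pd.DataFrame(
    {
        "GrandCanyon": [4700, 5500, 5900, 6200],
        "BryceCanyon": [1400, 1700, 2300, 2500],
        "Badlands": [860, 980, 990, 1050],
    },
    index=[2014, 2015, 2016, 2017],
)

# Total visitors per year (sum across the columns of each row).
s = np_data.sum(axis=1)

# Divide every row by its own total and scale to percent.
data_prec = np_data.divide(s, axis=0) * 100

fig, ax = plt.subplots()
# stackplot expects one row per series, hence the transpose.
ax.stackplot(np_data.index.to_numpy(), data_prec.T.to_numpy(), labels=list(np_data.columns))
ax.set_ylabel("Share of visitors (%)")
ax.legend(loc="upper left")
```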

Stem Plot

A stem plot can even take negative values, so the difference between consecutive data points is taken and plotted over time.

When to use: It is similar to a stack plot, but plotting the differences helps in comparing the data points.

diff() is used to find the difference from the previous data point, and the result is stored in another copy of the data. The first data point is NaN (Not a Number) as there is no previous data point for calculating the difference.

Subplots numbered (31n) are created to accommodate 3 rows and 1 column of subplots in the figure. plt.stem() takes the 1st argument as numeric data, i.e. the year, and the 2nd argument as the numeric data of the National Park visitors.
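
A minimal sketch of the differencing and the three stacked stem plots is below. The data is hypothetical, and this version drops the leading NaN before plotting, which is a choice made here rather than something the text prescribes.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical visitor counts (in thousands) per park and year.
visitors = pd.DataFrame(
    {
        "GrandCanyon": [4700, 5500, 5900, 6200],
        "BryceCanyon": [1400, 1700, 2300, 2200],
        "Badlands": [860, 980, 950, 1050],
    },
    index=pd.Index([2014, 2015, 2016, 2017], name="Year"),
)

# Year-over-year change; the first row is NaN because it has no previous year.
diffs = visitors.diff()

fig = plt.figure(figsize=(6, 8))
for n, park in enumerate(diffs.columns, start=1):
    ax = fig.add_subplot(3, 1, n)        # the (31n) layout: 3 rows, 1 column
    changes = diffs[park].dropna()       # skip the leading NaN
    ax.stem(changes.index.to_numpy(), changes.to_numpy())
    ax.set_title(park)
fig.tight_layout()
```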

Implementation: Stack Plot & Stem Plot

Bar Plot

A bar plot shows the distribution of data over several groups. It is commonly confused with a histogram, which only takes numerical data for plotting. It helps in comparing multiple numeric values.

When to use: It is used when we need to compare several groups.

Eg: Student marks in an exam.

plt.bar() takes the positions of the bars as its 1st argument, i.e. the labels in numeric format, and the values they represent (the bar heights) as its 2nd argument.
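
As an illustration, with made-up student names and marks:

```python
import matplotlib.pyplot as plt

# Hypothetical exam marks.
students = ["Asha", "Ben", "Chen", "Dana"]
marks = [72, 85, 64, 91]

# Numeric positions for the bars; the names are attached as tick labels.
positions = list(range(len(students)))

fig, ax = plt.subplots()
bars = ax.bar(positions, marks)
ax.set_xticks(positions)
ax.set_xticklabels(students)
ax.set_ylabel("Marks")
ax.set_title("Student marks in an exam")
```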

Implementation: Bar Plot

Scatter Plot

Scatter plot helps in visualizing 2 numeric variables. It helps in identifying the relationship of the data with each variable i.e correlation or trend patterns. It also helps in detecting outliers in the plot.

When to use: It is used in Machine Learning concepts like regression, where x and y are continuous variables. It is also used in clustering or outlier detection.

plt.scatter() takes 2 numeric arguments for scattering data points in the plot. It is similar to a line plot, except without the connecting straight lines. By corr we mean correlation, i.e. how correlated GDP is with life expectancy. As it is positive, it means that as the GDP of a country increases, life expectancy increases too.

By taking the log of GDP, we can see there is a much better correlation, as the points fit a line better. This converts GDP to a log scale (base 10), i.e. log($1000)=3.
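
The effect of the log transform can be sketched with a handful of made-up (GDP, life expectancy) pairs; np.corrcoef is used here as one way to compute the correlation.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical GDP per capita (in $) and life expectancy (in years).
gdp = np.array([500, 1000, 2000, 5000, 10000, 20000, 40000], dtype=float)
life_expectancy = np.array([50, 55, 60, 66, 71, 76, 81], dtype=float)

# Correlation on the raw scale.
corr_raw = np.corrcoef(gdp, life_expectancy)[0, 1]

# Base-10 log of GDP, so that $1000 maps to 3.
log_gdp = np.log10(gdp)
corr_log = np.corrcoef(log_gdp, life_expectancy)[0, 1]

fig, (ax_raw, ax_log) = plt.subplots(1, 2, figsize=(9, 4))
ax_raw.scatter(gdp, life_expectancy)
ax_raw.set_xlabel("GDP")
ax_raw.set_ylabel("Life expectancy")
ax_raw.set_title(f"corr = {corr_raw:.2f}")

ax_log.scatter(log_gdp, life_expectancy)
ax_log.set_xlabel("log10(GDP)")
ax_log.set_title(f"corr = {corr_log:.2f}")
```

For this made-up data the points fall almost on a straight line once GDP is on a log scale, so the second correlation is the higher of the two.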

3D Scatterplot

A 3D scatterplot helps in visualizing 3 numerical variables in a three-dimensional plot.

It is similar to scatter, except that we pass 3 numerical variables this time. By looking at the plot we can infer that as the year and GDP increase, life expectancy increases too.
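
A minimal 3D scatter sketch, again on hypothetical year, GDP, and life expectancy values:

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3d projection on older matplotlib)

# Hypothetical values for a single country over time.
years = [1990, 1995, 2000, 2005, 2010, 2015]
gdp = [2000, 3500, 6000, 11000, 18000, 26000]
life_expectancy = [62, 64, 67, 70, 73, 75]

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")

# scatter on a 3D Axes takes x, y, and z values.
ax.scatter(years, gdp, life_expectancy)
ax.set_xlabel("Year")
ax.set_ylabel("GDP")
ax.set_zlabel("Life expectancy")
```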

Implementation: Scatter Plot

Find the above code in this GitHub Repo.

Conclusion

In summary, we learned how to build data visualization plots using one numeric variable and multiple variables. We can now easily build plots for understanding our data intuitively through visualizations.
