
How to Predict and Visualize Data in one Chart

A Step-by-Step Tutorial on How to Use R and ggplot2 with Linear Regression

Photo by Tim Bogdanov on Unsplash

TUTORIAL – PREDICTION – R

In a recent project, I was asked to perform a simple linear regression to forecast possible price developments. To put the actual price development in context, we used the consumer price index as a baseline. This article shows how I approached this with a different data set, using ggplot2 for plotting and linear regression for prediction.

1. Setup

I will briefly explain my setup, including the data and R packages that I am using.

Packages

In general, I always use the tidyverse. It includes packages such as ggplot2, which creates beautiful plots in a very intuitive way, and dplyr, which makes data manipulation straightforward, along with much more.

Additionally, I use the ggthemes package, which provides more out-of-the-box themes for your plots.
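
A minimal setup sketch, assuming both packages are installed from CRAN (the install line is commented out and only needed once):

```r
# One-time installation (uncomment if the packages are missing)
# install.packages(c("tidyverse", "ggthemes"))

library(tidyverse)  # attaches ggplot2, dplyr, tidyr, readr, and others
library(ggthemes)   # extra ggplot2 themes such as theme_few()
```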

Oktoberfest Beer Price and Inflation Index Data

As the main price series, I use publicly available data from the Oktoberfest, the world’s largest beer festival. The data contains not only beer prices but also the number of visitors, chicken prices, and the amount of beer sold.

Table created by the author (limited to three rows)

I will use available information about the consumer price index in Germany to compare the beer price development.

"The Consumer Price Index (CPI) is a way of measuring the overall price level of the consumer goods and services in the economy." – Source Wikipedia.

The data itself comes from the Database of the Federal Statistical Office of Germany. Please note that Verbraucherpreisindex (VPI) is German for Consumer Price Index (CPI).

Consumer price index with base year in 1991 aligned with beer price; Table created by the author (limited to three rows)

2. My Workflow to Predict and Visualize Price Developments

Join Beer Price and VPI Data
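
The original code for this step is not reproduced here, so the following is only a sketch of how such a join could look with dplyr. The data frames, the column names (`year`, `beer_price`, `vpi`), and all values are invented placeholders, not the real Oktoberfest or Federal Statistical Office figures.

```r
library(dplyr)

# Toy stand-ins for the two sources; all values are invented for illustration
beer <- data.frame(
  year = 2015:2019,
  beer_price = c(10.2, 10.5, 10.8, 11.3, 11.6)
)
vpi <- data.frame(
  year = 2014:2019,
  vpi = c(99.5, 100.0, 100.5, 102.0, 103.8, 105.3)
)

# Keep only the years that appear in both tables
beer_vpi <- inner_join(beer, vpi, by = "year")
print(beer_vpi)
```

An inner join drops years that are missing from either source, which avoids NA values in the later regression step; a left join would be the alternative if every beer price row must be kept.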

Create Linear Regression Models
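
As a hedged sketch of this step, a simple linear model of price against year can be fitted with base R's `lm()` and extended to future years with `predict()`. The toy values, the column names, and the choice of a 95% confidence interval are assumptions for illustration, not the author's actual data or settings.

```r
# Toy yearly prices; values are invented for illustration only
beer <- data.frame(
  year = 2010:2019,
  beer_price = c(8.9, 9.2, 9.5, 9.8, 10.1, 10.3, 10.6, 10.9, 11.4, 11.7)
)

# Fit price as a linear function of the year
model <- lm(beer_price ~ year, data = beer)

# Years to forecast (hypothetical horizon)
future <- data.frame(year = 2020:2022)

# predict() returns a matrix with columns fit, lwr, and upr
pred <- predict(model, newdata = future, interval = "confidence", level = 0.95)

# Combine the years with the point forecast and interval bounds
beer_pred <- cbind(future, as.data.frame(pred))
print(beer_pred)
```

Using `interval = "prediction"` instead would give wider bounds that also cover the scatter of individual observations around the fitted line. The same pattern can be repeated for the consumer price index series.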

Beer price prediction; Table created by the author (limited to three rows)
Consumer price prediction; Table created by the author (limited to three rows)

Join all Datasets
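
One way to combine observed and predicted values into a single table is to stack them with dplyr's `bind_rows()` and mark their origin in an extra column. This is a sketch under assumptions: the column names, the `type` label, and all numbers are hypothetical.

```r
library(dplyr)

# Toy observed and predicted tables; values are invented for illustration
observed <- data.frame(
  year = 2017:2019,
  beer_price = c(10.9, 11.4, 11.7)
)
predicted <- data.frame(
  year = 2020:2021,
  fit = c(12.0, 12.3),
  lwr = c(11.8, 12.0),
  upr = c(12.2, 12.6)
)

# Stack both tables; columns missing in one table are filled with NA
combined <- bind_rows(
  observed %>% mutate(type = "observed"),
  predicted %>% rename(beer_price = fit) %>% mutate(type = "predicted")
)
print(combined)
```

Keeping everything in one long table with a `type` column makes it easy to map line type or color to that column in ggplot2.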

Joint data set (original data, predicted data); Table created by author (limited to three rows)

Create a Line Graph Plot including Predictions and Confidence Intervals
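
The plotting code itself is not shown here, so the following is a simplified sketch rather than the author's chart: a line for the prices, a different line type for the predicted years, and a shaded ribbon for the interval bounds. The data, column names, and labels are invented placeholders, and the text annotations of the final figure are left out.

```r
library(ggplot2)

# Toy combined table; values are invented for illustration only
combined <- data.frame(
  year = 2017:2022,
  beer_price = c(10.9, 11.4, 11.7, 12.0, 12.3, 12.6),
  lwr = c(NA, NA, NA, 11.8, 12.0, 12.2),
  upr = c(NA, NA, NA, 12.2, 12.6, 13.0),
  type = rep(c("observed", "predicted"), each = 3)
)

p <- ggplot(combined, aes(x = year, y = beer_price)) +
  # Shaded band for the predicted years only
  geom_ribbon(
    data = subset(combined, type == "predicted"),
    aes(ymin = lwr, ymax = upr),
    alpha = 0.2
  ) +
  # Separate line types for observed and predicted values
  geom_line(aes(linetype = type)) +
  labs(x = "Year", y = "Beer price", linetype = NULL) +
  theme_minimal()  # a ggthemes theme could be swapped in here

# print(p) renders the chart
```

Because the line type is mapped to `type`, ggplot2 draws the observed and predicted segments as separate lines, leaving a small gap between the last observed and the first predicted year.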

Oktoberfest beer price development and prediction; Image created by the author
Oktoberfest beer price development and prediction; Image created by the author from ggplot2 result

Conclusion

This article showed you how I visualized price developments and how I incorporated price predictions using linear regression models. When I look at the result, I see two significant limitations.

First, I am aware that linear regression is not the best predictor for prices in general, even though it might be justified in this case (i.e., predicting yearly Oktoberfest beer price changes). Although I included the margin of error in the visualization, the model does not reflect how uncertainty grows with each additional future year. I think it is worthwhile to look into the Holt-Winters forecasting method, and time series methods in general, to counter this. Holt-Winters can also take seasonality into account.
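
To give an idea of what the Holt-Winters route could look like, here is a minimal sketch using `HoltWinters()` from base R's stats package. It is not part of the workflow above. The series is invented, and because the data is yearly, the seasonal component is switched off with `gamma = FALSE`, which is an assumption of this sketch.

```r
# Toy yearly price series; values are invented for illustration only
prices <- ts(
  c(6.8, 6.9, 7.3, 7.4, 7.9, 8.1, 8.3, 8.9, 9.0, 9.5,
    9.7, 10.1, 10.4, 10.6, 11.1),
  start = 2005
)

# Level and trend smoothing only; yearly data has no seasonal cycle
hw <- HoltWinters(prices, gamma = FALSE)

# Forecast three years ahead; the result has columns fit, upr, and lwr
fc <- predict(hw, n.ahead = 3, prediction.interval = TRUE, level = 0.95)
print(fc)

# Width of the prediction interval for each forecast year
width <- as.numeric(fc[, "upr"] - fc[, "lwr"])
print(width)
```

The interval width printed at the end makes it possible to check how the uncertainty develops over the forecast horizon.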

Second, including text annotations in a reproducible chart makes the code quite cluttered and hard to maintain (or even to write in the first place). I am also unsure whether all the text annotations leave the plot itself unbalanced and cluttered.

What do you think I could do to improve this solution?

Please feel free to contact me with any questions and comments. Thank you. Find more articles from me here:

  1. Learn how to create beautiful art from your personal data
  2. Learn how I got and analyzed my Garmin running data
  3. Learn how to set up logging for your Python code
  4. Learn how to write clean code in Python using chaining (or pipes)
