By Huey Fern Tay
With Greg Page

To infer is to arrive at a conclusion based on evidence and reasoning.
In literature, this is done through contextual clues. Figures of speech such as "…you have a February face, so full of frost, of storm and cloudiness…" tell us that a person is gloomy because February is still a cold time of year in the northern hemisphere.
Similarly, in data science, intelligent guesses can be made based on our domain knowledge, linear modelling, or just common sense.
To illustrate this, I will use the example of a fictional amusement park in Maine called Lobster Land.
Suppose that you are working at Lobster Land as the resident data analyst. One day, the boss asks you to "fix the problem" after discovering some entries missing in a spreadsheet. With only a data dump in hand, and without knowing the context, where would you start?
Step 1: Determine the severity of the missing data problem.
We begin by installing the visdat package in R. Applying its vis_miss() function to the raw data enables you to visualise the holes in your dataset.
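As a minimal sketch of this step, the toy data frame below stands in for the article's rawdata (the values are illustrative only). We can count the gaps ourselves with base R, and the commented line shows the visdat call that draws the chart:

```r
# Toy stand-in for the article's rawdata; values are illustrative only
rawdata <- data.frame(
  precipitation = c(0.2, NA, 0.0, 0.5, NA, 0.1),
  GoldZoneRev   = c(31000, 33163, NA, 29500, 30200, 32800),
  DayPass       = c(4151, 4412, 3900, 4001, 4105, 4388)
)

# Percentage of missing values per column, the same figure vis_miss() reports
miss_pct <- round(colMeans(is.na(rawdata)) * 100, 2)
print(miss_pct)

# With the visdat package installed, the same picture as a chart:
# install.packages("visdat")
# visdat::vis_miss(rawdata)
```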
The chart below shows holes within two categories: precipitation and GoldZoneRev. The precipitation variable indicates the rainfall for a particular day, measured in inches. The GoldZoneRev variable, measured in dollars, indicates the revenue generated by Lobster Land’s indoor gaming area, where people win gold tickets that they can exchange for gifts.
Luckily, the gaps are small: only 5.66% of the GoldZoneRev values and 3.77% of the precipitation values are missing.

Step 2: Determine which factor(s) influence the revenue at the Gold Zone
As I have written in a previous article, Last Observation Carried Forward (LOCF) is one method we can use to treat missing values. It can work well in cases where a previous observation is likely to offer an important clue about a missing value (as with daily weather data in a place with a seasonally changing climate). However, in this case, LOCF is not well-suited for the challenge at hand, as the revenue of an amusement park attraction can be influenced by several factors. The day of the week might matter, as might the overall volume of park visitors on a particular day. There is no reason for us to assume that we can make a reasonable imputation for this variable by simply using the previous day’s total.
A correlation matrix will tell us whether GoldZoneRev is related to the other variables in the dataset, and how strong those relationships are.
To create the matrix, we use the cor() function. At the same time, we tell R to use only the columns classified as ‘numeric’ and to skip the missing values. These instructions are necessary to prevent the command from failing. Finally, we round the results to two decimal places.
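A sketch of this step, again using an illustrative stand-in for rawdata (the Weekday and SnackShackRev columns are assumptions added for demonstration):

```r
# Toy stand-in for rawdata: numeric columns with gaps, plus a
# non-numeric column that must be excluded before calling cor()
rawdata <- data.frame(
  Weekday       = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
  DayPass       = c(3900, 4001, 4105, 4151, 4388, 4412),
  SnackShackRev = c(8200, 8600, 9100, NA, 9800, 10050),
  GoldZoneRev   = c(28800, 29500, 30200, NA, 32800, 33163)
)

# Keep only numeric columns, skip missing pairs, round to two decimal places
numeric_cols <- rawdata[sapply(rawdata, is.numeric)]
cor_matrix <- round(cor(numeric_cols, use = "pairwise.complete.obs"), 2)
print(cor_matrix)

# A package such as corrplot can draw the circle chart described below:
# corrplot::corrplot(cor_matrix)
```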
In the matrix shown below, the sizes of the circles correspond to the strength of the correlations. Darker shades of blue indicate positive correlations, whereas darker shades of red indicate negative ones. In this case, the number of day passes sold, the revenue from the Snack Shack, and the number of hours Lobster Land staff work are three of the variables most strongly correlated with the day’s takings at the Gold Zone.

Step 3: Estimate the day’s takings at the Gold Zone based on its relationship with one or more variables
For the sake of simplicity in this example, we will try to estimate the revenue at the Gold Zone for those missing days based on one factor – the number of day passes sold.
The correlation matrix tells us that the more day passes are sold at Lobster Land, the more money we make at the Gold Zone. In other words, there is a strong, positive, linear relationship between those two variables (see chart below).
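A quick way to confirm this relationship, using toy numbers in place of the real data:

```r
# Illustrative data: day passes sold vs Gold Zone revenue
day_pass <- c(3900, 4001, 4105, 4151, 4388, 4412)
gold_rev <- c(28800, 29500, 30200, 31000, 32800, 33163)

# Pearson correlation: close to +1 means a strong, positive, linear link
r <- cor(day_pass, gold_rev)
print(round(r, 2))

# A scatter plot with a fitted line makes the pattern visible:
# plot(day_pass, gold_rev, xlab = "Day passes sold",
#      ylab = "Gold Zone revenue ($)")
# abline(lm(gold_rev ~ day_pass), col = "blue")
```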

Since our missing values for GoldZoneRev lie within the range of our existing data, we can infer the day’s takings by looking at how many day passes were sold that day. To do this, we use the impute_lm() function from R’s simputation package.
To make an intelligent guess about the day’s takings based on the number of day passes sold, we can run the following code:
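A sketch of that step follows, using toy data in place of rawdata. The simputation call is shown in the comments, and the active lines use base R’s lm() and predict() to achieve the same result; the column name is_missing is the flag described below:

```r
# With the simputation package installed, the step is:
# library(simputation)
# rawdata$is_missing <- is.na(rawdata$GoldZoneRev)
# rawdata <- impute_lm(rawdata, GoldZoneRev ~ DayPass)

# Base-R equivalent on toy data (values are illustrative only)
rawdata <- data.frame(
  DayPass     = c(3900, 4001, 4105, 4151, 4388, 4412),
  GoldZoneRev = c(28800, 29500, 30200, NA, 32800, 33163)
)
rawdata$is_missing <- is.na(rawdata$GoldZoneRev)        # flag original gaps
fit <- lm(GoldZoneRev ~ DayPass, data = rawdata)        # fit on observed rows
rawdata$GoldZoneRev[rawdata$is_missing] <-
  predict(fit, newdata = rawdata[rawdata$is_missing, ]) # fill the gaps
```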
In the code above, we first tell R to look within the ‘rawdata’ dataset. Then we add a column called is_missing to keep track of the cells that were originally blank. Next, we call the impute_lm() function. Within that function, we express GoldZoneRev as a function of DayPass.
Step 4: Verify the results
Day 13 is the first instance where the total daily revenue for Gold Zone was missing.
Our linear model tells us that the day’s revenue was $30,914 on a day when 4,151 day passes were sold. That estimate makes sense: it is only slightly lower than our takings on Day 7, when 4,412 day passes were sold and $33,163 was generated at the Gold Zone.
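The same sanity check can be sketched with toy numbers: fit the model on the observed days only, then confirm that the estimate for the missing day falls between the takings of comparable observed days (the figures below are illustrative, not the article’s actual data):

```r
# Observed days only (the day with 4151 passes had missing revenue)
day_pass <- c(3900, 4001, 4105, 4388, 4412)
gold_rev <- c(28800, 29500, 30200, 32800, 33163)
fit <- lm(gold_rev ~ day_pass)

# Estimate revenue for the missing day from its day-pass count
estimate <- predict(fit, newdata = data.frame(day_pass = 4151))
print(round(estimate))
```

Because 4,151 passes sits between the observed days with 4,105 and 4,388 passes, a sensible estimate should land between those two days’ revenues.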

So that leaves us with two questions. Could we have predicted Gold Zone’s missing revenue based on more than one factor? Yes, but this would have violated the parsimony principle – because the linear relationship between DayPass and GoldZoneRev was so strong, there was no need to introduce additional complications by adding other input variables.
Lastly, could we have used a similar linear model to predict missing values in the ‘precipitation’ column? The answer is no. Rain does not depend on the day of the week, the number of lobster rolls sold, or any other variables present in our dataset, so there would not be any suitable use of impute_lm() to fill those holes in our dataset.