I posed this problem to apprentices of mine while they were getting used to several concepts in programming, statistics, and data manipulation. Here we will use a simplified version that relies on my own data, which has already been cleaned up. I will discuss the original problem as well, but the examples only cover the portions that involve scaling the data and determining the nearest guess.
Baby guessing games, where participants write down their best guess as to when the child will be born and how much the child will weigh, are quite common. Because the two variables have different units, it is hard to determine who had the best guess overall, so the data must be scaled before any measure of distance is considered. For a refresher on feature scaling, check out this article.
The Original Problem
The original problem posed to the apprentices was as follows:
- We have two colleagues expecting babies in the next two months.
- Everyone on the team has written in a guess for date of birth (DOB) and birth weight for each of the two pregnancies. These guesses are in our internal Slack channel in free-text form, so the Slack API had to be used to get the data.
The original problem involved regex, and the apprentices had to deal with the following:
- Which messages were related to this problem
- Some messages had guesses for both babies (by name) so they had to identify which values were associated with which child. In other cases a separate post was made for each guess.
- Various date and weight formats were used, and each had to be accounted for (6.3 lbs | 6 lb 2 oz | etc.).
- We wanted to show who had the closest weight guess, closest date guess, and, most importantly, who was closest overall.
The Simplified Problem
In our case, we will be using the data from my own child’s guessing game. The data will already be fairly clean so we will be focusing on the second half of the problem.
Here we have the individuals that made guesses, the date/time of the guess, and the weight in pounds and ounces. Feel free to download the CSV for use in the following code.
We will be attempting to determine who had the closest guess using both the date/time and the weight. To do this, we will need to scale the data properly and potentially visualize this data so those participating can see why the winner was chosen.
Getting Started
I recommend using RStudio as your IDE.
You can run the following code to install/load the necessary libraries.
The following code will create a popup choose-file dialog box. Choose the downloaded CSV from above to continue.
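A minimal sketch of that loading step is shown below. The helper name `read_guesses` is hypothetical, and `stringsAsFactors = FALSE` is an assumption about how we want text columns treated; `file.choose()` and `read.csv()` are base R.

```r
# Hypothetical helper: read the guesses CSV into a data frame.
# When no path is supplied, file.choose() opens the choose-file dialog
# (this only works in an interactive session such as RStudio).
read_guesses <- function(path = file.choose()) {
  read.csv(path, stringsAsFactors = FALSE)
}

# Interactive use: pick the downloaded CSV in the dialog.
# babyGuessDf <- read_guesses()
```

Making the path an argument keeps the dialog as the default while still allowing the same function to be called from a script.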
We have a minor amount of data preparation required for this dataset. The CSV file will have the `Date` field stored as a string and the `Weight_Oz` field will have NA for the missing values rather than zero. We will also combine the pounds and ounces to one single `guessWeight` field (in pounds).
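The preparation steps above can be sketched in base R as follows. Only `Date`, `Weight_Oz`, and `guessWeight` are names taken from the dataset description; the `Name` and `Weight_Lb` columns, the sample rows, and the date format string are assumptions made for illustration.

```r
# Hypothetical rows standing in for the CSV contents.
babyGuessDf <- data.frame(
  Name      = c("Guesser_A", "Guesser_B", "Guesser_C"),
  Date      = c("2019-03-01 14:30", "2019-03-04 08:00", "2019-02-27 22:15"),
  Weight_Lb = c(7, 8, 6),
  Weight_Oz = c(4, NA, 12),
  stringsAsFactors = FALSE
)

# Parse the date string (the format is an assumption about the file).
babyGuessDf$Date <- as.POSIXct(babyGuessDf$Date,
                               format = "%Y-%m-%d %H:%M", tz = "UTC")

# A missing ounce value means no ounces were given, so treat NA as zero.
babyGuessDf$Weight_Oz[is.na(babyGuessDf$Weight_Oz)] <- 0

# Combine pounds and ounces into a single weight in pounds (16 oz per lb).
babyGuessDf$guessWeight <- babyGuessDf$Weight_Lb + babyGuessDf$Weight_Oz / 16
```

The same steps would also fit inside a single `dplyr::mutate()` call if you prefer the pipe style.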
Now we need to set the actual values for the baby’s birth date/time and birth weight.
Distance Measures
Next, we will prepare the data by calculating the differences in the guesses and the actual values. We will be calculating distances in hours and pounds.
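As a rough sketch of this step, the block below sets placeholder actual values and computes the absolute differences. The guesses, the actual date/time, the actual weight, and the column names `guessDate`, `dateDistHours`, and `weightDistLbs` are all hypothetical; they are not the real results.

```r
# Hypothetical guesses (already prepared).
guesses <- data.frame(
  Name        = c("Guesser_A", "Guesser_B", "Guesser_C"),
  guessDate   = as.POSIXct(c("2019-03-01 14:30", "2019-03-04 08:00",
                             "2019-02-27 22:15"),
                           format = "%Y-%m-%d %H:%M", tz = "UTC"),
  guessWeight = c(7.25, 8, 6.75),
  stringsAsFactors = FALSE
)

# Placeholder actual values, for illustration only.
actualDate   <- as.POSIXct("2019-03-02 10:00",
                           format = "%Y-%m-%d %H:%M", tz = "UTC")
actualWeight <- 7.5

# Absolute distance in hours between each guess and the actual date/time.
guesses$dateDistHours <- abs(as.numeric(
  difftime(guesses$guessDate, actualDate, units = "hours")
))

# Absolute distance in pounds between each guess and the actual weight.
guesses$weightDistLbs <- abs(guesses$guessWeight - actualWeight)
```

Passing `units = "hours"` explicitly matters here, because `difftime()` otherwise picks a unit automatically based on the size of the differences.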
To consider distances across both variables, we need to scale the data. We will be using a form of standardization in this case.
Traditionally standardization takes on the following form:
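In the usual notation (the symbols are assumed here), where $x$ is a single guess, $\mu$ is the mean of that feature, and $\sigma$ is its standard deviation:

$$z = \frac{x - \mu}{\sigma}$$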

We will not be using this version. Since we are concerned with the distance from a particular value, we will instead divide our absolute distances by the standard deviation to yield the following:
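Written out in assumed notation, where $x_{\text{actual}}$ is the true value and $\sigma$ is taken to be the same per-feature standard deviation as in the traditional form:

$$d = \frac{\lvert x - x_{\text{actual}} \rvert}{\sigma}$$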

We will discuss why this is preferred in our scenario when we get to the visualization of the results.
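A small sketch of this modified scaling is below. The helper `scale_distance` and the sample numbers are hypothetical, and dividing by the standard deviation of the guesses (rather than of the distances) is an assumption.

```r
# Hypothetical helper: absolute distance from the actual value,
# divided by the standard deviation of the guesses for that feature.
scale_distance <- function(guess, actual) {
  abs(guess - actual) / sd(guess)
}

# Hypothetical weights in pounds.
guessWeight  <- c(6, 7, 8)
actualWeight <- 7.5

# Hypothetical date/times expressed as hours since an arbitrary reference.
guessHours  <- c(10, 20, 30)
actualHours <- 22

weightDistScaled <- scale_distance(guessWeight, actualWeight)
dateDistScaled   <- scale_distance(guessHours, actualHours)
```

After this step both features are in units of standard deviations, so a guess that is off by one unit in weight is treated the same as one that is off by one unit in time.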
And finally, we can add a field that calculates the Euclidean distance of each guess from the actual value in this new scale.
Note: The values of this distance are not relevant, only the relative nature of the values to one another. Basically, we take the minimum of these to determine which is closest.
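The combined distance can be sketched as follows. Because the actual value sits at 0 on both scaled axes, the Euclidean distance is just the square root of the sum of the squared scaled distances. The scaled values and the column names other than `guessDist` are hypothetical.

```r
# Hypothetical scaled distances for three guessers.
scaled <- data.frame(
  Name             = c("Guesser_A", "Guesser_B", "Guesser_C"),
  dateDistScaled   = c(1.2, 0.2, 0.8),
  weightDistScaled = c(1.5, 0.5, 0.5),
  stringsAsFactors = FALSE
)

# Euclidean distance from the actual value, which is the origin here.
scaled$guessDist <- sqrt(scaled$dateDistScaled^2 + scaled$weightDistScaled^2)

# The best overall guess is the row with the smallest distance.
winner <- scaled$Name[which.min(scaled$guessDist)]
```

With these placeholder numbers, `winner` is "Guesser_B", the guesser who is close on both axes.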
Now we can take a moment to view the data with each of these fields.
# let's take a look at the data frame arranged by the relative guess distance
View(babyGuessDf %>% arrange(guessDist))
From this, we see the following:
- Closest time: "Aunt_1"
- Closest weight: "Grandma_1"
- Best guess overall: "Grandma_1"
At this point, we are essentially done with the problem. However, it would be nice to justify these results with a visual.
Visualization
Let’s take a look at the data on the standardized scale, using the absolute values. The modified version of standardization we used places the actual values at 0, which saves us several steps.
To make this visual, we will be adding a row for the actual value. We will also be using `coord_equal()` to prevent ggplot from normalizing the view. (This is not too big of a deal with standardized data, but it can introduce misconceptions when thinking about scaled distances.)
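One way such a plot could be put together is sketched below. It assumes ggplot2 is installed; the data values, column names, and axis labels are hypothetical and do not reproduce the original figure.

```r
library(ggplot2)

# Hypothetical scaled distances, plus a row for the actual value at (0, 0).
plotDf <- data.frame(
  Name             = c("Guesser_A", "Guesser_B", "Guesser_C", "Actual"),
  dateDistScaled   = c(1.2, 0.2, 0.8, 0),
  weightDistScaled = c(1.5, 0.5, 0.5, 0),
  stringsAsFactors = FALSE
)

p <- ggplot(plotDf, aes(x = dateDistScaled, y = weightDistScaled,
                        label = Name)) +
  geom_point() +
  geom_text(vjust = -0.8) +
  # Keep one unit on x the same length as one unit on y, so distances
  # on screen match the Euclidean distances we calculated.
  coord_equal() +
  labs(x = "Scaled date/time distance", y = "Scaled weight distance")

print(p)
```

Without `coord_equal()`, ggplot stretches each axis to fill the window, and the point that looks closest to the origin may not be the one with the smallest calculated distance.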

If we had instead used the traditional calculation for standardization, we would need to apply the same transformation to the actual values in order to get the distance.
Here is the traditional standardization. Note: This visual doesn’t use the absolute values. The distances would be calculated after this step, so the absolute values are not necessary.
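The extra step can be sketched with a single feature. The weights below are hypothetical, and the sketch assumes the standard deviation is taken over the guesses; under that assumption both routes give the same distances.

```r
# Hypothetical weights in pounds.
guessWeight  <- c(6, 7, 8)
actualWeight <- 7.5

mu    <- mean(guessWeight)
sigma <- sd(guessWeight)

# Traditional standardization of the guesses...
guessZ <- (guessWeight - mu) / sigma
# ...and the same transformation applied to the actual value.
actualZ <- (actualWeight - mu) / sigma

# Distance on the standardized scale.
distTraditional <- abs(guessZ - actualZ)

# Modified version: absolute distance divided by the standard deviation.
distModified <- abs(guessWeight - actualWeight) / sigma
```

The mean cancels out when the two standardized values are subtracted, which is why the modified version can skip transforming the actual value.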

And to tie all this together we can take a look at the data in the original scale. Note: ggplot is normalizing the window, so this view is as if we normalized the data 0–1 for each feature.

I hope you enjoyed this short guided example and perhaps learned something new or found some benefit from the practice. There’s no machine learning here, but a mastery of these concepts lays a good foundation for analytics and for working with Euclidean distance-based algorithms.
I’d love any feedback if you took the time to go through the problem. Take care.