The Data Science Behind the New York Times’ Dialect Quiz, Part 2

Audrey Lorberfeld
Towards Data Science
9 min read · Dec 13, 2018


The first part of this series explored Josh Katz’s NYT Dialect Quiz and touched on some data science topics, such as parameter space and “lazy” algorithms. In this part, we will start by defining some common machine learning (ML) terms and then move on to exploring how Katz might have used ML in the Quiz.

One caveat: I am a newcomer to Data Science. If you see any inaccuracies in the following post, please comment and let me know; I would love to deepen my understanding of the field. Among other things, this writeup is simply a high-level attempt at reverse-engineering a data-science project I love. (I have attempted to reach Josh Katz and find the code for his project, but have been unsuccessful.)

Some Additional Dialect-Quiz Background

From listening to Katz’s talk at NYC Data Science Academy and reading his interview with Ryan Graff, I have gathered the following:

  • Katz created a 142-question pilot dialect quiz that had the original 122 questions that Vaux and Golder used in their survey, plus 20 more that Katz came up with via input from the RStudio community (the same community to which he posted his original dialect map-visualizations, which got him noticed by the Times).
  • In addition to answering these 142 questions, users could select an answer of “other” for each question and write in custom responses.
  • In the pilot study, in addition to linguistic and locational questions, Katz also surveyed people on age and gender.
  • In total, 350k people participated in the pilot quiz.
  • The data from the pilot is what Katz used to build his final model for the NYT version.
  • Katz pared down the questions for the final Dialect Quiz from 142 to 35 (only 25 of which are fed to a user in a single session, making the quiz slightly different each time someone takes it) based on which ones he found most revealing.

Supervised v. Unsupervised ML

As mentioned in the first part of this series, K-Nearest Neighbors (K-NN), the algorithm that Katz used in his dialect quiz, is a supervised ML algorithm. This means that K-NN learns how to do its job by being fed data that contains both questions and answers. In contrast to unsupervised ML, K-NN and algorithms like it are given a set of problems along with their solutions, so they can easily see what type of output is expected of them in the future.

Claudio Masolo does a great job describing the differences between the two types of ML in his blog post “Supervised, Unsupervised, and Deep Learning”:

With supervised learning, a set of examples, the training set, is submitted as input to the system during the training phase. Each input is labeled with a desired output value, in this way the system knows [how the output should be, depending on the input] . . . [In u]nsupervised learning, on the other hand, the training examples provided to the system are not labelled with the belonging class. So the system develops and organizes the data, searching common characteristics among them, and changing based on internal knowledge.

Supervised learning schema from Masolo’s post “Supervised, Unsupervised, and Deep Learning”
Unsupervised learning schema from Masolo’s post “Supervised, Unsupervised, and Deep Learning”

In summary:

  • Supervised ML (e.g. K-NN) = feeding your model data containing questions and answers so that it can make accurate predictions.
  • Unsupervised ML = feeding your model data containing questions and asking it to tease out patterns from those questions that it can then use to make accurate predictions.
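To make that contrast concrete, here is a minimal sketch (in Python with scikit-learn, on made-up toy data rather than anything from the quiz) that fits a supervised K-NN classifier on labeled points and, for comparison, runs an unsupervised clustering algorithm on the same points without labels:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Toy data: each row is one observation, each column one feature.
X = np.array([[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]])

# Supervised (K-NN): we hand the model the answers (labels) along with the data.
y = np.array(["north", "north", "south", "south"])
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[1.2, 0.8]]))  # -> ['north']

# Unsupervised (k-means): no labels at all; the model finds structure on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
print(kmeans.fit_predict(X))  # e.g. [1 1 0 0] -- cluster ids, not class names
```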

More jargon!

So, we know that K-NN is a supervised ML algorithm, and now we know what that means. Before we move on, there’s just a bit more jargon we have to tackle.


In the world of ML, we data scientists train our algorithms by feeding them training data (usually about 80% of our dataset). This training data consists of things called feature vectors and labels. Let’s go over both of these concepts, in addition to a couple more terms, to jumpstart our understanding.

Features

Features are essentially your dataset’s column headers. They are your independent variables, the change in any one of which may or may not result in a subsequent change in your dependent variable (or “target”). (If you’ve ever heard of a data scientist doing “feature engineering,” these are the things they’re adding or deleting in order to optimize their model.)

In the case of the Dialect Quiz, some features were likely the different questions in the quiz, age, and sex.

Feature Vectors

Feature vectors are essentially your dataset’s rows. Each feature vector is a numerical representation of the observations in your dataset (yay, linear algebra!).

Let’s say you’re creating a dataset of housing prices. Each row represents a single house’s pricing information. While your features will be things like “asking_price,” “num_bedrooms,” and “selling_price,” a feature vector would be something like “1,200,000, 3.0, 950,000.” That feature vector would represent the asking price ($1.2MM), the number of bedrooms (3), and selling price ($950k) for a specific house. In K-NN, each feature vector is treated as the location of a data point in space (think longitude and latitude), and the Euclidean distances between those points determine which observations count as “neighbors.”
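In code, that hypothetical housing dataset might look like this (toy numbers only, just to show that each row is a numeric vector):

```python
import numpy as np

# Features (column headers) for the hypothetical housing dataset:
# asking_price, num_bedrooms, selling_price
X = np.array([
    [1_200_000, 3.0,   950_000],  # the feature vector for the house above
    [  450_000, 2.0,   430_000],
    [2_000_000, 5.0, 1_900_000],
])

print(X[0])  # one row = one feature vector = one house
```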

In the Dialect Quiz, feature vectors would likely be numerical representations of a person’s answers to each question in the quiz, their age, and their gender (in addition to locational data).

Labels

Labels are your dependent variable(s). They are the targets or classes you are trying to predict. For the Dialect Quiz, these would likely be geographic locations.

(Labels technically live in your dataset alongside your features. However, during the feature-engineering phase and something called the train-test split, you take them out in order to isolate the relationships between your dependent and independent variables.)
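Here is a rough sketch of separating labels from features and holding out test data (made-up values standing in for encoded quiz answers, with invented city names as the labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: each row holds encoded quiz answers (features),
# and y holds the label we want to predict (here, a made-up location).
X = np.array([[1, 0, 2], [0, 1, 2], [2, 2, 0], [1, 1, 1], [0, 2, 2]])
y = np.array(["boston", "boston", "atlanta", "atlanta", "boston"])

# Pull the labels out and hold back ~20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # -> (4, 3) (1, 3)
```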

Finally, Let’s Get To It

Okay, now that we are all on the same page with some key ML terms, we can start exploring how Katz might have actually used K-NN to produce his Dialect Quiz. From his Data Science Academy talk, I was able to find the following slide.

Now, there is a lot going on in this slide, most of which will have to remain a bit abstract without being able to explore his code.

Going from the top, there are two chunks of information I’ll (briefly) explore in the remainder of this post:

  • Choosing a K-value
  • Kernel Smoothing

So, onto the madness!

Choosing a K-Value

On the slide above, Katz seems to be saying that he wants to use the difference (the “proportions”) between his chosen k-value and a point’s k nearest neighbors (t) in order to estimate the probability of a person being from some location. So, how does one choose this elusive k value, and what is k?

k is a “hyperparameter,” which is just a fancy word for some attribute of your model that you can tweak independently of your data. For example, if you were a professional speed-hotdog-eater, a hyperparameter you might care about would be the number of hotdogs you put in your mouth at once. Maybe you want to try eating 3 at a time, maybe 1 at a time. You can change this overarching attribute of your strategy independent of whether you are eating Hebrew National hotdogs or Gray’s Papaya hotdogs.

In K-NN, your hyperparameter k is basically the number of nearest neighbors you want your model to care about. In practice, different k-values result in different “decision boundaries.” It’s up to you to pick the best one for your data. Below is an illustration from Kevin Zakka’s great blog post on K-NN showing what different k-values might look like.

As Zakka notes, a smaller k-value (e.g. 1) will result in a more flexible fit (with low bias but high variance, a trade-off we don’t have time to get into in this post). A larger k-value (e.g. 20) would result in a smoother boundary, because it is better at resisting outliers (with high bias and low variance).
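Here is a tiny sketch of that effect (toy data, not Katz’s; in a real project you would look at full decision boundaries or validation error rather than a single prediction):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Four "a" points and one stray "b" near the origin, plus a far-away "b" cluster.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0.2], [8, 8], [9, 9], [8, 9]])
y = np.array(["a", "a", "a", "b", "b", "b", "b"])

for k in (1, 3, 5):
    model = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    # k=1 chases the single nearest point (the stray "b"): flexible, high variance.
    # k=3 and k=5 let the surrounding "a" points outvote it: smoother, higher bias.
    print(k, model.predict([[1.0, 0.4]]))
```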

One strategy that Katz might have used to land on his optimal k-value is cross-validation. Cross-validation is a way to estimate generalization error. With cross-validation and K-NN, you divide your sample data into random segments and apply your K-NN model to each segment with varying values for k. You then look at each segment’s prediction error (in regression-speak, the sum of squared errors: the gap between your predicted values and the actual values). After you run your K-NN with varying values for k over your segmented data, the errors for each k are averaged, and you pick the k-value that yielded the smallest average error.
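A minimal sketch of that procedure, using scikit-learn in Python rather than Katz’s R code and a randomly generated stand-in dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: 300 labeled observations with 5 numeric features.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

scores = {}
for k in range(1, 31):
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation: train on four segments, score on the fifth, rotate.
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)  # highest average accuracy = lowest error
print(best_k, round(scores[best_k], 3))
```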

Flatiron School slide from Linear Regression deck created by Sean-Abu Wilson.

Whatever value you pick for k will change the output of your model.

While we do not know exactly which k Katz chose for his model, we can at least understand its importance.

Kernel Smoothing

Katz also seems to have used something called a kernel smoother in his model. Wikipedia tells us that a kernel smoother is a

statistical technique to estimate a real valued function as the weighted average of neighboring observed data. The weight is defined by the kernel, such that closer points are given higher weights.

So, kernels help us weight the votes of the k nearest neighbors in our K-NN model, with closer neighbors counting for more.
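We don’t know exactly which kernel Katz used, but scikit-learn’s distance-weighted K-NN shows the general idea: closer neighbors get bigger votes. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array(["a", "a", "b", "b"])

# "uniform": each of the k neighbors gets an equal vote.
# "distance": each neighbor's vote is scaled by 1/distance, so closer
# points dominate -- the same idea a kernel smoother captures.
for w in ("uniform", "distance"):
    model = KNeighborsClassifier(n_neighbors=3, weights=w).fit(X, y)
    print(w, model.predict_proba([[1.8]]))  # probabilities for classes ['a', 'b']
```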

In their paper “On Kernel Difference-Weighted K-Nearest Neighbors Classification,” Wangmeng Zuo et al. write that the weights are “defined as a distance between an unclassified sample x and a training sample x…” (248). This sounds very similar to what Katz described on the slide we looked at before, where his strategy was to use the proportion of the difference between an unknown value t and its nearest neighbors, so it seems we are on the right track.

In their paper, Zuo et al. mention another paper that might help us: “Learning Weighted Metrics to Minimize Nearest-Neighbor Classification Error.” In it, Roberto Paredes and Enrique Vidal discuss optimizing their K-NN model by minimizing the classification error estimated with the “leave-one-out” cross-validation (LOOCV) technique.

On another slide that I found from his NYC Data Science Academy talk, Katz specifically states that he used LOOCV to choose the parameters of his model, one of which was k. While we won’t go into the sigma and alpha parameters in this post, let’s briefly touch on LOOCV to get an idea of what’s going on here.

Slide from Katz’s NYC Data Science Academy talk

In LOOCV, you pick 1 data point to test. This is the point that is “left out.” You then build your K-NN model without this 1 data point and evaluate the error value between your model and this left-out data point. You repeat this for all of your training data points and average the results. LOOCV is known to be very “computationally-expensive” because you have to create and run so many models. Because of this, as you can see on the slide above, Katz chose to limit his possible k-values to groups of 20.
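Here is what LOOCV looks like in scikit-learn (toy data again; Katz worked in R and tuned more parameters than just k, so this only illustrates the mechanics):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=4, random_state=1)

# Leave-one-out: fit the model once per data point, each time holding out a
# single observation and scoring on just that point. Accurate but expensive.
loo = LeaveOneOut()
for k in (5, 10, 20):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=loo).mean()
    print(k, round(acc, 3))
```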

So, by employing kernel weighting and cross-validation strategies like LOOCV, Katz arrived at his optimal k-value.

That was a lot of information. As I said before, please reach out in the comments of this post with any corrections, clarifications, or additional resources I should explore to deepen my understanding of K-NN and ML.

Before we conclude, a few last things to know about the Dialect Quiz:

  • The mini-heat-maps that appear beside each question as you go through the Dialect Quiz are simply static images that were pre-rendered in R.
  • Unlike the mini-maps, the large map a user gets at the end of the Quiz is produced dynamically in D3.js, because the number of possible answer combinations is far beyond what Katz could have rendered beforehand.
  • The model calculations were done server-side. After each calculation is made, the server produces a vector representing each answer; vector-matrix multiplication is done on that vector, and the server sends back values for each of the answers, which are then rendered on the user’s screen. (A speculative sketch of what that might look like follows this list.)
  • Katz largely built the quiz using an R package called Shiny.
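As promised above, here is a purely speculative sketch of that server-side step. All of the names, numbers, and the one-question setup are invented; it is only meant to show what “multiply an answer vector by a matrix to get per-location scores” could mean:

```python
import numpy as np

# Hypothetical setup: one question with 3 answer choices, 4 candidate locations.
# Each row holds made-up, pre-computed scores for one answer choice.
answer_to_location_scores = np.array([
    [0.7, 0.1, 0.1, 0.1],  # scores if the user answered "soda"
    [0.1, 0.6, 0.2, 0.1],  # scores if the user answered "pop"
    [0.1, 0.2, 0.1, 0.6],  # scores if the user answered "coke"
])

# The user's answer encoded as a vector (they picked the second choice, "pop").
answer_vector = np.array([0, 1, 0])

# Vector-matrix multiplication produces one score per location,
# which the client could then render as a heat map.
location_scores = answer_vector @ answer_to_location_scores
print(location_scores)  # -> [0.1 0.6 0.2 0.1]
```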

We did it! We now know most of what’s going on in the famous NYT Dialect Quiz.
