The Data Science Behind the New York Times’ Dialect Quiz, Part 1

Audrey Lorberfeld
Towards Data Science
6 min read · Nov 16, 2018


In 2013 the New York Times published Josh Katz’s “How Y’all, Youse and You Guys Talk.” You probably remember taking it, or at least hearing about it. It was the one that asked you things like “What do you call something that is across both streets from you at an intersection?” Answers you could choose included options like “kitty-corner” and “catty-corner” (the latter being the obvious right choice). Everyone I knew was impressed by its accuracy.

After answering 25 questions aimed at teasing out your linguistic idiosyncrasies, you were classified as having grown up in a particular area of the US (technically, the quiz shows you the region where people are most likely to speak like you, so it could ostensibly show you where your parents grew up, rather than where you grew up, as Ryan Graff points out). To my surprise, every time I took the quiz, it classified me as being from some town or another never more than ~15 miles from where I actually grew up. And my experience was not unique – the quiz was the most popular thing the Times put out that year, despite its publication date of December 21. It was such a hit that three years later Katz published a book about it.

Yes, I’m from the Yonkers area.

Besides being a national phenomenon in 2013, why should we care about Katz’s dialect quiz now? I care deeply about it because I am a language and information-science nerd. But you should care about it because it was a successful attempt at bringing data science into the homes of millions of Americans, regardless of their technical background.

First things first: a brief history of the quiz.

(Much of the following information is based on Katz’s talk at the NYC Data Science Academy.)

  • The questions in Katz’s quiz were based on a larger research project called the Harvard Dialect Survey, published in 2003 by Bert Vaux and Scott Golder from Harvard’s Linguistics Department (you can find a good interview with Vaux on NPR here).
  • Vaux and Golder distributed their 122-question quiz online, and it focused on three things: pronunciation, vocabulary, and syntax.
  • The original quiz resulted in about 50k observations, all of which were coded by zip code.
  • Katz authored the Times’ version of this quiz in 2013 as a graduate-student intern during his studies in statistics at North Carolina State University. (He was invited to do the Times internship after they discovered his visualizations of Vaux and Golder’s original data.)
  • The tech behind the Times quiz includes R and D3, the latter being a JavaScript library for binding data to a page’s DOM so it can be manipulated and visualized (similar in spirit to jQuery).

On to the data science

So how did the quiz actually work? Its foundation was K-Nearest Neighbors (K-NN), a supervised machine learning algorithm that, as my graduate-school TA told us, is used to “predict the class of a new datapoint based on the value of the points around it in parameter space.” We will dive into the idea of machine learning and the ins and outs of K-NN itself in a later post. For now, let’s tackle some of the jargon in my TA’s definition.

What is “parameter space”?

According to Wikipedia, parameter space is “the set of all possible combinations of values for all the different parameters contained in a particular mathematical model.” While impressive-sounding, that definition’s not particularly helpful for the layperson. Since I am a visual learner, perhaps a doodle will be more edifying:

Personal doodle.

Essentially, if you have parameters (i.e. arguments or variables) that you can plot, the space in which you plot them is parameter space. For K-NN, the parameter space in the chart below is everything spanned by the two axes, and the star is the point we are trying to classify. (Ignore the k-values for now.)

http://bdewilde.github.io/blog/blogger/2012/10/26/classification-of-hand-written-digits-3/

In the chart above, there are two types of circles: yellow circles and purple circles. The point of performing K-NN on a dataset like this is to predict whether the star, our new input, will fall into the yellow-circle category or the purple-circle category based on its proximity to the circles around it.
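If you like seeing ideas in code, here is a minimal sketch of that star-and-circles setup using scikit-learn in Python. The coordinates, labels, and choice of k are all invented for illustration (and, to be clear, Katz built the quiz in R, not with this snippet):

```python
# A minimal sketch of the star-and-circles example above, using scikit-learn.
# The points and labels are made up for illustration only.
from sklearn.neighbors import KNeighborsClassifier

# Training points in a two-dimensional parameter space, each labeled
# "yellow" or "purple" (the circles in the chart).
points = [
    [1.0, 1.2], [1.5, 1.8], [2.0, 1.0],   # a cluster of yellow circles
    [6.0, 6.5], [6.5, 7.0], [7.0, 6.0],   # a cluster of purple circles
]
labels = ["yellow", "yellow", "yellow", "purple", "purple", "purple"]

# k = 3: look at the 3 nearest circles when classifying a new point.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(points, labels)

# The "star" -- the new point whose class we want to predict.
star = [[2.0, 2.0]]
print(knn.predict(star))  # -> ['yellow'], since its nearest neighbors are all yellow
```

With k = 3, the classifier simply looks at the three circles closest to the star and lets them vote.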

So, parameter space. Check.

There is one more term we need to tackle before diving into the ideas and math behind K-NN. It was absent from my TA’s definition above, but understanding it will help us see exactly what is going on when we run a K-NN analysis: algorithmic laziness.

https://www.theodysseyonline.com/im-secretly-lazy

K-NN is a “lazy” algorithm.

But how can an algorithm be lazy? Can algorithms get tired? Can they have bad days? Sadly, no. Here, laziness means that an algorithm does not use training data points for any generalization, as Adi Bronshtein writes.

We haven’t yet bridged the idea of training an algorithm, but we can still understand what Bronshtein means. Essentially, all supervised machine learning algorithms need some data on which to base their predictions. In K-NN’s case, it needs data like the yellow and purple circles in our chart above in order to know how to classify the star. As opposed to eager algorithms (e.g. decision trees), lazy algorithms store all the training data they will need in order to classify something and don’t use it until the exact moment they’re given something to classify.

Another term for lazy algorithms that might convey more of their function is “instance-based learning.” As the name connotes, algorithms of this type (generally) take in an instance of data and compare it to all the instances they have in memory.
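To make “lazy” and “instance-based” concrete, here is a toy classifier written from scratch (again, a sketch of the idea, not the code behind the quiz): fit() does nothing but memorize the training points, and all of the distance-measuring work is deferred until predict() is called.

```python
# A toy instance-based ("lazy") classifier, written from scratch to show the idea.
from collections import Counter
import math

class LazyKNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, points, labels):
        # "Training" is just memorizing the data -- no generalization happens here.
        self.points = points
        self.labels = labels

    def predict(self, new_point):
        # All the real work waits until we are handed something to classify.
        distances = [
            (math.dist(new_point, p), label)
            for p, label in zip(self.points, self.labels)
        ]
        nearest = sorted(distances)[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

knn = LazyKNN(k=3)
knn.fit([[1, 1], [2, 2], [8, 8]], ["yellow", "yellow", "purple"])
print(knn.predict([1.5, 1.5]))  # -> 'yellow'
```

That deferral is the whole point: the “model” is just the stored instances plus a distance computation at query time.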

Cathy O’Neil, a.k.a. “mathbabe,” gives a good example of instance-based learning with a grocery-store scenario:

What you really want, of course, is a way of anticipating the category of a new user before they’ve bought anything, based on what you know about them when they arrive, namely their attributes. So the problem is, given a user’s attributes, what’s your best guess for that user’s category?

Let’s use k-Nearest Neighbors. Let k be 5 and say there’s a new customer named Monica. Then the algorithm searches for the 5 customers closest to Monica, i.e. most similar to Monica in terms of attributes, and sees what categories those 5 customers were in. If 4 of them were “medium spenders” and 1 was “small spender”, then your best guess for Monica is “medium spender”.

Holy shit, that was simple!

Of course, things are never that simple, but we’ll reserve the complexity of K-NN for a later post. For now: K-NN = a lazy algorithm = it stores the data it needs and does nothing with it until it’s asked to make a classification.
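For the curious, here is O’Neil’s Monica scenario as runnable Python. Every customer, attribute (I made up “age” and “visits per month”), and number below is invented purely so that, with k = 5, four of Monica’s five nearest neighbors are medium spenders:

```python
# O'Neil's "Monica" scenario as a runnable sketch (scikit-learn, k = 5).
# All customers, attributes, and values are invented for illustration.
from sklearn.neighbors import KNeighborsClassifier

customers = [  # each row: [age, visits per month]
    [33, 6], [35, 7], [34, 5], [36, 6],   # medium spenders near Monica
    [32, 5],                              # one nearby small spender
    [22, 1], [55, 12], [60, 2], [45, 10], # customers much farther away
]
categories = [
    "medium spender", "medium spender", "medium spender", "medium spender",
    "small spender",
    "small spender", "big spender", "big spender", "medium spender",
]

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(customers, categories)

monica = [[34, 6]]
print(knn.predict(monica))  # -> ['medium spender'], by a 4-to-1 vote
```

One caveat: in a real analysis you would scale the features first, since attributes measured on different scales (like ages and visit counts) would otherwise contribute unevenly to the distance calculation.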

And that’s it! Now we have the building blocks to move onto discussing things like training, how exactly K-NN works in practice, and, most importantly, how Katz used it for his dialect quiz. Stay tuned for all of this in Part 2!

In the meantime, I encourage all of you to take the dialect quiz if you haven’t already (and take it again even if you have). You’ll need your answers later!
