Does Autocorrect Make Life Better?

A cautionary tale of systemic machine learning failure

John Hawkins
Towards Data Science
6 min read · Oct 8, 2022

Photo by Laura Rivera on Unsplash

One of the potential benefits of applying data science to many products and businesses is the promise of reduced friction and inconvenience in our everyday lives. The idea is that carefully crafted machine learning models, embedded in all the devices and services we use, will tirelessly toil to remove all manner of irritations and burdens, leaving us ever freer to focus on what matters in life.

Is this just an overly optimistic pipe dream?

If we are ever going to realise the potential of these technologies we need to take stock of the many small ways that machine learning fails us in everyday life. We could curate a list of things like racist image classifiers, sexist recruitment tools, or the many forms of psychopathy that can manifest in chatbots. Instead, let’s focus on a more mundane and widespread form of machine learning failure that affects minorities and majorities alike: autocorrection.

Autocorrection is a simple form of digital assistance. You type something, the machine recognises that it is not a word, and so it changes it to what it thinks you wanted to type. These systems are embedded in our phones, both in the operating system and sometimes in specific apps. Some versions are just basic statistical models of word similarity and frequency; others employ machine learning and consider the other words in the sentence. Their purpose, on the face of it, is clear: we want to remove typos from the text we write.

I write “Wutocoreect” and the device changes it to “Autocorrect”

I write “Gailire” and the device swoops in and changes it to “Failure”
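
To make “word similarity and frequency” concrete, here is a minimal sketch in the spirit of Norvig’s classic spelling corrector [3]. It is purely illustrative: the tiny word-count table stands in for frequencies learned from a large corpus, and a real product would use far more data and context.

```python
from collections import Counter
import string

# Stand-in for word frequencies learned from a large corpus (illustrative only).
WORD_COUNTS = Counter({"what": 500, "why": 300, "autocorrect": 5, "failure": 40})

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return the most frequent known word within one edit of the input."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS] or [word]
    return max(candidates, key=lambda w: WORD_COUNTS[w])

# "wht" is one edit away from both "what" and "why"; frequency alone decides,
# and the rest of the message is never consulted.
print(correct("wht"))
```

A corrector like this only ever asks which real word is closest and most common; it has no idea what the rest of the sentence is trying to say.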

A problem can emerge when the correction lands on a critical word in the sentence.

I type “Wht doya need?” and autocorrect changes it to “Why do ya need?”[1]

All of a sudden my attempt to ask a question requesting clarification or instructions becomes a pushback for justification. The entire sense of the sentence changes, with an accompanying potential for a negative emotional interpretation. To add insult to injury, the original text, complete with its misspellings, is perfectly comprehensible. This holds for many different typos, and is neatly demonstrated by the common practice of disemvoweling words in text messages.

It is worthwhile pausing and reflecting on that last point. The autocorrection feature happily rolled out onto my relatively modern smartphone is correcting words in a way that can change the meaning of the sentence. It does this even though the evidence suggests that, in most instances, the worst we can expect from misspellings is slower reading time [2].

This is a technology failure.

Rather than providing me with utility, this sophisticated software function is actively getting in the way of communication. How can this be? If we are to move forward in our deployment of data science in the world, we should thoroughly understand how such a mundane task can result in a product with negative outcomes.

The fundamental cause is that when these models are built they are evaluated using metrics that are disconnected from the impact on end users. In an ideal world we would consider how any changes to our writing affect the readability and comprehension of what we write. But getting a dataset that allows a machine learning developer to evaluate that end goal is hard. It is much easier to collect some data about common ways specific words are mistyped and evaluate models using standard metrics that describe the proportions of words that are correctly versus incorrectly modified (for examples see [3]). To be fair, these models can be used in situations, like correcting the content of search queries, that are less sensitive to communication mishaps. More recent academic work on evaluating autocorrection methods emphasises the importance of the context of the words [4] and the comprehensibility of the text [5]. Nevertheless, they all stop short of making the expected impact on comprehension the central focus of evaluation.
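
As a hypothetical illustration of that style of evaluation, the sketch below scores a corrector purely on the proportion of typos mapped back to the intended word. The test pairs and the mock corrector are invented for this example; the point is that every word counts equally, however important it is to the meaning.

```python
# Invented test set: (typo, intended word)
test_pairs = [
    ("wutocoreect", "autocorrect"),
    ("gailire", "failure"),
    ("wht", "what"),  # the miss that flips the meaning of the question
]

def word_accuracy(correct_fn, pairs):
    """Fraction of typos mapped to the intended word; every word weighted equally."""
    hits = sum(1 for typo, intended in pairs if correct_fn(typo) == intended)
    return hits / len(pairs)

# A mock corrector that fixes the two harmless typos but turns "wht" into "why".
mock_correct = {"wutocoreect": "autocorrect", "gailire": "failure", "wht": "why"}.get
print(word_accuracy(mock_correct, test_pairs))  # ~0.67: looks respectable, reads badly
```

Nothing in that score distinguishes a harmless miss from one that changes what the message is asking for.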

This is how machine learning projects add to our burdens. They get built by people who are either disconnected from the end users, are overwhelmed by the complexity of what end users want, or do not have the time or resources to evaluate models using data that reflects real world usage. So they simplify. They build something that can perform a well qualified and measurable task, and assume it is a small step in the right direction. Sometimes that works, and sometimes it doesn’t. When it doesn’t we get lumped with a technology that makes our lives subtly worse, even though it might seem like an improvement at first.

Ideally an evaluation of any text modifying model would weight words by their importance to sentence comprehension, or use heuristics that severely penalise models that return the wrong word when only a vowel is missing. It is not clear what the perfect evaluation would be, but it is worthy of investigation, because human communication is far more than just a large distributed spelling bee.
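
One rough way to do this, sketched below with invented weights, is to credit each correction by how much the target word matters to comprehension. In practice the weights might come from annotators or a reading study; here they simply encode the idea that missing “what” is far worse than missing “autocorrect”.

```python
# Invented test set: (typo, intended word, comprehension weight)
weighted_pairs = [
    ("wutocoreect", "autocorrect", 0.2),  # a miss here is a nuisance
    ("gailire", "failure", 0.5),
    ("wht", "what", 1.0),                 # a miss here changes the question
]

def weighted_accuracy(correct_fn, pairs):
    """Credit each hit by the comprehension weight of the intended word."""
    total = sum(weight for _, _, weight in pairs)
    score = sum(weight for typo, intended, weight in pairs
                if correct_fn(typo) == intended)
    return score / total

mock_correct = {"wutocoreect": "autocorrect", "gailire": "failure", "wht": "why"}.get
print(weighted_accuracy(mock_correct, weighted_pairs))  # ~0.41
```

The same mock corrector that scored around 0.67 on plain word accuracy now scores around 0.41, which is a closer reflection of how it feels to receive the mangled message.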

If the process of technological development stopped with each individual model, then the situation would not be that bad. Poorly designed systems would be replaced by better ones over time. Unfortunately, there are other, more complicated historical processes at work in technological development. Suboptimal decisions can become fixed in place by later development.

Let’s consider the case of the swypo.

A friend of mine recently introduced me to the term swypo, referring to incorrect words in messages that are created when using the touch screen swipe interface to draw letters. Part of the problem is that the interface has to interpret the intended letter. He attempted to send me the message “I’ll want to tell you in person” and instead I received “I’ll take to hell you in person.”

It appears that the autocorrection model’s obsession with perfect spelling is now affecting a second layer of technology. The swiping interface used by my friend tries to generate sequences of correctly spelt words. In doing so, it creates syntactically awkward sentences that are so far from the original intention that they are contributing to a new form of comedy [6].

This is how machine learning failure becomes a systemic problem. Initial shortcuts are taken that seem reasonable and result in models that provide the surface appearance of utility, but create a fine layer of frustration and inefficiency. Those approaches and their inherent problems become fixed in place by the subsequent layers of technology that are built on top. Gradually, poor, rushed or suboptimal decisions become the bedrock of our devices. This process is not new; the history of technology contains many examples, the QWERTY keyboard being the most commonly cited. But with machine learning, this technological path dependence promises to accelerate. Machine learning models are less visible, and often less well understood, layers of technology. Shortcuts in development and suboptimal design choices will aggregate to create a world of subtle systemic failures.

How can we avoid this?

Here is a test. If you are a data scientist or developer creating a machine learning model, you should be very clear about how you will choose the model to deploy. If your selection criterion is some kind of standard ML metric (like RMSE), then you should ask yourself how one unit of reduction in that metric will affect the business process or the users of that model. If you cannot provide a clear answer to that question, then you are potentially not solving the problem at all. You should return to the stakeholders, try to understand exactly how the model is going to be used, and then devise an evaluation metric that estimates real-world impact.

You might still optimise something like RMSE, but you will be choosing a model based on how it will affect people, and you might even discover that your model adds no value at all. In that case the best service you can do for society is to convince the stakeholders not to deploy until an improved model is developed.

[1] Example generated in the SMS app of a Google Pixel 4 Smartphone.

[2] Keith Rayner, Sarah J. White, Rebecca L. Johnson, and Simon P. Liversedge, Raeding Wrods With Jubmled Lettres: There Is a Cost (2006), Psychological Science, 17(3), 192–193

[3] Peter Norvig, How to Write a Spelling Corrector (2007)

[4] Daniel Jurafsky and James H. Martin, Spelling Correction and the Noisy Channel (2021), https://web.stanford.edu/~jurafsky/slp3/B.pdf

[5] Hládek D., Staš J., and Pleva M., Survey of Automatic Spelling Correction (2020), Electronics, 9(10):1670, https://doi.org/10.3390/electronics9101670

[6] Many examples are collected here https://www.damnyouautocorrect.com/
