Are You Unwittingly Helping to Train Google’s AI Models?

How Google is using your reCAPTCHA entries to train machine learning models

Google’s reCAPTCHA service is marketed as a means to protect websites from bots. If the system suspects a bot is trying to access a site, it puts up a test that only humans should be able to pass. If you spend enough time on the internet, you will have seen a version of it: a panel of images comes up and you have to select all the images that contain a fire hydrant, a car, or a bridge. If you have completed one of these challenges while trying to reach your favourite website, congratulations: you have labelled some data for a Google machine learning model. Deep inside Google’s reCAPTCHA webpages, this is what the company says about the use of data captured from this system:

reCAPTCHA also makes positive use of the human effort spent in solving CAPTCHAs by using the solutions to digitize text, annotate images, and build machine-learning datasets. This in turn helps preserve books, improve maps, and solve hard AI problems.

Let’s take a look at how Google are doing this, speculate about the models we are helping to improve, and consider what I make of a system in which people are unwittingly training some Alphabet Inc artificial intelligence model.

Quick overview of supervised machine learning

In a nutshell, a supervised machine learning model learns to classify data by picking up the patterns, or features, that characterise each class. To do this, the model is supplied with a large amount of labelled data, called training data. Labelled data is data that comes with a tag identifying its class. The algorithm learns which features are associated with which class so that it can classify new data.

So, to train an ML model to classify images of trains, planes, or boats, for example, thousands of labelled images of each are fed into the algorithm, which uses features like size, colour, and shape to distinguish the classes. After training, one can pass in new, unlabelled images of boats, trains, and planes, and the model will classify them based on what it learned from the training dataset.
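To make the idea concrete, here is a minimal sketch of supervised classification in Python. It is an illustration only: the two features (vehicle length and wingspan in metres), the six training examples, and the nearest-centroid method are all assumptions chosen for brevity. Real image classifiers learn their features from pixels, typically with neural networks.

```python
from collections import defaultdict
from math import dist

# Hypothetical toy training data: (length in metres, wingspan in metres) -> label.
TRAINING_DATA = [
    ((200.0, 0.0), "train"),
    ((150.0, 0.0), "train"),
    ((40.0, 35.0), "plane"),
    ((70.0, 60.0), "plane"),
    ((10.0, 0.0), "boat"),
    ((25.0, 0.0), "boat"),
]


def train(examples):
    """Learn one centroid (mean feature vector) per class from labelled data."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def classify(centroids, features):
    """Return the label whose centroid is closest to the new, unlabelled example."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


centroids = train(TRAINING_DATA)
print(classify(centroids, (55.0, 50.0)))  # a new example with a large wingspan
```

The labels in `TRAINING_DATA` are what make this supervised: without them, `train` would have no way to group the examples by class.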

How is Google collecting data from reCAPTCHA?

As mentioned earlier, if the reCAPTCHA service suspects a bot is trying to interact with a website, it will present a test to confirm you are human. Sometimes it is a simple checkbox. Other times it is the more interesting challenge of selecting the images in a set that fit a particular description. Once you have correctly identified the pictures that fit the description, you are allowed to access the page you intended to visit. So what you are doing in these challenges is providing labelled data that will be used in a training dataset for some AI under the Alphabet Inc umbrella.

The obvious question is how Google knows when a web user has selected all the images that fit the description. If the benefit for Google is users labelling data for an AI model, surely they don’t know in advance what every image contains. The answer is that when Google presents you with a panel of, say, six images, five of them are already labelled and one is not. You are asked to answer for all six, but you only need to be correct on the five images Google already has labels for. Your answer for the sixth, unknown image goes into the AI training dataset.
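The mechanism described above can be sketched in a few lines of Python. This is not Google’s implementation, which is not public; the function names, the image ids, and the majority-vote rule with a `min_votes` threshold are all hypothetical choices made for illustration.

```python
from collections import Counter


def grade_challenge(selected, known_labels):
    """Pass the user only if every already-labelled image is answered correctly.

    selected: set of image ids the user clicked.
    known_labels: dict of image id -> True if the image contains the target object.
    """
    return all(
        (image_id in selected) == is_target
        for image_id, is_target in known_labels.items()
    )


def collect_vote(selected, known_labels, unknown_id, votes):
    """Record the answer for the unlabelled image, but only from users who passed."""
    passed = grade_challenge(selected, known_labels)
    if passed:
        votes[unknown_id in selected] += 1
    return passed


def consensus_label(votes, min_votes=3):
    """Return the majority answer once enough users have voted, else None."""
    if sum(votes.values()) < min_votes:
        return None
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tied, so no consensus yet
    return ranked[0][0]


# Hypothetical panel: five images with known answers, plus one unknown ("img6").
known = {"img1": True, "img2": False, "img3": True, "img4": False, "img5": False}
votes = Counter()
results = [
    collect_vote({"img1", "img3", "img6"}, known, "img6", votes),  # passes, votes yes
    collect_vote({"img1", "img3"}, known, "img6", votes),          # passes, votes no
    collect_vote({"img1", "img3", "img6"}, known, "img6", votes),  # passes, votes yes
    collect_vote({"img2", "img6"}, known, "img6", votes),          # fails, ignored
]
label = consensus_label(votes)
```

Only users who get the known images right contribute a vote, which is how a system like this could trust answers about an image it cannot check directly.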

What is the data being used for?

As for which artificial intelligence this data is being used to train, that is basically unknowable unless you are inside the company. But we can make some educated guesses based on the types of images we’ve been asked to identify. reCAPTCHA challenges tend to involve roads, traffic signals, or cars. This may be a clue that the data goes towards training a model used by Waymo, Alphabet Inc’s self-driving car company. Google mention on their webpages that the data could be used to help improve maps, which also makes sense given the images we are presented with. Beyond that, it is difficult to know where all that data ends up.

Final thoughts

I think most people would feel there is a sense of deception or dishonesty in the way Google uses the data we provide for what is a commercial endeavour, without properly notifying users of what is happening. Here’s the thing: I don’t believe most people would be bothered if Google made it explicitly clear that some of the answers from reCAPTCHA will be used to train Google models in the future. I do think it is important to inform people of what is happening and to give them the option to opt out, though.

It is also worth noting that this system is only present in reCAPTCHA v2. Google now have a reCAPTCHA v3, which doesn’t interrupt users at all to detect bots. Instead, reCAPTCHA v3 scores all visitors to a site based on a range of metrics: the lower the score, the more likely you are a bot. However, reCAPTCHA v2 is still active on some websites. I will conclude by saying that more transparency from technology companies should be encouraged. I can only assume the lack of transparency comes from a worry that users will choose not to comply, but that should be a decision for us users to make.

Rugare Maruzani

Genetics PhD researcher
