I wanted to call this article "Parasitic Labeling of AI Training Data," but apparently that’s too complicated. What I want to tell you about is an often-neglected aspect of machine learning: data labeling.
Supervised machine learning with deep neural networks is the most common kind of AI out there. Supervised learning means learning from labeled data. But often you don’t have labeled data; you have unlabeled data. Even more commonly, you don’t have GROWING data. Labeled or not, if the training data you feed into the AI is a fixed-size dataset, then your AI is not going to get smarter over time. What you really want is a system where the AI learns from data labeled in the wild. It gets smarter while you sleep, and you don’t pay a penny.
Unfortunately, the most common mechanism for bootstrapping AI these days is human-based labeling with Amazon Mechanical Turk, or even full-time data labeling firms. Sometimes the job is annotating a text corpus, and in other cases it is annotating images.
When you think about why we deploy machine learning for our clients, it is typically to automate human intuition. We are replacing a function that was previously performed by a human with software.
Machine learning is about finding a function f that maps input data X to output data Y. Or, as we learn in high school:
Y=f(X)
Because the thing we are trying to approximate (f) is a human's decision-making, we need to collect training data on what decisions the human makes (Y), and what raw data each decision was based on (X).
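To make the X, Y and f idea concrete, here is a toy sketch in Python. It is not a neural network and not anything we deploy; it simply "learns" f by remembering the most common human decision (Y) seen for each input (X). The function name fit_majority and the sample messages are made up for illustration.

```python
from collections import Counter, defaultdict
from typing import Callable, Hashable, Iterable, Optional, Tuple


def fit_majority(
    pairs: Iterable[Tuple[Hashable, Hashable]],
) -> Callable[[Hashable], Optional[Hashable]]:
    """Learn a stand-in for f from logged (X, Y) pairs by majority vote."""
    votes = defaultdict(Counter)
    for x, y in pairs:
        votes[x][y] += 1  # count how often a human chose y for input x
    # For each input, keep the decision humans made most often.
    table = {x: counts.most_common(1)[0][0] for x, counts in votes.items()}

    def f(x: Hashable) -> Optional[Hashable]:
        # Inputs never seen in the log have no prediction (None).
        return table.get(x)

    return f


# Hypothetical log of human decisions: (raw input X, decision Y).
logged = [
    ("invoice overdue", "urgent"),
    ("invoice overdue", "urgent"),
    ("invoice overdue", "ignore"),
    ("newsletter", "ignore"),
]
f = fit_majority(logged)
print(f("invoice overdue"))  # urgent
```

A real model generalizes to inputs it has never seen, which this lookup table cannot do, but the shape of the problem is the same: more logged (X, Y) pairs give a better f.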
Now that we have talked about what our AI is doing and the need for labeled data (X and Y), let’s see how we get the data labels for free. To avoid paying for data labeling, you want some magic data annotation solution: wherever possible, get the users of the AI to do the labeling. If you do it right, they won’t even notice.
It is super important to sneak a feature into your design that grows your training data. For example, imagine that in Google Maps I speak into my smartphone and ask for the address "46 de la Côte-des-Neiges Rd." Because of my poor French accent, the application returns a bad address, as shown in the figure below.

The magic here is this: when I press the back button right away, Google knows I pressed the back button. This is a hint that the transcription of the address from my voice was incorrect, and that hint can be used to improve the accuracy of the application. Similarly, when I follow the directions to the address, this is a good sign that the model guessed correctly that the address I spoke is the address I wanted.
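Here is a minimal sketch of how signals like these could be turned into labels. I have no idea how Google actually does it; the Interaction fields, the implicit_label function and the 5-second threshold are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interaction:
    """One voice search, as a hypothetical app might log it."""
    audio_id: str                        # reference to the recording (X)
    predicted_address: str               # what the model returned
    seconds_until_back: Optional[float]  # None if back was never pressed
    followed_directions: bool            # did the user navigate there?


def implicit_label(
    event: Interaction, quick_back_threshold: float = 5.0
) -> Optional[bool]:
    """Return True if the prediction looks right, False if it looks
    wrong, and None when the behavior gives no clear signal."""
    pressed_back_quickly = (
        event.seconds_until_back is not None
        and event.seconds_until_back <= quick_back_threshold
    )
    if pressed_back_quickly:
        return False  # immediate back press: probably a bad result
    if event.followed_directions:
        return True   # user drove there: probably the right address
    return None       # ambiguous, so do not label it at all
```

Returning None for the ambiguous cases is a deliberate choice in this sketch: an unlabeled example costs nothing, while a wrong label pollutes the training data.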
In a similar example, the messages that I pin in Google Inbox indicate to Google which unlabeled messages I might want to pin as important items in the future. Lo and behold, a couple of days ago Inbox started a "Highlights" section that shows me what messages seem most urgent.
When you get this data collection going, there is a wrong way and a right way to do it. If the user feels their time is being wasted (e.g. "before you check out, label this image"), then the results you get will be low-quality garbage. When YouTube asks you to fill in a survey, the results will be AWFUL.

High-quality data collection from your users can improve machine learning over time, but the labeling has to feel effortless to the user before you can trust that the data quality is high. Another trick is to align the incentives of the users with the data labeling task. For example, users have a high incentive to press the back button in a user flow when the AI made a mistake, but a low incentive to do so otherwise. Sometimes this approach to data collection is just not possible, but it is worth making your best effort to get this working.
And so in conclusion, get your UX team in on the machine learning conversation. Find some way to snap a data collection feature into your AI solution, so that you reap the many benefits of high-quality, user-generated data.
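To show what "improving over time" could look like in code, here is a small sketch of a dataset that grows as users generate labels and tells you when enough new examples have arrived to justify retraining. The GrowingDataset class and its retrain_every parameter are hypothetical names, and a real pipeline would persist the examples rather than hold them in memory.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class GrowingDataset:
    """Accumulates user-generated (X, Y) pairs and flags retrains."""
    retrain_every: int = 100  # assumed batch size between retrains
    examples: List[Tuple[Any, Any]] = field(default_factory=list)
    _since_last_train: int = 0

    def add(self, x: Any, y: Any) -> bool:
        """Store one labeled example; return True when a retrain is due."""
        self.examples.append((x, y))
        self._since_last_train += 1
        if self._since_last_train >= self.retrain_every:
            self._since_last_train = 0  # reset the counter for the next batch
            return True
        return False
```

Each time add returns True, you would kick off training on everything in examples, so the model keeps getting smarter without anyone being paid to label.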

Before I go, a big thank you to our online followers! We reached our first 1K followers on Medium.com. If you enjoyed this article on Artificial Intelligence, then please try out the clap tool. Follow us on Medium. Go for it. I’m also happy to hear your feedback in the comments. What do you think?
Happy Coding!
-Daniel [email protected] ← Say hi. Lemay.ai 1(855)LEMAY-AI