AI: Fake it ‘till you make it

Pierre-Julien Grizel
Towards Data Science
6 min read · Aug 4, 2019

There are regular rants about companies that pretend to do AI but actually don't. The truth is, sometimes they're just planning ahead, and it's actually a good way to get started on big challenges. Let's see how it helps and how to do it (really) right: how to build an ML MVP (Machine Learning Minimum Viable Product)!

What if AI was just a big scam? Taken from “Scooby-Doo, Where Are You!” episode “Hassle in the Castle”

What does AI really bring you?

Wait: why do you want to do AI in the first place, by the way? What would AI (or Computer Vision, Machine Learning, NLP, etc.) actually bring you?

Sometimes, instead of “AI” or “Machine Learning”, I like to think about projects as “computer-aided decision making”, and I always remember that machine learning is, as Cassie Kozyrkov says, “just” a thing labeller.

A face detection algorithm? Thing-labeller.
A visual metal fatigue detector? Thing-labeller.
A chatbot, whose first task is to understand what the user intends to say? Thing-labeller.

Loooots of AI projects start with a program that labels things right.
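
To make the abstraction concrete, here is a tiny Python sketch of the “thing-labeller” contract: something goes in, a label comes out. The class names and the toy rules below are mine, purely illustrative.

```python
# Minimal sketch of the "thing-labeller" contract: input in, label out.
# Names and rules are illustrative only.
from abc import ABC, abstractmethod


class ThingLabeller(ABC):
    """Whatever the domain, the job is the same: label a thing."""

    @abstractmethod
    def label(self, thing) -> str:
        ...


class ToyIntentLabeller(ThingLabeller):
    """Chatbot-style labeller: map a user message to an intent."""

    def label(self, thing: str) -> str:
        text = thing.lower()
        if "refund" in text:
            return "refund_request"
        if "hello" in text or text.startswith("hi"):
            return "greeting"
        return "unknown"


print(ToyIntentLabeller().label("Hi, I'd like a refund please"))  # refund_request
```

Whether that `label` method is backed by hand-written rules, a human operator or a neural network is an implementation detail, and that's the whole point of the rest of this post.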

As far as I know, humans have been doing a loooot of thing-labelling without machines. Machine learning is a relatively recent addition to computer science, but the world existed before it. How did we solve the tasks that we now give to our neural networks?

We did it either by hand or with a lot of programming. If you remember well, deep learning made feature extraction independent of prior knowledge, which is pretty cool… when it works.

Actually, machine learning gives us 3 major breakthroughs:

  • Automatic feature extraction, provided we feed the NN with enough examples. That means faster design (and training) compared to manual feature extraction;
  • Instantaneous answers (your mileage may vary, but for our work at NumeriCube we often expect algorithms to answer in under 0.5 s);
  • Scaling. Massive parallel processing powa. ML + Cloud = a massively scalable thing-labelling solution. Humans can't scale as well or as fast.

These are the 3 main advantages a machine learning algorithm has over human intervention.

In an ideal world…

What does anyone need to build a thing-labeller?

  • Lots of example data
  • A state-of-the-art analysis to begin with, aka AI expertise (that's where experience talks :))
  • A lot of time and the resilience to follow the long trial-and-error path.

Pretty easy, isn’t it? Normally, a Data Science app building process looks like this:

The usual Data Science workflow

Obviously, you need data to get started. In an ideal world, you'd just build (or ask someone to build) a clean dataset for your problem and start working on it. However, that's not always possible, and very often it's not possible at all.

Example data? My personal experience is that when a project starts with an already existing dataset, about 99% of it is going to be discarded, mostly because lab data is practically never acquired the same way as production data.

For example, think about an image recognition app. You could train it on millions of already-annotated images from Flickr, right? Well, except that when your average user takes a picture with her smartphone, the result is not going to look like the neat, photoshopped DSLR+flash pictures you built your training set on. Damn it.

So, you have a business problem, you're pretty sure you can tame it with enough data, and you know that the best data you can get comes from the production system.

In other words, you need your system to be in production in order to do the data collection the right way! Looks like a chicken and egg problem!

Photo by Sebastian Staines on Unsplash — Does the data come before the app, or the app before the data 🤔

How do you do this when you work at Google?

With big platforms (think Google, Facebook, Apple, etc.), it's very easy to separate the data collection process from the data science process. Why? Because these companies already have a huge data collection workflow in place.

How big companies are doing data science (thanks to other apps)

In a way, Google, Facebook, Amazon et al. have spent the last 10+ years building data collection apps. For example, Facebook has let you tag your friends in your photos since at least 2008! They now have a massive face detection dataset… It works well because their data collection applications already deliver a service in themselves.

But if you want to build an AI service from scratch, you can't do that: if you don't already have data, you have to collect it and build the service in one step.

Side note: this is actually what happened to most of the big companies working on Autonomous Vehicle dataset collection, and it's probably the reason why they're struggling so much with it: apart from Tesla, they didn't have a data collection process in place when they started working on the subject.

If you’re reading this blog post and you’re not Google, then you probably neither don’t have a nice dataset ready for use. So, how can you build an AI application AND collect data at the same time?

Wait a minute: do you really need real-time and scaling right now?

The answer is: fake it. Use a human to do it.

The app building loop now looks more like this:

The loop for collecting data while training an algorithm

By working this way, on the left part of the diagram, you'll lose:

  • Instant answers (unless you have a bunch of slaves stuck to a computer screen, it'll probably take more than one second to answer a thing-labelling query);
  • Scaling (unless you rule a country where you can force-enrol a bunch of slaves, which is out of the scope of this story; also, we don't encourage it).

Is it a problem right now? Does your application really need to perform in real time? If you're building an autonomous vehicle, the answer is probably yes, but are we so sure? Why couldn't an autonomous vehicle of some sort be driven by humans (even remotely)?

In most applications, from quality control to sentiment analysis, real time is a big advantage, but what if you don't have real time at the start? Does it really prevent your application from being used?

In most cases, an MVP for a Machine Learning application should avoid real-time processing.

Once you set up your app for this human-in-the-loop approach, you just have to collect data for a while and let your data scientists work while already satisfying early-stage users.
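
To make this concrete, here is a minimal, purely illustrative Python sketch of what “faking it” can look like behind the scenes: the labelling request goes into a queue that human operators answer, and every answer is saved as a training example for the data scientists. All the names below are mine, not an actual NumeriCube tool.

```python
# Minimal human-in-the-loop sketch: the "model" is a queue of pending
# requests answered by human operators; every answer becomes training data.
# All names are illustrative, not an actual library or product.
import csv
import uuid
from queue import Queue

pending = Queue()   # requests waiting for a human operator
answers = {}        # request_id -> label, once an operator has answered


def request_label(thing) -> str:
    """Called by the application instead of a real model: enqueue and return an id."""
    request_id = str(uuid.uuid4())
    pending.put((request_id, thing))
    return request_id  # the caller polls for the answer later (no real time here)


def operator_answer(request_id: str, thing, label: str) -> None:
    """Called from the operators' labelling UI."""
    answers[request_id] = label
    # The side effect that matters: every answered request becomes a training example.
    with open("training_set.csv", "a", newline="") as f:
        csv.writer(f).writerow([request_id, thing, label])


# Usage: the app asks for a label, a human answers, training data piles up.
rid = request_label("photo_1234.jpg")
item_id, item = pending.get()
operator_answer(item_id, item, "rusted_bolt")
print(answers[rid])  # rusted_bolt
```

Later on, swapping the human queue for a trained model only means changing what sits behind request_label; the application code and the collected data format stay exactly the same.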

Do not fail miserably at this!

Even if we encourage this approach, there's a huge caveat: if your machine learning algorithm never works, you'll fail miserably. So, in order to work this way, you have to be sure you'll be able to deliver it at some point!

However, the big advantage of having a human perform the real task first is that, given the proper conditions, you can be pretty sure a machine learning algorithm will eventually catch up with human precision. Why?

  1. If your operators work with limited information (i.e. with the same amount of information as your algorithm will have), you can be pretty sure you'll find relevant state-of-the-art results to reassure you. Image recognition or NLP contests will give you a rough idea of how difficult these tasks are.
  2. If your operators are only trained for a few minutes with a few examples and do not require expert knowledge to perform the task, chances are the challenge is not too difficult for a machine learning algorithm.
  3. By putting this pipeline in place, you'll MECHANICALLY have to preprocess your data in a way that is both suitable for your operators (and hence for your data scientists ;)) and reproducible in production, since you're already in production (see the sketch below).
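
For instance (a hypothetical sketch, not our actual pipeline): if a single preprocessing function feeds both the operators' labelling screen and, later, the model, reproducibility in production comes for free.

```python
# Illustrative only: one shared preprocessing function guarantees that the
# operators, the data scientists and the future model all see the same data.
from PIL import Image


def preprocess(path: str) -> Image.Image:
    """Single source of truth: open, normalise and resize exactly as production will."""
    img = Image.open(path).convert("RGB")
    return img.resize((224, 224))

# The operators label preprocess(x) today; the model trains on preprocess(x) tomorrow.
```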

Doing so led me to change the data capture strategy for a big computer vision app we’re building at NumeriCube. And it probably saved thousands of hours of labelling!

We’re in the process of releasing the tools we use at NumeriCube for our day to day Computer Vision projects: visual data collection, annotation, storage and management. If you enjoyed this post and/or want to stay in touch, I’d love to hear from you in the comments!
