Why small data is the future of AI

Bradley Arsenault
Towards Data Science
6 min read · Aug 7, 2018


I’ve spent the last 8 months going out and pitching big ideas for artificial intelligence solutions. I’m frequently faced with business people who have been schooled for the last decade on the importance of data. Unfortunately, this means my services often get conflated with data analytics and big-data consulting. From the business person’s perspective, the ask is simple: “We have all this big data. Can you come in and make us more money from it?”

I find this frustrating for two reasons. The first is that artificial intelligence technology has so much more potential than just analyzing big data sets. There are plenty of well-established technologies for analyzing big data sets, and plenty of well-established consultants. The second is that, for most of the artificial intelligence problems that exist, there are no big datasets available to use.

Many artificial intelligence companies that are working on truly novel solutions have to gather the datasets for those solutions manually. A huge portion of the capital invested in an artificial intelligence startup goes into gathering the dataset necessary to make its product work.

Example

Let’s take one of the growing AI use-cases as an example: AI-based contract review for lawyers (e.g. Blue Jay Legal, eBrevia, Kira Systems, Law Geex, etc.). To accomplish this, all you need is a large dataset of contracts that have been annotated with lawyers’ feedback. This may include flagging clauses that appear to be screwing over one party or the other, clauses which are highly unusual for this type of contract, language which is out of the ordinary or unclear, and so on.

Lawyers, however, are notoriously old-fashioned in the way they work. If you’re lucky, you might find that some law firms have kept draft contracts with red-lines in them, and that would be useful if all your product needed to do was reproduce the red-lines. However, if you want more qualitative feedback on a contract, you’re out of luck. And even if you do manage to amass this dataset from a law firm’s past work, it is still not going to be big data. At most, you might be able to get 10,000 or 20,000 reviewed contracts, which is nowhere close to the size of “big data” that people talk about today, often on the scale of millions, billions, or even trillions of entries.

To make this work, the AI contract-review startup has to make its technology work with the datasets it can actually get access to. It has to be able to work with small data.

Why is small data hard?

Humans are able to learn from small datasets, so why aren’t machines? The answer is simple: humans haven’t actually learned from a small dataset. We have, since we were born, been learning from a continuous feed of data coming in through our five senses.

The current largest image-processing dataset, ImageNet, contains about 14 million images. As a conservative estimate, let’s assume a person sees a distinct image once every 30 seconds. By the time a person enters their professional career at around 25 years old, they have thus seen:

25 years * 365.24 days/year * 16 hours/day * 60 minutes/hour * 2 images/minute

= 17,531,520 distinct images
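
As a quick sanity check, the same back-of-the-envelope figure can be reproduced in a few lines of Python; the 16 waking hours per day and 2 images per minute are the assumptions above, not measurements:

```python
# Back-of-the-envelope estimate of distinct images a person has seen by age 25,
# using the assumptions above: 16 waking hours/day, one new image every 30 seconds.
years = 25
days_per_year = 365.24
waking_hours_per_day = 16
images_per_minute = 2  # one distinct image every 30 seconds

images_seen = years * days_per_year * waking_hours_per_day * 60 * images_per_minute
print(f"{images_seen:,.0f} distinct images")  # -> 17,531,520 distinct images
```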

Every single one of us, by the time we enter our professional careers, has been exposed to a larger visual dataset than the largest dataset available to AI researchers. On top of this, we have sound, smell, touch, and taste data all coming in from our other senses. In summary, humans have a lot of context about the human world. We have a general, common-sense understanding of human situations. When analyzing a dataset, we combine the data itself with our past knowledge in order to come up with an analysis.

The typical machine learning algorithm has none of that — it has only the data you show to it, and that data must be in a standardized format. If a pattern isn’t present in the data, there is no way for the algorithm to learn it. The signal must be greater than the noise.

Where all the action will be

This might be a distressing fact for the artificial intelligence community. For many, if not most, professional jobs, there are no big datasets available, and gathering a big dataset to represent the task may be prohibitively expensive. Let’s say you’re trying to collect a dataset to automate an accounting audit of a small business. Using current best-practice techniques, you would likely need several datasets (the numbers are purely speculative):

  • 50,000 instances of accounting statements where “red-flags” and “areas of interest” have been annotated
  • 250,000 conversations with company staff to request supporting documentation on a transaction
  • 100,000 cases where a piece of supporting documentation has been analyzed for validity with respect to a given transaction
  • 50,000 final reports written on the audit

Even though this dataset doesn’t top a million entries, you can already see just how pricey it’s getting. If we assume an auditor charges $150/hour and that auditing a small business requires 20 hours of their time, then this dataset is going to cost a whopping $150 million. It’s no surprise that we haven’t seen any AI startups tackling the time-consuming accounting audit process. The only way to build the AI auditor is to somehow make technology that can work with a smaller dataset. And in fact, the vast majority of tedious jobs we will try to automate have this same basic problem. Once the low-hanging AI fruit is used up, how do we move up to the middle fruit? How do we reach the fruit at the top of the tree? How do we crack small datasets?
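
For concreteness, here is that cost estimate worked through explicitly; it assumes one full 20-hour audit engagement per annotated final report, which is my reading of the figures above rather than something stated outright:

```python
# Rough cost of assembling the speculative audit dataset above.
# Assumes one audit engagement produces one annotated final report,
# so the dataset requires roughly 50,000 full audits.
audits_needed = 50_000
hours_per_audit = 20
rate_per_hour = 150  # USD

total_cost = audits_needed * hours_per_audit * rate_per_hour
print(f"${total_cost:,}")  # -> $150,000,000
```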

Enter transfer learning

Somehow, we need to give our AI systems general-purpose knowledge of the human world, the same kind that humans have. Transfer learning is an up-and-coming technique that allows us to take the knowledge learned from one dataset and apply it to another.

Transfer learning largely took a back seat in the machine learning community until the recent rise of deep neural networks. Deep neural networks are extremely flexible compared to most other machine learning techniques. They can be trained, chopped up, modified, retrained, and generally just abused in all sorts of ways. This has led to all sorts of novel situations where we can apply transfer learning.

  • In text processing, we can use a shallow neural network called word2vec to encode the meaning of a word as a vector by reading tens of billions of lines of text from the internet. These vectors can then be applied to more specific tasks.
  • In image classification, we can train a neural network on the very large ImageNet dataset and then retrain it on a smaller dataset for which we may only have a few thousand images (a sketch of this follows the list).
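
As a rough illustration of the second point, here is a minimal fine-tuning sketch using PyTorch and torchvision; the framework choice, the ResNet-18 backbone, and the 5-class toy task are my assumptions for illustration, not anything prescribed above:

```python
import torch
import torch.nn as nn
import torchvision

# Load a ResNet-18 whose weights were learned on ImageNet (~14 million images).
model = torchvision.models.resnet18(pretrained=True)

# Freeze the pretrained layers so their general-purpose visual knowledge is kept.
for param in model.parameters():
    param.requires_grad = False

# Replace the final classification layer with one sized for the small task
# (a hypothetical 5-class problem); only this new layer will be trained.
num_classes = 5
model.fc = nn.Linear(model.fc.in_features, num_classes)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# `small_loader` would be a DataLoader over the few thousand labelled images.
# for images, labels in small_loader:
#     optimizer.zero_grad()
#     loss = criterion(model(images), labels)
#     loss.backward()
#     optimizer.step()
```

Because the frozen backbone already encodes generic visual features learned from ImageNet, only a tiny number of parameters have to be learned from the small dataset.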

More recently, however, more ambitious instances of transfer learning have been shown to be successful. Just a few months ago, Google Brain released research on their MultiModel, in which they trained the same deep neural network on eight different tasks simultaneously and found that the system’s performance improved on the tasks with small datasets. Although it’s early days, the research provides a tantalizing glimpse of how we might give our AI systems context about the world that allows them to understand small datasets.

So what’s next?

I believe that big data is nearing the peak of its hype. As more and more companies reach maturity in their collection and use of big datasets, they will begin to ask, “What’s next?” I believe they will increasingly look towards automation using small datasets as the next phase of their data strategy.

For every dataset with one billion entries, there are 1,000 datasets with one million entries, and 1,000,000 datasets with only one thousand entries. So once the low-hanging fruit has been exhausted, the only possible way to move forward will be to climb the tree and build systems which can work with less and less data.

This article was originally published at www.electricbrain.io.
