
Businesses Are Buying Datasets When They Should Be Finding Signals

Access to external data solves one part of the problem, but creates another.

AI has brought an insatiable hunger for data. Big data never seems big enough. There’s always a desire for more volume, more variety. Today, businesses compete not so much on having the best ML algorithms as on having the best data and domain expertise.

To serve this appetite, commercial providers offer datasets of every description – financial, geospatial, biological. Businesses buy these datasets, organize them, pull out the attributes they think will make their models more predictive, and discard the rest.

But note that word: "think." They don’t know what they need. They guess. They assume. And that’s a problem.

Photo by Joshua Sortino on Unsplash

The issue

The whole point of using external data is to improve the results of AI and analytics beyond what can be achieved with data from within an organization’s four walls. Access to external data solves one part of the problem but creates another: the more options a business has, the harder it becomes to find which combination of data provides the best results. It’s like going out to buy a fork and being forced to buy the entire 300-piece dinner set.

Another way to look at it: businesses buy datasets, but they don’t really need datasets. They need specific ‘signals’ inside them. They would never throw the entire kitchen sink of a regional SMB dataset into an AI model. Instead, they would pick out the signals that seem predictive, like foot traffic, median income, website traffic, even annual precipitation.

Photo by Chinh Le Duc on Unsplash

A business might purchase data from a dozen data providers only to use one signal (or attribute) from each. Many of these datasets might even contain the same attributes. Even setting aside the problem of redundancy, the sheer profusion of data quickly becomes bewildering. Looking out over the thousands of signals available, how do you figure out which ones you need to fill the gaps in your model?

The need

This is why it’s so important for businesses to focus from the beginning not on the datasets they can have, but on the signals they need. With today’s data and ML technology, it’s possible to aggregate datasets from various sources, identify and tag the most relevant signals, and harmonize them for immediate use. With this approach, a business might find a hundred signals coming from two dozen data sources without having to evaluate and procure those two dozen sets.
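To make that concrete, here is a minimal sketch in Python/pandas of what aggregating, tagging, and harmonizing signals could look like. The provider files, column names, and shared business_id key are hypothetical illustrations, not any vendor’s actual schema; a real system would do this at catalog scale.

```python
import pandas as pd

# A minimal sketch of the aggregate-tag-harmonize idea (hypothetical sources).
raw_sources = {
    "provider_a": pd.read_csv("provider_a_smb.csv"),       # e.g. visits, ratings
    "provider_b": pd.read_csv("provider_b_regional.csv"),  # e.g. income, rainfall
}

# Tag map: raw provider column -> harmonized signal name
signal_tags = {
    "provider_a": {"daily_visits": "foot_traffic", "avg_rating": "online_rating"},
    "provider_b": {"med_hh_income": "median_income", "annual_precip_mm": "annual_precipitation"},
}

catalog = None
for source, df in raw_sources.items():
    tagged = df.rename(columns=signal_tags[source])
    keep = ["business_id"] + list(signal_tags[source].values())
    tagged = tagged[keep]  # keep only the tagged signals, discard the rest
    catalog = tagged if catalog is None else catalog.merge(tagged, on="business_id", how="outer")

print(catalog.head())
```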

Since the signals are tagged and harmonized, an ML-based system can recommend the ones most relevant for a particular analysis. For example, foot-traffic data or online reviews and ratings might inform credit risk predictions for a small business. The system can uncover hundreds of such signals that a data scientist might never have thought to include.
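As a rough illustration of how such a recommendation might work (the article doesn’t prescribe a method), one simple approach is to score each harmonized signal against an internal target and surface the top scorers. The file names, the defaulted target, and the use of mutual information below are all assumptions for the sketch; real systems are more sophisticated.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Hypothetical inputs: a harmonized signal catalog and internal loan outcomes,
# both keyed by business_id.
signals = pd.read_csv("harmonized_signals.csv")
internal = pd.read_csv("internal_loans.csv")   # includes a 'defaulted' flag

joined = signals.merge(internal, on="business_id")
X = joined.drop(columns=["business_id", "defaulted"]).select_dtypes("number").fillna(0)
y = joined["defaulted"]

# Mutual information as a stand-in relevance score.
relevance = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
print(relevance.sort_values(ascending=False).head(10))
```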

Moreover, once these signals are harmonized, there’s no need to worry about data matching and integration by hand. The system can automatically match and integrate internal data with the selected signals.
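A simplified sketch of that enrichment step, assuming the external signals and internal records share a clean business_id key; in practice the matching also involves entity resolution across names, addresses, and identifiers, which is glossed over here.

```python
import pandas as pd

# Hypothetical inputs, continuing the example above.
internal = pd.read_csv("internal_loans.csv")
signals = pd.read_csv("harmonized_signals.csv")

# Join only the recommended signals onto the internal records, so the training
# data never carries the unused bulk of the purchased datasets.
selected = ["foot_traffic", "online_rating", "median_income"]
training_set = internal.merge(signals[["business_id"] + selected], on="business_id", how="left")

print(training_set.columns.tolist())
```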

The opportunity

This approach – putting signals before datasets – might seem like a small procedural adjustment, but it transforms the day-to-day use of AI and analytics. Rather than being limited to internal data, or to what it can procure on its own, a business can suddenly access any signal it might need, allowing it to build any analysis it can imagine.

It’s very similar in concept to composite applications, where developers combine contextual services from disparate applications to compose processes like order-to-cash. Doing the same with data analysis – picking predictive signals out of a universe of data without needing to buy entire datasets – gives businesses the freedom to bring their biggest ideas to life.

