How to make an idea machine, part one

Jesse Paquette
Published in Tag.bio
Jan 31, 2017 · 5 min read


“Give a person a discovery from data, and they will innovate for a day. Teach a person to make discoveries from data, and they will innovate for a lifetime.” (Ancient Data Science Proverb)

We already know that we can’t train enough data scientists to meet even today’s demand, and the size and variety of data collected by organizations are growing exponentially. Simply training more data scientists will not solve this problem. So let’s extend that proverb with one more sentence:

Build a machine that enables any person in an organization to make discoveries from data, and the entire organization will achieve perpetual innovation.

That has been our mission from day one at tag.bio — to automate away inefficiencies and thereby accelerate high-impact discovery from data across all industries. In this article, I discuss why and how our “idea machine” works.

Breaking down the process

Discoveries from data can give organizations critical insight about past successes and failures, or they can identify emergent phenomena occurring in a field or industry. The impact of any given insight ranges from modest to monumental — some discoveries completely transform how an organization operates.

It’s critical to note that neither the success nor the quality of any discovery is guaranteed. Organizations must focus on maximizing their probability of high-impact discovery while accounting for the time and resources expended.

In my experience, the greatest probability of producing a high-impact discovery comes from repeatedly connecting the right person with the right information.

Who is the right person?

Domain experts are specialists with many years of education and experience in an industry — they’re the right people to make discoveries. I learned this first-hand during my time in cancer research. Domain experts — such as the cancer biologists and doctors with whom I worked — have extensive and specialized knowledge. They have ideas. They have deep questions. But they can’t do data analysis, and that’s a problem. For that, domain experts are entirely reliant on data scientists.

I’d hypothesize that there are around 100 times more domain experts than there are data scientists. As a rough illustration, picture an organization that is 90% domain experts with no data-analysis skills, 9% domain experts with some self-service analysis skills, and 1% data specialists (say, 0.9% data analysts and 0.1% data scientists). Combine the 90% and 9% groups and compare them to the 0.9% and 0.1% groups: roughly 99 to 1. Of course, actual ratios will vary by industry and over time.

If we consider that many domain experts already have some data analysis ability using Excel or Tableau, then the ratio changes (compare the 90% group to everyone else: roughly 9 to 1), but a significant discrepancy remains.
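To make the arithmetic explicit, here is a minimal sketch in Python. The percentages are the illustrative assumptions above, not measured data, and the group names are mine:

```python
# Hypothetical staffing breakdown from the illustration above.
# These percentages are assumptions for illustration, not measurements.
groups = {
    "experts_no_analysis_skills": 90.0,  # % of the organization
    "experts_excel_or_tableau":    9.0,
    "data_analysts":               0.9,
    "data_scientists":             0.1,
}

# Strict comparison: all domain experts vs. all data specialists.
experts = groups["experts_no_analysis_skills"] + groups["experts_excel_or_tableau"]
specialists = groups["data_analysts"] + groups["data_scientists"]
print(f"{experts / specialists:.0f} domain experts per data specialist")  # ~99

# Looser comparison: count self-service Excel/Tableau users
# on the "can analyze" side of the ledger.
can_analyze = 100.0 - groups["experts_no_analysis_skills"]
print(f"{groups['experts_no_analysis_skills'] / can_analyze:.0f} to 1, even then")  # 9
```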

Data scientists are a bottleneck to discovery

Here’s a typical scenario. First, one of the many domain experts in an organization approaches one of the few data scientists and poses a deep question.

Note that it takes time (minutes to hours) to communicate the question in a way the data scientist understands. Also note that it takes even more time (hours to days) just to schedule and wait for that conversation to happen.

After gaining a proper understanding of the question, the data scientist formulates a concept of how best to answer the question.

The data scientist then accesses and transforms the data (minutes to days), performs the analysis to produce answers (minutes to days), and then waits (hours to days) to meet again with the domain expert.

Finally, the data scientist communicates the answers back to the domain expert (minutes to hours). With luck, those answers represent an interesting discovery from data — but more likely than not, those answers just lead to more deep questions.
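To see how this latency adds up, here is a minimal sketch that tallies the steps above for a single question/answer iteration. The step list mirrors the scenario; the hour values are my own assumptions (reading “days” as up to five working days):

```python
# Illustrative latencies, in hours, for one question/answer iteration.
# The (best, worst) ranges follow the scenario above; the exact numbers
# are assumptions for the sake of the arithmetic, not measurements.
STEPS = {
    "communicate the question":  (0.5, 8.0),    # minutes to hours
    "schedule the conversation": (2.0, 120.0),  # hours to days
    "access and transform data": (0.5, 120.0),  # minutes to days
    "perform the analysis":      (0.5, 120.0),  # minutes to days
    "wait for a second meeting": (2.0, 120.0),  # hours to days
    "communicate the answers":   (0.5, 8.0),    # minutes to hours
}

def iteration_latency(steps):
    """Return (best, worst) total hours for one full iteration."""
    best = sum(low for low, _ in steps.values())
    worst = sum(high for _, high in steps.values())
    return best, worst

best, worst = iteration_latency(STEPS)
print(f"best case:  {best:.0f} hours")       # 6 hours
print(f"worst case: {worst / 24:.0f} days")  # ~21 days
```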

Three major problems are evident in this process

  1. Time. Note all the steps above that had the potential to take days. Add those up (the sketch above does exactly this), and in a worst-case scenario, a single iteration of this process takes weeks to months. Even the best-case scenario can be expected to take hours to days.
  2. Limited resources. Because there are so many more domain experts than data scientists, only a handful of deep questions can actually be answered at any given time — with priority typically given to questions from VIPs who are not necessarily the right people.
  3. Giving up too early. This is by far the biggest problem. If it takes so long to answer one deep question, then there is no time left for answering any follow-up questions.

And there’s the rub. Discovery from data doesn’t typically happen when the first deep question is answered, because the answer — no matter how precise — only has a low probability of being the right information to produce a discovery. The right information is only achieved after a thorough series of question/answer iterations — in other words, by “drilling down”.
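Drilling down pays that full latency on every iteration. Continuing the sketch above (reusing its `STEPS` and `iteration_latency`), and assuming purely for illustration that a discovery takes five question/answer iterations:

```python
# Each drill-down iteration repeats the full question/answer cycle,
# so latency scales linearly with the number of iterations.
DRILL_DOWN_ITERATIONS = 5  # illustrative assumption

best, worst = iteration_latency(STEPS)
print(f"best case:  {best * DRILL_DOWN_ITERATIONS / 24:.1f} days")          # ~1.2 days
print(f"worst case: {worst * DRILL_DOWN_ITERATIONS / (24 * 7):.0f} weeks")  # ~15 weeks
```

Under these assumptions, a single line of inquiry ties up weeks to months of calendar time, which is exactly why follow-up questions get dropped.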

Organizations today don’t have enough resources or time to dedicate to this process, because the process itself is grossly inefficient. As a result, despite all their investments in data lakes and business intelligence dashboards, organizations are failing to make high-impact discoveries from data.

Even worse — each failure to discover diminishes belief in the transformative capabilities of data analytics.

The solution to an inefficient process (as usual) is automation

In part two, I provide specific details on how creative software design and architecture can automate a significant portion of the bottlenecks described above — accelerating discovery by an order of magnitude — for any data source. And just like other processes that have been improved by automation, it feels like magic.


Jesse Paquette
Tag.bio

Full-stack data scientist, computational biologist, and pick-up soccer junkie. Brussels and San Francisco. Opinions are mine alone.