How to innovate in Data Science

Sharing my thoughts on innovation, along with one of my favorite Data Science projects, from 2014

Pan Wu
Towards Data Science


Photo by AbsolutVision on Unsplash

Millions of innovations happen in the world every day. Innovations create new products, services, business models, technologies, and even new scientific fields; if there were no innovation, we would be living in a totally different and boring place. Innovation is also critical in the Data Science field: Data scientists transform data into actionable products/insights, and such transformation constantly requires one to innovate beyond the status quo.

Innovation is production or adoption, assimilation, and exploitation of a value-added novelty in economic and social spheres; renewal and enlargement of products, services, and markets; development of new methods of production; and the establishment of new management systems. (Wikipedia)

However, innovation does not come naturally. We all have opinions and ideas, but innovation is a “value-added novelty”, so “novelty” by itself does not equal “innovation”. So, what are the types of innovation, and how can we develop innovation as a skill?

Two types of innovation

In my opinion, there are two types of innovation:

  • Type 1. Identify a better alternative solution to an existing problem
  • Type 2. Redefine the existing problem into a more meaningful one and solve it by whatever method works

Here is one example to illustrate the concepts. Say you want a computer with more computational power, and one known bottleneck is slow CPU speed. A Type 1 innovation would be to improve CPU speed by shrinking the semiconductor fabrication process, from 800 nm to 130 nm to 14 nm, and approaching 5 nm in 2020, so that computational power increases accordingly. Type 2 innovations would be to a) enable a multi-core CPU architecture, or b) integrate a GPU as a co-processor to accelerate general-purpose scientific and engineering computing. The ultimate goal is the same, more computational power, but Type 1 innovation pushes further along an already defined path, while Type 2 innovation expands to alternative paths.

Both types of innovation create business value, but they are usually associated with different viewpoints.

  • Type 1 innovation requires strong domain expertise, so that one can push the technology boundary forward; it takes a more focused view. Top university research labs focusing on total synthesis (wiki) are a good example of such innovation.
  • Type 2 innovation requires a big-picture understanding of the related knowledge space and the mindset to connect the dots; it takes a more holistic view. Many legendary business stories, e.g. the birth of the iPhone, can be characterized by this innovation type.

Now that we have defined the innovation types, the next questions are: how can we proactively develop innovation as a skill? Is there a formula, like “A + B = Innovation”? Instead of giving my subjective answer directly, I would like to share a story first. It is about one of my favorite projects from my first job, and I think it provides much more context than a dry claim.

The Story about BuildX

Background

I started my Data Science career in Aug 2013, at a company that operates an automotive pricing and information website. In 2014, the company started a strategic initiative whose success depended on acquiring detailed configuration data for all inventory vehicles, and that data was not available. For example, given the VIN “1FATP8FF4L5106855”, we know from the VIN string pattern that it is a Ford Mustang, 2020 GT Convertible Premium. However, many options can be installed on the car, like “FORD SAFE & SMART PACKAGE”, and these could NOT be extracted from the VIN pattern itself. It was exactly this detailed vehicle option configuration that the company needed the most.

Half of the window sticker for VIN: 1FATP8FF4L5106855. Source: http://vin.maniacs.info/FordSticker.html?title=1FATP8FF4L5106855

The product team made a huge effort to acquire this information through partnerships with both third-party data providers and manufacturers (e.g. Ford, Toyota, etc.). The Data Science team was involved to evaluate data quality, assess usability, and provide insights on whether our inventories were sufficiently covered by the newly acquired data. Clearly, the Data Science team was playing a supporting role, which was quite understandable given the nature of the project. The project’s goal was to achieve data coverage above a certain threshold (e.g. 80%) so that other downstream products could be built. Unfortunately, after a few months had passed, the team started to realize that data coverage was still far below the threshold.

As a Data Scientist on the team, I learned a lot about how the business works even while playing a supporting role, and it was a great experience to collaborate with multiple teams on data evaluation and integration. However, I had a feeling that there was something more that Data Science could provide.

Redefine the problem with a dummy solution

One thought that occurred to me was: we have spent so much effort on purchasing data, is it possible to generate it in-house? Given my limited tenure (< 1 year) with the company at that point, I decided to check whether this had been considered before. It was a regular sync-up meeting; the product Director had just finished sharing the status of data acquisition and the potential need to modify our target. At the end of the meeting, I raised my question very carefully: “It might be a bit naive, but is it possible to generate the complete vehicle information in-house?” The answer was: “We thought about it, but the VIN does not contain that much information; what do you think?” Since I didn’t have a clear idea yet either, not much was discussed in the meeting. But at least I knew this was an unexplored idea, and I asked for one week to explore this direction.

One week was definitely very tight, so my first approach was to look for business logic that could provide an immediate boost to the data coverage. After some digging, I found one “solution”: if we know that a specific vehicle model has NO available options at all, then we can declare that we know the FULL configuration of this vehicle. This feels like cheating in a sense; however, from a product perspective, it is helpful to know we have the full configuration, even if that configuration is empty.

My argument: although it has nothing, now we know this fact; and that is valuable.

I shared this insight with the team the following week, and everyone was happy to see the “free” coverage boost from this simple exploration. However, we also knew this was not a universal solution: there are only so many vehicle models with no available options, and we could not expand it to all vehicles. So I asked for two extra weeks to find a more expandable solution, and it was approved: the “dummy solution” bought me more time :)

Proof of concept: a simple, but expandable, solution

Here is the fact: there is no way to identify all vehicle options purely from the VIN unless we have some extra information. So I looked deeper and found something useful: each VIN is associated with a fixed Manufacturer Suggested Retail Price (MSRP), and each vehicle model also has a fixed base-version MSRP. The difference between the “VIN MSRP” and the “base MSRP” is therefore the “total option MSRP”, and this is the extra information we have for all inventory vehicles.

Now let’s think about the following situation: you have a car whose VIN shows a 22K MSRP, and the vehicle model (e.g. Toyota Camry LE) has a 20K base MSRP, so 22K - 20K = 2K of MSRP comes from all the options added. Assume this vehicle model has 4 available options, with MSRPs of 1.5K, 1.0K, 0.5K, and 0.2K, respectively. Which option(s) are installed in the car?

It is pretty straightforward to find out the answer given this 4-available-options case: we just enumerate all possible option combinations and check whether each combination has the same total option MSRP. If it is a unique match, we can claim the option combination to be the right vehicle configuration.
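The enumeration for this 4-option case can be sketched in a few lines of Python. This is only an illustration, not the production code: the function name `matching_configurations` and the option labels "A" to "D" are made up for the example, and prices are assumed to be whole dollars so that sums compare exactly.

```python
from itertools import combinations


def matching_configurations(option_prices, total_option_msrp):
    """Return every option combination whose prices sum to the target."""
    names = list(option_prices)
    matches = []
    # Try combinations of every size, from no options to all options.
    for size in range(len(names) + 1):
        for combo in combinations(names, size):
            if sum(option_prices[name] for name in combo) == total_option_msrp:
                matches.append(combo)
    return matches


# Hypothetical labels for the four options in the example (prices in dollars).
options = {"A": 1500, "B": 1000, "C": 500, "D": 200}
matches = matching_configurations(options, 22000 - 20000)
print(matches)  # [('A', 'C')]
```

Here only the 1.5K and 0.5K options add up to 2K, so the match is unique.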

You may immediately see two challenges with this approach: 1. there may be multiple configurations matching the same MSRP, so how do we find the right one? 2. the number of option combinations grows exponentially with the number of available options, which could be a huge problem. However, I can enforce the following two rules to bypass these challenges: 1. only claim a configuration is right if it is the only one that matches the MSRP; 2. set a timeout threshold (e.g. 30 seconds per VIN) to ensure the calculation finishes on time. Now, we have a solution! This solution is very simple and expandable to all vehicle inventory. Although we didn’t fully solve the challenges above (we were just bypassing them), it is already a big step forward from our “dummy solution”.
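A minimal sketch of how the two rules could sit on top of the brute-force enumeration is shown below. The function name `unique_configuration`, its parameters, and the option labels are assumptions for illustration; the original implementation is not shown in this article. Prices are again assumed to be whole dollars.

```python
import time
from itertools import combinations


def unique_configuration(option_prices, total_option_msrp, timeout_seconds=30.0):
    """Return the single matching combination, or None if ambiguous or timed out."""
    names = list(option_prices)
    deadline = time.monotonic() + timeout_seconds
    found = None
    for size in range(len(names) + 1):
        for combo in combinations(names, size):
            # Rule 2: stop once the per-VIN time budget is exhausted.
            if time.monotonic() > deadline:
                return None
            if sum(option_prices[name] for name in combo) == total_option_msrp:
                # Rule 1: a second match means the answer is ambiguous.
                if found is not None:
                    return None
                found = combo
    return found


print(unique_configuration({"A": 1500, "B": 1000, "C": 500, "D": 200}, 2000))
# ('A', 'C')
print(unique_configuration({"A": 1000, "B": 500, "C": 500}, 1500))
# None, because both A+B and A+C match
```

Returning `None` in both failure cases keeps the rule conservative: a VIN only gets a configuration when the evidence is unambiguous.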

We quickly tested this idea and, indeed, saw a nice coverage bump; however, we were still short of the target. The main reason is that many vehicle models have between 30 and 60 available options, with extreme cases going up to 100. With 40 available options, 2⁴⁰ configurations need to be evaluated: that’s more than one trillion configurations! Many VINs hit the timeout threshold and returned no option combination at all. Now we knew that the exponentially growing computational cost was one limiting factor, and the underlying problem is essentially a subset-sum problem, which is NP-hard in general. How should we proceed?
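To get a feel for the scale, the arithmetic can be checked directly. The evaluation rate below is a purely hypothetical figure chosen for illustration, not a measurement from the project.

```python
n_options = 40
configurations = 2 ** n_options
print(f"{configurations:,}")  # 1,099,511,627,776

# Hypothetical throughput: one million configurations checked per second.
assumed_rate_per_second = 1_000_000
days = configurations / assumed_rate_per_second / 86_400
print(f"{days:.1f} days")  # 12.7 days
```

Even under that optimistic assumption, a single 40-option VIN would take days, far beyond a 30-second timeout.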

Develop an advanced algorithm for a scalable solution

When facing a steep challenge like this, I take walks along the Santa Monica beach and let the fresh air clear my thoughts. I had taken various courses in my spare time, one of which happened to be on discrete optimization, and one technique from it came to mind: constraint programming.

The high-level idea of constraint programming is that you don’t need to exhaustively search the whole configuration space: you can build constraints so that some configurations are never visited. For example, if the total option MSRP is 1K, it is impossible for a 1.5K option to be installed, so any configuration containing the 1.5K option does not need to be evaluated. On the other hand, if the total option MSRP is 10K, option “A” is 5K, and all other options’ MSRPs sum up to 8K, then you know option “A” is definitely on this vehicle, and any configuration without “A” does not have to be considered.
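The two pruning ideas above can be sketched as a small depth-first search in Python. This is a simplified backtracking illustration, not a full constraint programming solver and not the BuildX implementation; the function name `pruned_search` and the option labels are hypothetical, and prices are assumed to be positive whole dollars.

```python
def pruned_search(prices, target):
    """Depth-first include/exclude search that skips infeasible branches.

    Returns (solutions, visited), where visited counts the search nodes entered.
    """
    names = sorted(prices, key=prices.get, reverse=True)
    # suffix_sum[i] is the total price of all options from position i onward.
    suffix_sum = [0] * (len(names) + 1)
    for i in range(len(names) - 1, -1, -1):
        suffix_sum[i] = suffix_sum[i + 1] + prices[names[i]]

    solutions = []
    visited = 0

    def visit(i, remaining, chosen):
        nonlocal visited
        visited += 1
        if remaining == 0:
            # Prices are positive, so nothing more can be added.
            solutions.append(tuple(chosen))
            return
        # Prune: even taking every remaining option cannot reach the target.
        if i == len(names) or suffix_sum[i] < remaining:
            return
        price = prices[names[i]]
        # Prune: never include an option that costs more than what is left.
        if price <= remaining:
            visit(i + 1, remaining - price, chosen + [names[i]])
        visit(i + 1, remaining, chosen)

    visit(0, target, [])
    return solutions, visited


solutions, visited = pruned_search({"A": 1500, "B": 1000, "C": 500, "D": 200}, 2000)
print(solutions, visited)  # [('A', 'C')] 6
```

On the earlier 4-option example, this sketch enters 6 nodes, whereas the full include/exclude tree has 31. The second pruning rule is what forces option “A” in the 10K example: once “A” is excluded, the remaining options sum to only 8K, so that whole branch is abandoned immediately.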

The following visualization illustrates the fact that the constraint programming algorithm serves as a smart guide to search through all configurations (represented as a binary tree), by avoiding getting into nodes violating the constraints. In this example, configuration space has 2⁴ = 16 combinations, and the algorithm only requires 5 movements.

With the help of Python, I encoded the algorithm into a package (a minimum viable data product) and fed a sample of existing vehicle inventory through the flow: it worked! Now I had full confidence that we could hit the coverage target! This was big news to the team, and everyone was looking forward to seeing the algorithm in production. The project had shifted from “Business-Development-driven” to “Data-Science-driven”, and we were rewarded with full support to build the data product.

To market our effort, I named it “The BuildX Project”

Building a data product has multiple layers of complexity, and the algorithm is only one piece of that. Eventually, we delivered it. To market this data product and our effort to external teams, I gave it a name: “The BuildX Project”. The “Build” part represents the fact that we are rebuilding the vehicle configuration, while the “X” part encodes all the algorithm complexity and makes it feel mysterious and powerful. One can see more details about the architecture in the patent we filed, and the following chart shows the high-level system architecture.

Our partner teams loved the name! With their help in advocating the data product’s power, we had the code integrated with other backend systems, and eventually the team achieved the data coverage goal. This story had a happy ending.

The formula of innovation

Looking back, everything looks so well-orchestrated: the innovation happened right there! The “why” question, the simple solution, and then the magic beach walk that led to the “scalable solution”. However, at the time it happened, it felt quite unplanned: I asked a few “why” questions because I didn’t know much about the existing system; I happened to have taken the discrete optimization class a few months earlier because it had just been released on Coursera; I started taking regular beach walks because the company organized a “walking competition” (and I didn’t win …). So how did these random things, once put together, turn into a “value-added” innovation? Is there a secret formula for innovation that I somehow executed exactly?

Unfortunately, there is no magic formula for innovation, at least not a deterministic one. On the positive side, I have found several elements that help facilitate innovation: with these elements, you are more likely to bring innovation to a project. The elements are:

  • Challenge the status quo, respectfully. This usually goes well with the question “why”: regardless of how great the status quo solution is, after a few “why” questions, you will find something that could be improved.
  • Keep learning. Any knowledge outside your comfort zone is worth learning: going deep in a domain helps Type 1 innovation, and going broad across related areas helps Type 2 innovation.
  • Connect the dots. A “dot” refers to any piece of knowledge, experience, or lesson. Relax your mind and let your brain bring related dots together. Once they are connected, magic happens.

To conclude the story, I would like to share one more comment about the “nondeterministic” nature of innovation. You may ask a lot of great “why” questions, take 10+ courses related to your field, and constantly think about how to connect “dots” to a specific project. However, the project may simply have all its components running well, leaving no room for you to bring a revolution. Would that be a sad innovation story? I don’t think so. Although the outcome of successful innovation is rewarding, the path toward innovation is what matters the most. Starting to innovate is like taking a walk on the beach: you look around and suddenly notice a shiny object in the sand; you pick it up, dust it off, rub off the dirt, and find it is something unexpected. It could be a giant pearl, and if so, lucky you! It may also be a beautiful, well-shaped shell that you can use to decorate your desk. The moment of serendipitous discovery brings much more joy, and that’s the most exciting part of innovation; I hope you enjoy that.

Photo by Aaron Burden on Unsplash

*The article is also available to access on LinkedIn https://www.linkedin.com/pulse/how-innovate-datascience-pan-wu
