Learning from Machines: The Data Supply Chain

Insights from an aerospace engineer turned data scientist

Katie Lazell-Fairman
Towards Data Science

--

This is Part II of a two-part series exploring machine learning (ML) products. In Part I: What Traditional Manufacturing Can Teach Us About Data Products, I draw parallels between physical products and ML products. In this second part, I discuss how these similarities can help data scientists and data organizations approach product development and mitigate the risks associated with data dependencies by utilizing lean manufacturing and supply chain concepts.

To summarize the previous article: the performance and continued reliability of machine learning models are highly dependent on the quality of the data. In that sense, data products are analogous to physical products in specific ways. Both:

  • require a continuous supply of high-quality inputs (datasets vs physical parts) to produce/deliver consistently;
  • are at the mercy of the supply chain, especially where the product relies on data not generated internally (e.g. open source data vs purchased manufactured parts);
  • require re-designs over time to meet sustainability and/or performance needs. These re-designs often depend heavily on the supply chain (e.g. a dataset that is no longer maintained must be replaced or removed, just as a part that becomes obsolete and disappears from the market forces the product toward an alternative solution).

Like raw material or parts, datasets are tangible, critical components that need to be managed well to ensure the continued quality, reliability, and availability of the ML product. This responsibility is ultimately shared across different roles in your data organization, and each dataset calls for its own decisions and considerations.

Data Supply Chain

Definition: the lifecycle of data (selection, procurement, transfer, quality assurance, warehousing/storage, data management, transformation, monitoring, and distribution) feeding data pipelines for use in data products.

One of the most useful and intuitive concepts I learned from lean manufacturing is to reconsider your cookie-cutter processes. Being strategic at a more granular level (i.e. at the raw material or parts level) helps organizations better manage product inventory whilst reducing overhead cost. When your product is built from data, the central idea is to have a different plan for each dataset.

Plan for Every Dataset

Lean manufacturing principle: Plan for every part (PFEP)

This is about understanding how comprehensively a dataset needs to be managed, what resources it requires, and the costs and risks associated with it, then using these elements to create a plan for each dataset that reduces maintenance and management burden (happier data engineers) whilst ensuring quality.

As all data scientists, data analysts, and data engineers have experienced, not all datasets are created equal. Some have well-designed schemas while others were by-products of a process; some are more trustworthy; some have poor coverage; some are messy and require heavy cleaning; and others are not routinely updated. Some may even become unavailable in the future. In addition to this diversity, the truth is that certain datasets are more critical to ML model success than others. More planning should be dedicated to the most critical and the most difficult to handle.

The idea here is to identify the datasets required to build a performant ML model and make sure these datasets are assessed and deemed fit for use in production at the model prototyping phase. This will save a few headaches down the road. Once they’ve passed scrutiny, ensure the most critical are planned and tackled first. Each will likely require a different approach to securing the supply, quality monitoring, data management, and pipeline maintenance.
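To make the idea concrete, a plan for every dataset can be as lightweight as one registry entry per dataset. The sketch below is illustrative only; the field names and example values are hypothetical, not a prescribed schema.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class DatasetPlan:
        """One entry in a 'plan for every dataset' registry."""
        name: str
        owner: str                  # team or role responsible for the dataset
        criticality: str            # e.g. "high" if the model cannot ship without it
        source_type: str            # "internal", "open_source", "vendor", or "partner"
        refresh_cadence: str        # how often new data is expected, e.g. "daily"
        quality_checks: List[str]   # which pipeline tests apply to this dataset
        fallback_plan: str          # what to do if the supply is interrupted

    # Example: a critical, externally sourced dataset gets the fullest plan.
    transactions_plan = DatasetPlan(
        name="transactions",
        owner="data-engineering",
        criticality="high",
        source_type="vendor",
        refresh_cadence="daily",
        quality_checks=["schema", "value_ranges", "freshness"],
        fallback_plan="switch to the backup vendor and retrain the model",
    )

Even a small registry like this forces the ownership, criticality, and fallback questions to be answered before a dataset reaches production.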

So you’ve aligned with the business needs, and you’ve built a prototype model. Now you want to deploy it to production. You’ll need to consider many things to do so effectively, but depending on the answer to the following question, you may benefit from incorporating supply chain management concepts into your plan for each dataset:

Does your ML model or data product rely entirely on datasets generated within the organization?

If the answer is yes, you probably won’t benefit from some of the following supply chain concepts, since (hopefully) you can coordinate with departments to design data pipelines that fit your ML product needs. However, if you need to obtain open source/third-party data, or have difficulty reliably securing data internally, the following concepts can help minimize ML product performance risk.

Dataset Sourcing Strategy

Procurement lingo: Single source vs. multiple vendors, sole source, and make vs. buy

This concept mostly applies to securing raw or derived datasets from open source or publicly available sources, third-party vendors, or business partnerships. The plan does not need to be fully fleshed out during the prototyping phase, but it’s important to start early in the process. Once you’ve scrutinized the different datasets for quality, trustworthiness, completeness, etc., and you have a minimum viable product (MVP), creating a data sourcing plan will help make your transition to production smoother.

The most conservative strategy would be to obtain data from multiple sources. Having an alternative source is a great way to protect your product from external issues outside your control, but requiring every external dataset to have a backup source immediately available is sometimes unrealistic, costly, and often unnecessary. Conversely, if a dataset is critical but widely available, you may decide to rely on one source, with the awareness that you’ll need to switch to another provider should you lose the first (and, of course, retrain your model), as long as that period of downtime is acceptable for your users. In the manufacturing industry, this strategy is called ‘single source,’ where the benefits of working with a single supplier — reduced overhead/maintenance, cost savings, etc. — outweigh the costs of working with multiple suppliers.
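Whichever strategy you choose, it helps to record it per dataset so the trade-off is explicit rather than implicit. Here is a minimal sketch of such a record; the dataset names, sources, and fields are invented for illustration:

    # Hypothetical sourcing decisions, one entry per external dataset.
    sourcing_plan = {
        "weather": {
            "strategy": "multiple_sources",
            "primary": "open_data_portal",
            "backup": "vendor_b",  # already integrated, so switching is immediate
        },
        "transactions": {
            "strategy": "single_source",  # comparable data is widely available
            "primary": "vendor_a",
            "backup": None,  # switching would mean onboarding a new vendor and retraining
            "acceptable_downtime_days": 14,
        },
    }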

However, if a comparable alternative dataset does not exist, you may have to form a supplier partnership with your source. While reliance on unique datasets should generally be avoided, such a dataset may give you a competitive advantage that outweighs the risks. In these scenarios, you can guard against the unexpected loss of these datasets by developing a strong relationship with the source’s organization.

Where data is found to be critical but no production-ready datasets exist, you may have no option but to generate the data yourself (or sub-contract the effort). This is typically less than ideal, but there may be cases where creating a dataset in-house solves a critical data need and is worth the resource investment. This is something I ended up doing for a personal classification project, where I realized I couldn’t rely on crowdsourcing because of the specific domain knowledge required to tag labels.

Inspection — Photo by Shane Aldendorff on Unsplash

Data Quality Checks

Quality Engineer speak: Incoming inspection & subassembly testing

Once you’ve accepted a data source, and possibly an alternative source, you’ll want to continue to assess the quality of data as new data arrives and as it is aggregated and transformed throughout the pipeline. Just like physical products, it takes only one defective datapoint or a bad batch refresh to cause your product to fail. To avoid this scenario, determine the acceptable level of quality and monitor each dataset by building data pipeline tests. These will identify when data fails to meet the expected schema, data types, or value ranges. Values can be checked in a variety of creative ways, for example by applying simple in-range checks, statistical distribution tests, trend checks, or other more advanced business rules. This logic should be designed and reviewed by data scientists and analysts — those who understand the datasets and ML model dependencies well. Checking and storing logs of data quality results will help pinpoint the presence of spurious data, which is especially useful when investigating why an ML model is unexpectedly underperforming.
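As a minimal sketch of what such pipeline tests might look like, here is an example in Python/pandas. The expected schema, value ranges, and column names are hypothetical placeholders; real expectations would come from the data scientists and analysts who know the data.

    from typing import List, Optional
    import pandas as pd

    # Hypothetical expectations for an incoming batch of data.
    EXPECTED_SCHEMA = {"user_id": "int64", "event_ts": "datetime64[ns]", "amount": "float64"}
    VALUE_RANGES = {"amount": (0.0, 10_000.0)}

    def run_quality_checks(df: pd.DataFrame, reference_mean: Optional[float] = None) -> List[str]:
        """Return a list of human-readable failures; an empty list means the batch passed."""
        failures = []

        # Schema and data type checks
        for col, dtype in EXPECTED_SCHEMA.items():
            if col not in df.columns:
                failures.append(f"missing column: {col}")
            elif str(df[col].dtype) != dtype:
                failures.append(f"unexpected dtype for {col}: {df[col].dtype} (expected {dtype})")

        # Simple in-range checks
        for col, (low, high) in VALUE_RANGES.items():
            if col in df.columns and not df[col].between(low, high).all():
                failures.append(f"values out of range in {col}")

        # A crude distribution check: flag a large shift in the mean versus
        # a reference value stored from the previous accepted batch.
        if reference_mean is not None and "amount" in df.columns:
            if abs(df["amount"].mean() - reference_mean) > 0.2 * abs(reference_mean):
                failures.append("amount distribution shifted versus reference")

        return failures

    # Failures would be logged (with timestamps and dataset names) rather than
    # silently dropped, so underperformance can later be traced to a bad batch.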

It’s important to understand that automated quality checks can’t catch everything, and you may not know what to implement right away. It’s a process of trial and error — what’s relevant to check and what’s not? What should you do if a dataset is flagged? This logic needs to be revisited over time as the data changes and the product matures.

Ongoing Source Assessment

Supply chain management reference: Annual supplier review

Once you’ve built your production data pipelines, you may want to periodically check each dataset’s performance (review those data quality logs). Does the data provider keep changing the format/schema? Are there value consistency issues, and do they persist? You should also look at usage and perform periodic assessments to see if the datasets are still valuable to your product or organization. Is the supply still reliable and sustainable? Are there changes in pricing, or in the overhead costs of wrangling, processing, and storing the data? Have alternative data sources emerged that would better suit your needs or enable enhanced capability? If so, you’ll need to determine whether the data is still worth it.
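If the quality checks write their results to a log, a periodic review can be as simple as aggregating failures per source over the review window. The log format, dataset names, and source names below are invented for illustration:

    from collections import Counter
    from datetime import datetime, timedelta

    now = datetime.now()

    # Hypothetical log records written by the pipeline's quality checks:
    # (dataset_name, source, timestamp, failure_description)
    quality_log = [
        ("transactions", "vendor_a", now - timedelta(days=3), "values out of range in amount"),
        ("transactions", "vendor_a", now - timedelta(days=10), "unexpected dtype for event_ts"),
        ("weather", "open_data_portal", now - timedelta(days=7), "missing column: humidity"),
    ]

    def review_source(source, window_days=90):
        """Count failure types for one source over the review window."""
        cutoff = now - timedelta(days=window_days)
        return Counter(
            failure for _, src, ts, failure in quality_log
            if src == source and ts >= cutoff
        )

    # Repeated failures of the same type (e.g. schema changes from vendor_a)
    # are a signal to revisit that source's plan or evaluate an alternative.
    print(review_source("vendor_a"))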

In reality, ongoing source assessments will vary in depth and frequency depending on the source and how critical their datasets are to your ML products. If your business is working closely in a partnership, you routinely encounter data quality issues with a source or if you want to further extend a dataset’s functionality beyond what it currently supports, you may want to review these sources more often.

Photo by Joshua Coleman on Unsplash

To take full advantage of these concepts, as a data scientist you’ll need to continuously collaborate across the organization, from the prototyping phase through product development, release, and maintenance, and actively contribute to the product data strategy. A data supply chain cannot be built for a product and just left to run. It will continually evolve with the data and the product — the work is never quite done. There will always be emergency scenarios to resolve and tweaks to be made over time. If your ML product relies on anything outside of your control, farsighted management and planning are crucial.

In summary, every dataset used in production ML models is a core component of the product. The value, usage, availability, and reliability of each dataset should inform how your organization manages it. To proactively manage data supply risks you’ll want to assess your sources, evaluate how you secure your supply, monitor data quality, and periodically re-examine each dataset’s plan. Ultimately, it’s the responsibility of the whole organization to build and support a robust, reliable, high-quality data supply chain to ensure the continued success of ML products.

Hi, I’m Katie Lazell-Fairman. I’m a data scientist and founder of artluxe.co, based in New York City. Have questions about this post or are you curious about this topic? Comment below or feel free to contact me!
