Data analytics to solve manufacturing problems
Before transitioning into the career of Data Science (DS) and Machine Learning (ML), I spent more than a decade as a Semiconductor Technologist and designer in the tech industry of Silicon Valley. I was part of the organizations that had massive manufacturing footprints – front-end wafer fabs, back-end assembly facilities, etc.
We had great successes in launching high-performance products and developing novel technology nodes. We also had our fair share of manufacturing issues. I have always maintained that many of these problems can be better analyzed and solved using the modern tools of DS and ML.
And, of course, it is not just me. A lot of clever folks are talking about it too.
Naturally, since moving into the world of data science, I have often been asked to illustrate how common manufacturing analytics problems can be solved with quick programming tools and techniques that anybody can use. In this brief article, I showcase one specific example: the drift issue with machines and testers.
The machine/tester drift problem
Modern manufacturing plants employ enormously complex machines and equipment for production, packaging, testing, etc. Although designed for high quality and stability, no machine is immune to drift and variance. Over time, sub-components start to behave slightly differently than they did at installation, and the overall characteristics of the machine drift.

Detecting this drift is important for many reasons:
- to identify which machine (among a set of them) may require repair and maintenance
- to correct/reset the drift by software or hardware intervention
- to correlate product quality with the machine drift
- to investigate yield issues or reliability failures that might have been caused by the machine drift and increased variance

Individual drift detection
Detecting drift and change for an individual piece of equipment is a common manufacturing problem. This can be handled using various approaches, especially if test or sensor data is also available. A few approaches are:
- looking for and counting anomalies in the process – if their density exceeds a threshold, then the machine might need repair/maintenance
- monitoring the statistical distribution (not necessarily Normal/Gaussian) of the data associated with a machine and detecting when that changes
- performing some sort of simple statistical tests (e.g. comparing means to a golden standard machine) every day/week and detecting if there is a significant change
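As a rough illustration of the last approach, here is a minimal sketch of a periodic mean comparison against a golden machine. The synthetic readings, the function name mean_shift_detected, and the 0.01 significance level are my own assumptions for the example; Welch's t-test (scipy.stats.ttest_ind with equal_var=False) is just one reasonable choice of test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical readings of a single sensor: a golden reference and two machines.
golden = rng.normal(loc=10.0, scale=1.0, size=200)
stable = rng.normal(loc=10.0, scale=1.0, size=200)   # same distribution as golden
drifted = rng.normal(loc=11.0, scale=1.0, size=200)  # mean shifted by one unit

def mean_shift_detected(reference, sample, alpha=0.01):
    """Return True when Welch's t-test rejects equal means at level alpha."""
    _, p_value = stats.ttest_ind(reference, sample, equal_var=False)
    return bool(p_value < alpha)

print("stable machine flagged: ", mean_shift_detected(golden, stable))
print("drifted machine flagged:", mean_shift_detected(golden, drifted))
```

In practice, a check like this would run on each day's or week's batch, and the significance level would be tuned to how costly a false alarm is.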
Which machines are falling out most?
Often, the most prudent task is to identify the machines (among a set of many identical machines) that are drifting most compared to a golden standard. Ideally, we want to rank them by assigning some sort of numerical score.
This is often practiced in real-life scenarios. Why? Because previous experience and data analytics might have shown that a small drift does not impact the product quality or yield, and only the machines which have drifted most should be stopped for intervention and maintenance.
Business managers and manufacturing VPs will tell you that intervention costs money and should only be done when unavoidable. Your data analytics, they expect, will guide them on the most pressing problems, and not just throw general statistical analyses and charts at them.
Not a trivial data problem
With all the hype around Industry 4.0, Digital Transformation, and Smart Manufacturing, you will be surprised to know that the most significant hindrance in the path of such transformative changes is still the basic data collection and ingestion.
Machines move, operations are executed, products are generated. But, most often, these events are not digitally recorded. If they are not properly recorded (just in time), the associated data is lost to the Universe, contributing to the ever-increasing Entropy. Lost data is of no use to any subsequent data-driven analytics pipeline.
This is a whole new discussion, and we will set it aside for a future article. Let’s assume we can capture sensor/process data associated with a set of machines as they operate. Moreover, let’s assume this data is well-structured and clean. Even then, the solution to the drift detection problem is not trivial.
Let’s say,
- we collect 200 samples’ measurements per machine
- there are 10 process measurements (or sensors) – A, B, C, etc.
- there are 10 machines – M1, M2, M3, etc.
- there is, of course, a golden machine, with which we want to benchmark all these 10 machines and detect drift
So, we have a nice table of 20,000 data points (plus the golden machine data). But the measurements differ in nature and do not correlate with one another. Instead, the measurement from sensor ‘A’ in machine ‘M1’ correlates with the measurement from sensor ‘A’ in machine ‘M2’ or in machine ‘M3’. The following figure illustrates the idea.


Not a trivial cognitive load either
Analytics dashboards are extremely popular in the manufacturing industry, especially among plant managers. They have their utility – in many situations – portraying a holistic picture of the health of the production process.
However, in this particular case, if you try to visualize every sensor measurement and find out the degree of correlation, you will encounter a maze of plots and nothing useful.

Brute-force visualizations are counter-productive in this case.
And imagine what would happen when 10 more machines roll into your factory, or 10 more sensors are added.
At this point, you are starting to realize that more data is actually not helping. It is not a clean-cut ML problem where you have a labeled dataset of "machine drifts". Here, you have to carefully construct statistical tests and measures to detect drift and rank machines based on that measurement.
Again, like any other data-driven problem-solving case, you have options.
- you can extract descriptive statistics from each sensor data column and pairwise compare them to those from the other machines
- you can calculate a global ‘distance score’ for each machine from the golden machine (the SciPy package provides a list of distance metrics that you can evaluate)
- you can compute sophisticated distance metrics between individual data distributions (e.g., Mahalanobis distance) and rank based on them
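To make the distance-score option a little more concrete, here is a minimal sketch that ranks machines by the Mahalanobis distance of their mean sensor vector from the golden machine. The function name mahalanobis_from_golden, the sensor labels, and the two synthetic machines are hypothetical, and comparing only the mean vectors is a simplification I chose for brevity, not the only way to define such a score.

```python
import numpy as np
import pandas as pd

def mahalanobis_from_golden(golden: pd.DataFrame, machine: pd.DataFrame) -> float:
    """Distance of a machine's mean sensor vector from the golden mean,
    scaled by the covariance of the golden machine's sensors."""
    diff = (machine.mean() - golden.mean()).to_numpy()
    cov = np.cov(golden.to_numpy(), rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates a near-singular covariance
    return float(np.sqrt(diff @ cov_inv @ diff))

rng = np.random.default_rng(1)
sensors = list("ABCDEFGHIJ")
golden = pd.DataFrame(rng.normal(size=(200, 10)), columns=sensors)

# Two hypothetical machines: one barely shifted, one clearly shifted.
machines = {
    "machine_small_drift": golden + 0.1,
    "machine_large_drift": golden + 1.0,
}

scores = {name: mahalanobis_from_golden(golden, df) for name, df in machines.items()}
ranking = sorted(scores, key=scores.get, reverse=True)  # largest distance first
print(ranking)
```

A larger score means the machine sits farther from the golden machine, so the top of the ranking is where you would look first.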
To justify calling this article a hands-on tutorial, let me show you a very simple method to extract pairwise correlation scores (i.e., matching sensor ‘A’ data of the golden machine to the sensor ‘A’ data of M1, M2, M3, and so on, then doing the same for sensor ‘B’, and so on).
A simple demo
The boilerplate code can be found in my GitHub repo.
There is one golden DataFrame, machine_golden, and a dictionary named machines that holds 10 more DataFrames, one per machine. Each DataFrame has 200 rows (samples) and 10 columns (sensor measurements).
We construct these DataFrames with synthetic data representing the datasets collected from various machines. Basically, we add variable Gaussian noise to the golden dataset to generate a dataset for each machine. Some machines will drift apart from the golden data more than others because the noise varies (the Gaussian mean and variance differ by machine).
Here is the sample code to do this:
import numpy as np
import pandas as pd

# machine_golden is the golden DataFrame (200 rows x 10 columns)
machines = {}
for i in range(1, 11):
    # Each machine gets its own noise mean (loc) and spread (scale)
    loc = np.random.uniform(0, 2)
    scale = np.random.uniform(0, 2)
    noise = pd.DataFrame(np.random.normal(loc=loc, scale=scale, size=(200, 10)))
    machines['machine' + str(i)] = machine_golden + noise
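The snippet above uses machine_golden without showing how it is built (that part lives in the repo). For readers who want something runnable right away, here is a self-contained sketch with a stand-in golden dataset. The N(5, 1) distribution and the fixed seed are my own assumptions, not the values used in the repo. Note that the stand-in keeps the default integer column labels: Pandas aligns DataFrames on column labels when adding them, so the noise DataFrame must carry the same labels as the golden one.

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # fixed seed so the synthetic data is repeatable

# Hypothetical golden dataset: 200 samples x 10 sensors, default integer column labels.
machine_golden = pd.DataFrame(np.random.normal(loc=5.0, scale=1.0, size=(200, 10)))

machines = {}
for i in range(1, 11):
    # Each machine gets its own noise mean (loc) and spread (scale).
    loc = np.random.uniform(0, 2)
    scale = np.random.uniform(0, 2)
    noise = pd.DataFrame(np.random.normal(loc=loc, scale=scale, size=(200, 10)))
    machines[f"machine{i}"] = machine_golden + noise

print(len(machines), machines["machine1"].shape)
```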
Now, off to the correlation analysis.
We can write manual code by slicing each column of data and using a correlation routine from NumPy to calculate the correlation scores. However, there is a much better and cleaner method available with Pandas. It’s a method called [DataFrame.corrwith](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corrwith.html) that calculates Pearson correlation scores pairwise between the matching rows or columns of two DataFrames with just one line of code.
This is my tip for efficient data analytics. Look for compact and cleaner code alternatives. Often, they are perfectly suitable for your problem, and they yield a codebase that is clean and easy to debug. That leads to Productive Data Science.
The one-line code, which also prints out the correlation score for each machine, is:
for machine in machines:
    print(f"Correlation of {machine} with the golden tester:", round(machine_golden.corrwith(machines[machine], axis=1).sum(), 2))
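Here, axis=1 tells corrwith to compute one Pearson correlation per row, that is, per sample, across the 10 sensor values of the golden machine and the machine under test. We then add up those 200 per-sample scores with .sum(), so a machine that tracks the golden machine perfectly would score 200. (We could have taken the mean too, but since the number of samples is the same for every machine, the ranking would not change.)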
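If the axis argument feels confusing, a quick experiment helps. The sketch below uses made-up data (not the repo's datasets) and compares the two settings: axis=0 returns one score per sensor column, while axis=1 returns one score per sample row.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
golden = pd.DataFrame(rng.normal(size=(200, 10)))
machine = golden + pd.DataFrame(rng.normal(scale=0.5, size=(200, 10)))

per_sensor = golden.corrwith(machine, axis=0)  # one score per column (sensor)
per_sample = golden.corrwith(machine, axis=1)  # one score per row (sample)

print(per_sensor.shape)  # (10,)
print(per_sample.shape)  # (200,)
print(round(per_sample.sum(), 2), "out of a maximum of", len(per_sample))
```

If you want a score for each sensor separately (sensor ‘A’ against sensor ‘A’, and so on), axis=0 is the setting to use, and the summed score then has a maximum of 10 instead of 200.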

The result is something like:
Correlation of machine1 with the golden tester: 130.67
Correlation of machine2 with the golden tester: 91.78
Correlation of machine3 with the golden tester: 116.57
Correlation of machine4 with the golden tester: 178.85
Correlation of machine5 with the golden tester: 147.76
Correlation of machine6 with the golden tester: 150.91
Correlation of machine7 with the golden tester: 199.94
Correlation of machine8 with the golden tester: 192.48
Correlation of machine9 with the golden tester: 199.73
Correlation of machine10 with the golden tester: 97.73
Clearly, machine2 is the least correlated with the golden machine, and machines 7, 8, and 9 are the most correlated.
Visual proof? We can plot some of the sensors’ data from the golden machine against the same sensors’ data from machine2 and machine7.


But we already knew that from the correlation analysis and can pick out machines like machine2 and machine10, which show correlation scores below 100 (just an arbitrary threshold), for intervention.
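Picking those machines out need not be done by eye. Here is a small sketch that ranks the machines and flags the ones under the threshold, using the scores from the printed output above. The names scores, THRESHOLD, and flagged are mine, and the cut-off of 100 remains as arbitrary as before.

```python
import pandas as pd

# Scores copied from the printed output above.
scores = pd.Series({
    "machine1": 130.67, "machine2": 91.78, "machine3": 116.57,
    "machine4": 178.85, "machine5": 147.76, "machine6": 150.91,
    "machine7": 199.94, "machine8": 192.48, "machine9": 199.73,
    "machine10": 97.73,
})

THRESHOLD = 100  # arbitrary cut-off, as in the text

ranked = scores.sort_values()  # most drifted (lowest score) first
flagged = ranked[ranked < THRESHOLD].index.tolist()
print(flagged)  # ['machine2', 'machine10']
```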

The best part is that this way of picking out drifting machines scales smoothly with sampling frequency, number of sensors, or number of machines. There is no cognitive load associated with looking at every plot and deciding where to intervene.
Next week, the drift pattern may change and another machine may start drifting more than the others. We will catch it nonetheless using the same analytics approach. Moreover, we can tweak this approach to use mean correlation so that it can handle the situation where the number of samples is different for different machines (a fairly common scenario in production).
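The mean-based tweak could look something like the sketch below. The data and machine names are made up, and I am assuming that rows with the same index label refer to the same sample on every machine, which may not hold for your production data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
golden = pd.DataFrame(rng.normal(size=(200, 10)))

# Hypothetical machines with different numbers of recorded samples.
machines = {
    "machine_a": golden.iloc[:200] + rng.normal(scale=0.2, size=(200, 10)),
    "machine_b": golden.iloc[:120] + rng.normal(scale=0.2, size=(120, 10)),
}

mean_scores = {}
for name, df in machines.items():
    # Correlate only the rows both DataFrames share, then average the per-row scores.
    common = golden.index.intersection(df.index)
    mean_scores[name] = golden.loc[common].corrwith(df.loc[common], axis=1).mean()
    print(f"Mean correlation of {name} with the golden tester: {mean_scores[name]:.2f}")
```

Because the mean always lies between -1 and 1, the scores stay comparable across machines no matter how many samples each one contributed.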

Summary
In this article, we show how to handle a typical manufacturing data analytics problem, machine/tester drift detection and benchmarking, using very simple Python analytics tools.
The idea is just to show the possibilities so that engineers working in the manufacturing sector or on Industry 4.0 initiatives can think outside the box and embrace data science tools and techniques in their analytics work.