This article is derived from my bachelor’s thesis in Industrial Engineering, titled "Production data analysis with Data Science techniques". The thesis was developed in a real industrial setting, so I could work with real production data and build a case study around it.
INTRODUCTION
I studied a product that is manufactured in three phases: two assembly phases and a system operating test phase (the functional test). The product has been in mass production for three years, so the operators are very familiar with the work. My analysis focused on the process lead time (*) of the assembly line and on the cycle time (**) of the two assembly phases. I did not consider the cycle time of the test phase, because some changes were made to it in the last few months and the data are still somewhat dirty.
THE STATISTICAL FOUNDATIONS OF THIS STUDY
In this case study, I used regression models to study the dataset, in particular linear regression, polynomial regression, and spline regression. The idea was to split the cycle times of both assembly phases by operator, so that the regression methods could help me find:
- the fastest operators for each assembly phase
- the constant operators for each assembly phase
How can we tell whether an operator is constant in his work? With the observations on the x-axis and the cycle time on the y-axis, a constant cycle time appears in the chart as a horizontal line; this is where the regressions, especially the linear ones, helped me.
DATA ANALYSIS
Now let’s look at how the data analysis was carried out. Before going on, I have to say that the whole analysis was done in R.
PROCESS LEAD TIME ANALYSIS
First of all, I studied the process lead time of the assembly line. I imported the data into R using the "readxl" library (its read_excel function). A scatter plot with the observations on the x-axis and the process lead times on the y-axis gave me this result:

As we can see, the data are widely dispersed; in fact, I had to use a double logarithmic scale to fit the chart in a readable size. So, the first thing I had to face in this study was data cleaning. I asked myself: which values of the process lead time are too low? Which are too high? Which criterion could I use to exclude them?
I decided to look at the distribution of the data using Gauss’ normal function, and this is the result:

As we can see, this is a bell-shaped curve with a right skew. At this point, I calculated the mean (m) and the standard deviation (s) and kept only the data within the (m-s; m+s) range. After the data cleaning, I drew a scatter plot of the process lead time with a linear regression line, and this is the result:

I found a line with a negative slope. What does it mean? It means that as the observations increase, the process lead time decreases. But why does this happen? There are two possible explanations:
- the operators have worked on this system many times and have gained expertise
- an external modification has reduced the process lead time
Since the system has been manufactured for three years, as I said at the beginning of this article, the operators were already experienced and a learning effect is unlikely; the explanation for the decrease is therefore the second one.
As I said before, some changes in the last months led me to leave the functional test cycle time out of the study; these changes affect the process lead time too, reducing it.
So this is the first result: the process lead time is decreasing, as shown by the negative slope of the linear regression line.
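As a rough illustration, the cleaning step and the trend line described in this section could be sketched in R as follows. The data are synthetic (the real dataset is not shown in this article), and the names lead_time, obs, and clean are placeholders chosen for the example.

```r
# Synthetic stand-in for the real lead times (assumption: right-skewed data).
set.seed(1)
lead_time <- rlnorm(500, meanlog = 3, sdlog = 0.6)
obs <- seq_along(lead_time)            # observation index, used as the x-axis

# Keep only the values inside the (m - s; m + s) range.
m <- mean(lead_time)
s <- sd(lead_time)
keep <- lead_time > m - s & lead_time < m + s
clean <- data.frame(obs = obs[keep], lead_time = lead_time[keep])

# Linear regression of the lead time on the observation index.
fit <- lm(lead_time ~ obs, data = clean)
slope <- unname(coef(fit)["obs"])      # a negative value means a decreasing lead time

# To draw the chart:
# plot(clean$obs, clean$lead_time); abline(fit, col = "red")
```

With real data, the sign of slope is what tells whether the lead time is increasing or decreasing over the observations.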
FIRST ASSEMBLY PHASE: CYCLE TIME ANALYSIS
First of all, I have to say that this assembly phase is a "standard phase". By this I mean that the operations involved are typical of an industrial assembly line: screwing, mechanical and physical assembly, and so on. In a phase like this, the cycle time depends on how familiar the operators are with the phase (they have done it hundreds of times), not on their manual skills. The second assembly phase, discussed later in this article, is a non-standard one, and the contrast will make the distinction clearer.
I identified the operators who contributed at least 400 observations each; there are four of them: operators A, D, N, and O. Then I cleaned the data with the same method used in the process lead time analysis, so I won’t go through it again. So, let’s see the data analysis for the first assembly phase.
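As a side note, selecting the operators with at least 400 observations can be sketched in R like this. The data frame cycles, its columns, and the operator labels op1 to op5 are invented for the example; only the 400-observation threshold comes from the study.

```r
# Hypothetical data: one row per recorded cycle, with the operator who did it.
set.seed(2)
cycles <- data.frame(
  operator   = rep(c("op1", "op2", "op3", "op4", "op5"),
                   times = c(600, 550, 480, 420, 50)),
  cycle_time = rnorm(2100, mean = 30, sd = 3)
)

# Count the observations per operator and keep those with at least 400.
counts   <- table(cycles$operator)
frequent <- names(counts)[counts >= 400]
selected <- cycles[cycles$operator %in% frequent, ]
```

Here op5 is dropped because it has only 50 observations, while the other four operators are kept for the analysis.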
THE LINEAR REGRESSION MODEL
The first thing I did was fit a linear regression for each of the four operators, and this was the result:

As we can see from the graphs, the linear regressions appear to be horizontal lines, except for operator A, whose line has a slight positive slope. Of course, this analysis alone is not sufficient to conclude anything. In fact, and this is the point of this study, a horizontal regression line has a clear statistical meaning: there is no linear relationship between the two variables; in other words, there is no connection between the observations and the cycle time.
Mathematically speaking, we could stop here, say "there is no regression", and maybe interrupt the study. But we are in an engineering field, so I asked myself: "what does it mean if there is no connection between the observations and the cycle time?". The answer is pretty simple: for any observation, the mean cycle time is the same; in other words, the operators are consistent in their cycle time, and this is a great result!
Anyway, the study isn’t finished yet. I wanted to dig deeper into the regression methods to show that those horizontal lines are truly horizontal. How? First of all, I made a residual analysis, from which the "Multiple R-squared" values were all close to zero. This happens in two cases:
1) the linear regression model is not the best method to describe the data
2) the regression line is really horizontal
Furthermore, from the residual analysis I could find the cycle time of each operator: R gives us the value of the intercept and, since the line is horizontal, this value is the actual (mean) cycle time. At this point, I decided to go deeper and analyze the data with the polynomial regression model.
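To make this step concrete, here is a minimal sketch of how these quantities can be read from a linear model in R. The operator data are simulated with no trend (an assumption made for the example), and the variable names are placeholders.

```r
# Hypothetical operator with a constant cycle time around 30 (no trend).
set.seed(3)
n <- 400
obs <- seq_len(n)
cycle_time <- rnorm(n, mean = 30, sd = 2)

fit <- lm(cycle_time ~ obs)
fit_summary <- summary(fit)

r_squared <- fit_summary$r.squared               # printed by R as "Multiple R-squared"
slope     <- unname(coef(fit)["obs"])
intercept <- unname(coef(fit)["(Intercept)"])

# p-value of the slope: a large value means no evidence of a trend.
slope_p <- fit_summary$coefficients["obs", "Pr(>|t|)"]
```

When the slope is practically zero, the intercept is close to mean(cycle_time), which is why it can be read as the operator’s cycle time. Looking at the slope and its p-value is an extra check added in this sketch, not something taken from the original analysis.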
THE POLYNOMIAL REGRESSION MODEL
The first question I asked myself was: which polynomial degree best fits the data? And how do I find it?
There are different ways to answer these questions:
- iterating the residual analysis with an increasing polynomial degree
- iterating graphically with an increasing polynomial degree
There is no right or wrong in choosing one way or the other: in any case, it is a matter of iterating and deciding when to stop. But when do we stop?
For the residual analysis, we can refer to the "adjusted R-squared" value and study it at each iteration: it should increase as the degree grows and, at a certain point, stop increasing or start to decrease. When the change becomes close to zero or negative, the previous degree is a good candidate for the polynomial that best fits the data.
For the graphical iteration, instead, we fit a polynomial regression with an increasing degree at each iteration and look at which curve best fits the data; beyond a certain degree, the curves all look the same, and we choose the lowest degree among these similar curves.
In this case study, I used both methods, just to get some practice.
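The iteration on the adjusted R-squared could be sketched as follows. The data are simulated with a quadratic shape, and the stopping tolerance of 0.005 is an arbitrary value chosen for the example, not a value from the study.

```r
# Hypothetical operator whose cycle time follows a quadratic curve plus noise.
set.seed(4)
n <- 400
obs <- seq_len(n)
x <- obs / n
cycle_time <- 30 + 40 * (x - 0.5)^2 + rnorm(n, sd = 1)

# Fit polynomials of increasing degree and store the adjusted R-squared.
degrees <- 1:8
adj_r2 <- sapply(degrees, function(d) {
  fit <- lm(cycle_time ~ poly(obs, d))
  summary(fit)$adj.r.squared
})

# Gain obtained by moving from each degree to the next one.
gain <- diff(adj_r2)

# Stop at the first degree whose successor adds less than the tolerance.
tolerance <- 0.005
best_degree <- degrees[which(gain < tolerance)[1]]
```

For the graphical version of the same iteration, the fitted values of each model can be drawn over the scatter plot, for example with lines(obs, fitted(fit)).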
In the picture below, we can see the results of the iterations for the four operators:

Looking at the graphs, I can say that for operators N and O the curve is practically a line: it shows very small fluctuations, and only at the borders of the domain. The curve for operator D is less clear-cut, but in this case too the fluctuations are at the borders of the domain. For operator A, instead, there is no doubt: the best approximation of the data is a curve (of 7th degree).
Summarising: the best curve for operator A’s data seems to be a 7th-degree polynomial, and the best curve for operators N and O seems to be a horizontal line. For operator D there are some doubts: it could even be a horizontal line. To be more confident about these results, I used the spline regression method.
SPLINE REGRESSION MODEL
For simplicity, I used a 3rd-degree (cubic) spline to study the data. I placed the spline knots at 25%, 50%, and 75% of each domain along the x-axis, which is quite a standard choice.
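A cubic spline regression with knots at the quartiles can be sketched in R with the bs() function of the splines package, which ships with R. The data below are simulated with an oscillation, and the anova() comparison against a straight line is an extra check added for this example, not part of the original analysis.

```r
library(splines)   # provides bs() for B-spline bases

# Hypothetical operator whose cycle time oscillates over the observations.
set.seed(5)
n <- 400
obs <- seq_len(n)
cycle_time <- 30 + 3 * sin(2 * pi * obs / n) + rnorm(n, sd = 1)

# Knots at 25%, 50%, and 75% of the x-axis domain.
knots <- quantile(obs, probs = c(0.25, 0.50, 0.75))

# Cubic (3rd-degree) spline regression and, for comparison, a straight line.
spline_fit <- lm(cycle_time ~ bs(obs, knots = knots, degree = 3))
line_fit   <- lm(cycle_time ~ obs)

# F-test: does the spline explain significantly more than the line?
comparison <- anova(line_fit, spline_fit)
p_value <- comparison[["Pr(>F)"]][2]
```

A small p_value suggests that the spline describes the data better than a line; a large one suggests that the line is enough.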
This is operator A’s spline:

For the other three operators, instead, when I tried to fit a spline regression, R gave me an error. I took this as a good result: it means that no spline could be fitted to the data, and thus the curve that best fits the data is indeed a (horizontal) line.
This conclusion is the natural completion of the study for this assembly phase because, for these three operators, the polynomial regression left some doubts about the curve that best fits the data. The splines have dispelled those doubts: since there is no spline, the curve is really a line for each of them.
SECOND ASSEMBLY PHASE: CYCLE TIME ANALYSIS
The methodology applied to the second assembly phase is the same as before, so I’ll go straight to the results; but first, there are a couple of things to say.
This assembly phase is a non-standard one: the operators manually apply a gasket to the whole external surface of the product. It is easy to see how this phase depends on the operators’ manual skills.
So, with the first phase study in mind, the expectation here is that most operators show oscillations in their cycle time.
This means that, for most of them, the best model to describe the data would be the spline one.
In fact, I found that two operators out of four show oscillations in their cycle time, and these are their splines:


For the other two operators, the model that best describes the data is a horizontal line, meaning their cycle time remains constant.
CONCLUSIONS
To conclude, I want to highlight the importance of the method used in this study and discuss some of its possible applications.
THE METHOD: AN ANOMALY DETECTION ON THE PROCESS
The importance of this case study lies in the method, which is an anomaly detection on the process. For each operator, I first went through the linear regression, finding the cycle time; then I went through the polynomial and spline regressions to verify whether the curve that fits the data is a line or an n-degree curve. Where the curve is a horizontal line, we can accept, since we are in the engineering field, that the residual analysis gives values near 0, meaning the line is horizontal within a certain tolerance (it could have a small positive or negative slope).
So, this method allowed me to find the fastest operators and the ones who are the most constant in their work, that is, the ones whose cycle time follows a horizontal line.
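As a sketch of how these two findings could be put side by side, the snippet below builds a small per-operator table with the mean cycle time and the slope of the linear regression. Everything in it is hypothetical: the operator labels op1 to op4, the simulated cycle times, and the helper functions make_operator and summarise_operator are invented for illustration.

```r
# Hypothetical data: op1 is fast but drifts upward, the others are constant.
set.seed(6)
n <- 400
make_operator <- function(name, mean_time, drift) {
  obs <- seq_len(n)
  data.frame(operator = name, obs = obs,
             cycle_time = mean_time + drift * obs + rnorm(n, sd = 1.5))
}
cycles <- rbind(
  make_operator("op1", 25, 0.01),
  make_operator("op2", 31, 0),
  make_operator("op3", 30, 0),
  make_operator("op4", 33, 0)
)

# One row per operator: mean cycle time, slope, and R-squared of the line.
summarise_operator <- function(d) {
  fit <- lm(cycle_time ~ obs, data = d)
  data.frame(
    operator        = d$operator[1],
    mean_cycle_time = mean(d$cycle_time),
    slope           = unname(coef(fit)["obs"]),
    r_squared       = summary(fit)$r.squared
  )
}
per_operator <- do.call(rbind, lapply(split(cycles, cycles$operator), summarise_operator))

# Fastest operator, and the operator whose cycle time drifts the most.
fastest       <- per_operator$operator[which.min(per_operator$mean_cycle_time)]
least_constant <- per_operator$operator[which.max(abs(per_operator$slope))]
```

In this invented example the same operator is both the fastest and the least constant, which shows why the two criteria are worth looking at separately.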
POSSIBLE APPLICATIONS OF THIS STUDY
This method, combined with the operators’ skill matrix, can be useful to:
1) find the fastest and most constant operators for each product, based on data
2) choose stand-in operators, based on data
3) study the production times without being physically present in the production environment, with a couple of benefits:
3a) access to the whole available historical data, instead of making assumptions on a small part of it (the data recorded while physically present in the production environment)
3b) data recorded while someone is present in the production environment may not be 100% clean, because the operators can feel some emotional involvement when they are being "watched" during their work.
So, the data recorded while physically present in a production environment give just a partial picture of the real situation, and this picture can even be wrong because of dirty data.
— – – –
I’d like to publicly thank Professor Carlo Drago for supervising this study.
— – – –
(*) by process lead time I mean the total time necessary to entirely manufacture a certain product. It is measured from when the product "enters" the assembly line until it "exits" it.
(**) by cycle time I mean the time necessary to entirely complete a manufacturing phase.