Exploring NASA's turbofan dataset
<Disclaimer: I aim to showcase the effect of different methods and choices made during model development. These effects are often shown using the test set, something which is considered (very) bad practice but helps for educational purposes.>
Welcome to another installment of the ‘Exploring NASA’s turbofan dataset’ series. In our last analysis we trained a Random Forest and set up hyperparameter tuning for FD003.
Today we’ll develop a model for dataset FD002, in which the engines can operate on six unique operating conditions. This turned out to be much more challenging to deal with than I had expected, which is why I chose to write about this dataset after FD003: in my opinion, changing the order gives a more gradual increase in complexity. More complex datasets often require more complex solutions, so this will be the first article in the series in which we’ll train and tune a Neural Network (NN). It’s also the longest article yet because there’s just so much to discuss, so let’s get started!

Loading data
First, we’ll import the required libraries.
Next, we’ll load the data and inspect the first few rows.
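A minimal sketch of what the loading could look like; the file names and the column layout (unit_nr, time_cycles, setting_1 to setting_3 and s_1 to s_21) are assumptions based on the standard CMAPSS text files.

```python
import pandas as pd
import numpy as np

# assumed column layout of the CMAPSS text files
index_names = ['unit_nr', 'time_cycles']
setting_names = ['setting_1', 'setting_2', 'setting_3']
sensor_names = ['s_{}'.format(i) for i in range(1, 22)]
col_names = index_names + setting_names + sensor_names

# the files are space-separated and have no header row
train = pd.read_csv('train_FD002.txt', sep=r'\s+', header=None, names=col_names)
test = pd.read_csv('test_FD002.txt', sep=r'\s+', header=None, names=col_names)
y_test = pd.read_csv('RUL_FD002.txt', sep=r'\s+', header=None, names=['RUL'])

train.head()
```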

Looks good, let’s compute the Remaining Useful Life (RUL).
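As in previous parts of the series, the RUL of a training record is simply the engine's last recorded cycle minus the current cycle; a small sketch:

```python
# compute RUL per engine: last cycle of the unit minus the current cycle
def add_remaining_useful_life(df):
    max_cycle = df.groupby('unit_nr')['time_cycles'].transform('max')
    return df.assign(RUL=max_cycle - df['time_cycles'])

train = add_remaining_useful_life(train)
train[['unit_nr', 'time_cycles', 'RUL']].head()
```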

Exploratory Data Analysis (EDA)
Time to start exploring our data. I'll skip the descriptive statistics for now, as there isn't much of interest to be found. We will, however, verify the presence of the six unique operating conditions.
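A quick way to do this is to count the unique combinations of the three settings. In the sketch below, the amount of rounding applied is an assumption to collapse recording noise around the nominal set points.

```python
# count how often each unique combination of settings occurs
settings = train[setting_names].copy()
settings['setting_1'] = settings['setting_1'].round()            # rounding is assumed
settings['setting_2'] = settings['setting_2'].round(decimals=2)  # to remove recording noise
print(settings.groupby(setting_names).size())
```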

As you can see, the six unique operating conditions are present. One operating condition occurs more frequently, but otherwise it’s pretty balanced.
Plotting
Previously I’ve plotted all sensor signals of multiple engines. Unfortunately, the current signals are rather incomprehensible. To get a clearer understanding of what’s going on I chose to plot just one sensor of one unit.
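For example, plotting sensor s_2 of unit 1 (both arbitrary choices) could look like this:

```python
import matplotlib.pyplot as plt

# raw signal of a single sensor for a single engine
unit = train[train['unit_nr'] == 1]
plt.figure(figsize=(13, 4))
plt.plot(unit['time_cycles'], unit['s_2'])
plt.xlabel('time_cycles')
plt.ylabel('s_2')
plt.show()
```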

The signal is jumping all over the place and there is no trend to be seen like in the previous datasets. This dataset does include different operating conditions for the engines, let’s look at those next.

Setting 1 and 2 look quite similar, so I left the graph of setting 2 out. Setting 3 behaves a bit differently (see image below).

That explains a lot. I had expected different engines to run on different operating conditions, but apparently the operating conditions change between cycles, which makes analysing and predicting RUL much more complex. There is no use in plotting more signals for now, so let's continue with the baseline model.
Baseline model
The baseline model will be a simple linear regression without any feature engineering or selection. The only thing we’ll do is update our belief of the RUL for the training set, as explained in a previous post [4]. Essentially, instead of letting the computed RUL decline linearly, we clip it at an upper limit of 125. Clipping the RUL is a proven method that generates much better results compared to the naïve linearly declining RUL [5].
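A sketch of what this baseline could look like, assuming we train on the sensor columns only and, as in earlier parts of the series, evaluate the test set on the last recorded cycle of each engine:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(y_true, y_hat, label='test'):
    # report RMSE and R2, matching the outputs shown below
    rmse = np.sqrt(mean_squared_error(y_true, y_hat))
    print('{} set RMSE:{}, R2:{}'.format(label, rmse, r2_score(y_true, y_hat)))

X_train = train[sensor_names]
y_train = train['RUL'].clip(upper=125)                   # clip the computed RUL at 125
X_test = test.groupby('unit_nr').last()[sensor_names]    # last cycle of each test engine

lm = LinearRegression()
lm.fit(X_train, y_train)
evaluate(y_train, lm.predict(X_train), 'train')
evaluate(y_test, lm.predict(X_test), 'test')
```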
# returns:
# train set RMSE:21.94192340996904, R2:0.7226666213449306
# test set RMSE:32.64244063056574, R2:0.6315800619887212
Our baseline model has a training RMSE of 21.94 and a test RMSE of 32.64, which will be the score to beat. Let’s continue preparing the first implementation of our NN.
Validation set
Neural networks are quite prone to overfitting. To get a better idea of model performance and overfitting before running it on the test set, it’s absolutely necessary to use a validation set.
Like last time, we’ll split the data in such a way that all records of a single engine are assigned to either the training set or the validation set, to prevent ‘data leakage’. For this purpose we’ll use sklearn's GroupShuffleSplit, where our groups for splitting consist of the unit numbers.
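A sketch of the split, with an 80/20 train-validation ratio as an assumption:

```python
from sklearn.model_selection import GroupShuffleSplit

# keep all records of an engine together by grouping on unit_nr
gss = GroupShuffleSplit(n_splits=1, train_size=0.80, random_state=42)
train_idx, val_idx = next(gss.split(train, groups=train['unit_nr']))

df_train = train.iloc[train_idx]
df_val = train.iloc[val_idx]
y_train = df_train['RUL'].clip(upper=125)  # clipped RUL targets, as introduced for the baseline
y_val = df_val['RUL'].clip(upper=125)
```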

The validation set should have a similar distribution in order to compare model performance between the train and validation set. Let’s check the target variable distribution to get a grasp of the similarity of the datasets.

Although the frequency is lower, the distribution of validation RUL looks similar to the training RUL.
Scaling
As with the Support Vector Regression implementation, NNs tend to use relative distances between data points but are not very good at comprehending the magnitude of absolute differences. Therefore, scaling is a necessity.
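A sketch using sklearn's MinMaxScaler, fitted on the training split only and applied to the validation and test data:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(df_train[sensor_names])

X_train_scaled = scaler.transform(df_train[sensor_names])
X_val_scaled = scaler.transform(df_val[sensor_names])
X_test_scaled = scaler.transform(test.groupby('unit_nr').last()[sensor_names])
```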

Note, the train-validation split is exactly the same as the previous one. I’ve taken more precautions to ensure results are reproducible and comparable, which you can read about here.
Multilayer Perceptron (MLP)
The MLP is considered the ‘vanilla’ NN, suitable for learning non-linear patterns and consisting of at least an input, hidden and output layer [6]. Creating and training this type of NN in keras is relatively simple.
First, we define the input dimension, which is the number of features the NN expects. Then we create a model object by instantiating a Sequential model and adding a few Dense layers, specifying the number of nodes in each layer. The input dimension should be passed to the first Dense layer. The last layer in the model should be a Dense layer with a single node, since we expect the model to return a single value (the predicted RUL) for a single row of data.
Afterwards, we compile the model. At this stage the random weights are initialized. We also define the optimizer, which is responsible for updating the weights, and the loss function, which is the quantity to optimize.
Next, we train the model. One additional argument is passed to the fit function: the number of epochs. The number of epochs represents how often the data is presented to the model for learning; in this case we'll present the dataset 20 times.
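Putting these steps together could look something like the sketch below; the layer sizes are illustrative and not the tuned architecture.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

input_dim = X_train_scaled.shape[1]  # number of features the NN expects

model = Sequential()
model.add(Dense(16, input_dim=input_dim, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1))  # single node: the predicted RUL

# choose the optimizer and the loss function to optimize
model.compile(loss='mean_squared_error', optimizer='adam')

# present the dataset 20 times to the network, tracking validation loss as we go
history = model.fit(X_train_scaled, y_train,
                    validation_data=(X_val_scaled, y_val),
                    epochs=20)
```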

When inspecting the training results, it’s important to look at the reductions in training and validation loss. When both losses are going down, everything is going well and the model is learning the patterns in the dataset. When the training loss is going down but the validation loss is (steadily) going up, the model is overfitting, which we want to avoid. There is an easier way to interpret the loss, though, and that is by plotting it.
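The loss values per epoch are stored in the history object returned by fit, so plotting them is straightforward:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(13, 4))
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('loss (MSE)')
plt.legend()
plt.show()
```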

Note that the validation loss sometimes spikes upwards but then decreases again, indicating overfitting at that specific epoch; in general, though, the model is not overfit. We might even be able to train for a few more epochs. For now, let’s check how the model performs.
# returns:
# train set RMSE:22.393854937161706, R2:0.7111246569724521
# test set RMSE:34.00842713320483, R2:0.600100397460153
We now have a model whose performance comes close to that of the baseline model. Time to start feature engineering.
One hot encoding
When I read the engines in this dataset would run on different operating conditions, I immediately thought of one hot encoding. Unfortunately, it didn’t work so well, but I will take you through my thought process.
One hot encoding is often used to encode a categorical variable into binary columns. In this case, each unique operating condition would be a ‘category’, see the image below.

Each row has only one unique operating condition active, i.e. only one column contains a 1 while the others contain 0. Appending this information to your data while removing the original settings makes it easier for the NN to figure out which unique combination of settings is being used and how it relates to the target variable.
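One possible implementation, where the operating-condition label is derived from the (rounded) settings and encoded with pandas' get_dummies; the rounding and the column names are assumptions:

```python
def add_operating_condition(df):
    # label each row with its unique combination of (rounded) settings
    df = df.copy()
    df['op_cond'] = (df['setting_1'].round().astype(str) + '_' +
                     df['setting_2'].round(decimals=2).astype(str) + '_' +
                     df['setting_3'].astype(str))
    return df

train = add_operating_condition(train)
condition_dummies = pd.get_dummies(train['op_cond'], prefix='cond')
train_ohe = pd.concat([train.drop(columns=setting_names + ['op_cond']),
                       condition_dummies], axis=1)
train_ohe.head()
```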
Training a model with these new features did yield an improvement (train RMSE = 20.80, test RMSE = 32.01), but there are two problems with this implementation:
- The improvement due to one hot encoding is not nearly as large as I would have hoped
- This implementation leaves little room for additional feature engineering. For example, we can’t use any signal filters when the sensor signals deviate so much between operating conditions. In addition, adding lagged variables is impractical, as lags need to be included to account for changing operating conditions as well.
I’ve kept the code in the notebook for reference, you can find the link at the bottom of this article. Next, we’ll try a different method for dealing with the operating conditions.
Condition-based standardization
Previously we’ve tried min-max scaling the complete dataset at once. In condition-based standardization you group all records of a single operating condition together and scale them using a standard scaler. Applying this type of scaling will bring the mean of the grouped operating conditions to zero. Because you’re applying this technique for each operating condition separately, all signals will receive a mean of zero, bringing them somewhat to the same line and making them comparable [7]. This type of standardization works very well if the signals have similar behavior but are centered around a different mean. Please do note, this technique can only be used if the operating condition itself has no effect on the imminence of breakdown!
First, we’ll add the unique operating condition to the dataframe, similar to the first step of one hot encoding.
To verify the signals have different means but similar behavior we’ll plot the points of all engines for a single sensor, for each operating condition.

The plot confirms the signals indeed seem to have different mean values but otherwise similar behavior. So finally, we’ll add the condition-based standardization and plot the sensor signals as we would have normally done during EDA.
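A sketch of the condition-based standardization: one StandardScaler is fitted per operating condition on the training data and applied to the matching rows of the test set. The op_cond label is created in the same way as in the one-hot encoding sketch above.

```python
from sklearn.preprocessing import StandardScaler

def condition_scaler(df_train, df_test, sensor_names):
    # fit a scaler per operating condition and apply it to both sets
    scaler = StandardScaler()
    for condition in df_train['op_cond'].unique():
        train_mask = df_train['op_cond'] == condition
        test_mask = df_test['op_cond'] == condition
        scaler.fit(df_train.loc[train_mask, sensor_names])
        df_train.loc[train_mask, sensor_names] = scaler.transform(df_train.loc[train_mask, sensor_names])
        df_test.loc[test_mask, sensor_names] = scaler.transform(df_test.loc[test_mask, sensor_names])
    return df_train, df_test

test = add_operating_condition(test)  # train already received its op_cond label above
train, test = condition_scaler(train, test, sensor_names)
```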

Sensor 1, 5, 16 and 19 look similar and don’t seem very useful at this point.

Sensor 2, 3, 4, 11, 15, and 17 show a similar upward trend and should be included for further model development.

Sensor 6, 10 and 16 look similar and don’t seem to hold any useful information after using this standardization technique.

Sensor 7, 12, 20 and 21 show a similar downward trend and should be included for further model development.

Sensors 8 and 13 almost look as if the standardization didn't fully work towards the end, as the signal still jumps up and down quite a bit. I think their trend is similar to that of sensors 9 and 14 (see below), just a bit noisier. We'll keep them in for now.

Sensors 9 and 14 look similar to sensors 8 and 13, but are less noisy. I would also consider these for further model development.

Last and also least (pun intended) is sensor 18, which remarkably seems to hold no information whatsoever after condition-based standardization. We should leave it out.
To summarize, the sensors which seem usable after condition-based standardization are sensors 2, 3, 4, 7, 8, 9, 11, 12, 13, 14, 15, 17, 20 and 21.
We can now fit a new model and check the results.

The validation loss has dropped by over 100 points, which is a good sign. Let's evaluate to see the full effect of condition-based standardization.
# returns:
# train set RMSE:19.35447487320054, R2:0.7842178390649497
# test set RMSE:29.36496528407863, R2:0.7018485932169383
With a test RMSE of 29.364 we’ve already achieved an improvement of 10.04% over our baseline model. Let’s see if we can improve it even further.
Lagged variables
As in the timeseries article, we can add lagged variables to provide the model with information from previous time cycles. Combining the information of multiple cycles allows the model to detect trends and gives it a more complete view of the data. See the example below to grasp the essence of lagged variables.

Row 2 contains the data of the two rows before it. Rows where NA’s are introduced are dropped as the model cannot cope with these. Hence adding many lags greatly impacts the size of your dataset. Let’s train a model with lagged variables to see if performance improves.
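A sketch of how such lagged variables can be added per engine (two lags here, purely for illustration); shifting within groupby('unit_nr') prevents values leaking from one engine into another:

```python
def add_lagged_variables(df, lags, columns):
    df = df.copy()
    for lag in range(1, lags + 1):
        lagged = df.groupby('unit_nr')[columns].shift(lag)
        lagged.columns = ['{}_lag_{}'.format(col, lag) for col in columns]
        df = pd.concat([df, lagged], axis=1)
    # the first rows of every engine now contain NA's and are dropped
    return df.dropna()

train_lagged = add_lagged_variables(train, lags=2, columns=sensor_names)
```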

Looks like validation loss has reduced by another 50 points. Let’s evaluate to see the full effect.
# returns:
# train set RMSE:17.572959256116462, R2:0.8236048587786006
# test set RMSE:29.075227321534157, R2:0.7077031620947969
We’ve gained a slight improvement. However, the model seems to start overfitting after a few epochs; we'll try to combat this later.
A word on stationarity and smoothing
Originally, I tried making the data stationary. As explained in my post on timeseries, stationary data implies the statistical properties of a signal do not change over time. Due to this stationarity, statistical models will be able to predict future values in a robust manner.
You can use an adfuller test to check whether your data is stationary. If it isn't, you can difference your data once to de-trend it and test again. By differencing your data you introduce NA's, as the first row cannot be differenced with any previous row. Rows with NA's have to be dropped, ultimately affecting how many lags we can add, which is the first point against making the data stationary in this scenario: some of the engines in the test set have a very limited number of rows, and we're already introducing NA's by adding lagged variables.
Now, let’s check the stationarity of our data after condition-based standardization.
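A sketch of the adfuller test on a single sensor of a single engine (unit 1 and s_2 are arbitrary choices); a p-value below 0.05 suggests the signal can be considered stationary:

```python
from statsmodels.tsa.stattools import adfuller

signal = train.loc[train['unit_nr'] == 1, 's_2']
adf_stat, p_value = adfuller(signal)[:2]
print('ADF statistic: {:.3f}, p-value: {:.3f}'.format(adf_stat, p_value))
```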

Although a slight trend is visible, this signal can be considered stationary right off the bat, which holds true for most, but not all, signals in the dataset.
Introducing smoothing
For smoothing I chose to implement an exponential weighted moving average. This smoothing function is quite simple yet very powerful. Essentially it takes the current value and the previous filtered value into account when calculating the filtered value (see simplified formula below) [7].
~Xt = α*Xt + (1 - α)*~Xt-1
Where ~Xt is the filtered value of Xt and α the strength of the filter. The value of alpha lies between 0 and 1. When alpha is 0.8, the filtered data point is composed of 80% of the value at Xt and 20% of the (already filtered) value at Xt-1. Hence, lower values of alpha have a stronger smoothing effect.
Luckily, Pandas already has this type of filter implemented in its exponential weighted function [8]. Now let’s check the stationarity of our data after condition-based standardization and smoothing.
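A sketch of the smoothing step; grouping on unit_nr ensures the filter doesn't smear values across engines, and the alpha value shown here is just an example, not the tuned value.

```python
def exponential_smoothing(df, sensors, alpha):
    df = df.copy()
    # exponentially weighted moving average of each sensor column, per engine
    df[sensors] = df.groupby('unit_nr')[sensors].transform(
        lambda s: s.ewm(alpha=alpha).mean())
    return df

train = exponential_smoothing(train, sensor_names, alpha=0.3)
test = exponential_smoothing(test, sensor_names, alpha=0.3)
```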

So, after smoothing, the signal is no longer considered stationary. Depending on the value of alpha (remember, lower values indicate stronger smoothing) the data would be considered less and less stationary as the general trend would become more apparent when the noise in the signal is smoothed out.
Reaching stationarity would thus require more and more differencing, resulting in a larger reduction of rows in your dataset. In an attempt to reach stationarity, alpha values below 0.69 would cause all rows of some engines in the test set to be dropped completely, something we want to avoid. However, the effect of filtering with values of alpha >= 0.69 isn’t that strong and therefore doesn’t improve model performance that much. In the end I chose to abandon stationarity, as stronger smoothing gave better model performance.
Let’s add smoothing to our data preparations and retrain our model to see where we are.

# returns:
# train set RMSE:15.916032412669667, R2:0.85530069539296
# test set RMSE:27.852681524828476, R2:0.731767185824673
Great, our test RMSE has reduced quite a bit more to 27.853. There is one more thing I’d like to test. Using the same model, we’ll crank up the epochs to see how the validation loss will behave. If it remains relatively constant for a higher number of epochs it means we can use more epochs in hyperparameter tuning.

# returns:
# train set RMSE:19.784558538181454, R2:0.7764114588121707
# test set RMSE:32.17059484378398, R2:0.6421540867400053
Test RMSE has increased significantly, but it should be okay to tune between 10–30 epochs, given that we'll also add dropout to combat overfitting. We now have all the preprocessing components to start hyperparameter tuning.
Hyperparameter tuning
There are a lot of hyperparameters we can potentially tune, which I’ll explain below. To reduce complexity, I choose not to tune all of them.
- Alpha: the strength of our exponentially weighted moving average filter.
- Lags: the number of previous timesteps to add to each row. I choose not to tune this parameter.
- Epochs: the number of times the data is fed to the NN. More epochs allow for better learning, but also introduce the possibility of overfitting.
- Number of layers: the number of layers in our NN. I choose not to tune this parameter.
- Nodes per layer: the number of nodes in each layer.
- Dropout: the fraction of node outputs in a layer which will be set to zero; dropout helps against overfitting [9].
- Optimizer & learning rate: I choose not to tune these parameters.
- Activation function: activation functions can be considered a form of post-processing of layer outputs. For example, after computing the outputs, the famous ReLU activation sets any negative output to zero.
- Batch size: the number of samples fed to the network before updating the weights [10].
We can't reuse the hyperparameter tuning setup of our Random Forest from last time: since the parameters defined above need to be tuned in different places (data preparation, NN architecture and fitting), we'll have to write some custom code.
Let's start by defining the hyperparameters and the functions to prepare our data and create the model.
The data preparation and model creation functions should look familiar, as we've used the same logic earlier in this post.
Now, the code block for hyperparameter tuning is quite large. Check out the code first and read my explanation below.
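Below is a condensed sketch of what such a random-search loop could look like. The helper functions prep_data and create_model, as well as the candidate value lists, are assumptions; the full version is in the notebook linked at the bottom of this article.

```python
import random
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# candidate values to sample from (illustrative)
alpha_list = [0.05, 0.1, 0.2, 0.3, 0.5, 0.7]
epoch_list = [10, 15, 20, 25, 30]
nodes_list = [[16, 32], [32, 64], [64, 128], [128, 256]]
dropout_list = [0.0, 0.1, 0.2, 0.3]
activation_list = ['relu', 'tanh', 'sigmoid']
batch_size_list = [32, 64, 128, 256]

ITERATIONS = 100
results = []

for i in range(ITERATIONS):
    # randomly sample a value for each hyperparameter
    alpha = random.choice(alpha_list)
    epochs = random.choice(epoch_list)
    nodes = random.choice(nodes_list)
    dropout = random.choice(dropout_list)
    activation = random.choice(activation_list)
    batch_size = random.choice(batch_size_list)

    # pre-process the data for this alpha (smoothing, lags, scaling) -- assumed helper
    X, y, groups = prep_data(train, alpha)

    # 3-fold group shuffle split for cross-validation
    gss = GroupShuffleSplit(n_splits=3, train_size=0.80, random_state=42)
    val_losses = []
    for train_idx, val_idx in gss.split(X, y, groups=groups):
        model = create_model(X.shape[1], nodes, dropout, activation)  # assumed helper
        history = model.fit(X.iloc[train_idx], y.iloc[train_idx],
                            validation_data=(X.iloc[val_idx], y.iloc[val_idx]),
                            epochs=epochs, batch_size=batch_size, verbose=0)
        val_losses.append(history.history['val_loss'][-1])

    # store the sampled parameters with the mean and spread of the validation loss
    results.append({'alpha': alpha, 'epochs': epochs, 'nodes': nodes,
                    'dropout': dropout, 'activation': activation,
                    'batch_size': batch_size,
                    'mean_val_loss': np.mean(val_losses),
                    'std_val_loss': np.std(val_losses)})

results_df = pd.DataFrame(results).sort_values('mean_val_loss')
results_df.head()
```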
Like last time, the type of hyperparameter tuning implemented here can be considered random search. 60 iterations should be enough to get you within 95% of the best solution [11]; more iterations will get you closer still. I've set the number of iterations to 100. For each iteration we randomly sample hyperparameter values from our lists of defined parameters. Based on these parameters, the dataset is pre-processed and the model created. We'll use the group shuffle split introduced earlier for cross-validation; however, for hyperparameter tuning we'll use 3 splits.
After training, we save the parameters used, along with the mean and standard deviation of the validation loss. The standard deviation of the validation loss gives an indication of how robustly the model performs across the validation splits of your dataset. Let's have a look at the output.

As you can see, the chosen parameters can strongly influence model performance: the best model has a mean validation loss of 171.19, while the worst has a mean validation loss of 3640.33, which is 21 times worse!
The final model
We train our final model based on the best parameters.
# returns:
# train set RMSE:12.82085148416833, R2:0.9061075756420954
# test set RMSE:25.352943890689797, R2:0.7777536332595578
With an RMSE of 25.353 we've improved 22.33% over our baseline model, neat! This could be pushed a little further by also tuning the lags, the number of layers, the optimizer and the learning rate.
For the complete notebook you can check out my github page here. I would like to thank Maikel Grobbe for his inputs and reviewing my article. Next time we’ll delve into the last dataset and an LSTM. Have any questions or remarks? Let me know in the comments below!
References:
[1] https://keras.io/getting_started/faq/#how-can-i-obtain-reproducible-results-using-keras-during-development
[2] https://stackoverflow.com/questions/32419510/how-to-get-reproducible-results-in-keras/59076062#59076062
[3] https://www.onlinemathlearning.com/probability-without-replacement.html
[4] The importance of problem framing for supervised predictive maintenance solutions
[5] F. O. Heimes, "Recurrent neural networks for remaining useful life estimation," 2008 International Conference on Prognostics and Health Management, Denver, CO, 2008, pp. 1–6, doi: 10.1109/PHM.2008.4711422.
[6] https://en.wikipedia.org/wiki/Multilayer_perceptron
[7] Duarte Pasa, G., Paixão de Medeiros, I., & Yoneyama, T. (2019). Operating Condition-Invariant Neural Network-based Prognostics Methods applied on Turbofan Aircraft Engines. Annual Conference of the PHM Society, 11(1). https://doi.org/10.36001/phmconf.2019.v11i1.786
[8] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.ewm.html
[9] https://www.tensorflow.org/tutorials/keras/overfit_and_underfit#add_dropout
[10] https://www.tensorflow.org/api_docs/python/tf/keras/Sequential
[11] Zheng, Alice. Evaluating Machine Learning Models. O'Reilly Media, Inc. 2015