Exploring NASA’s turbofan dataset
Disclaimer: I aim to showcase the effect of different methods and choices made during model development. These effects are often shown using the test set, something which is considered (very) bad practice but helps for educational purposes.
In the last post we developed an MLP with lagged variables for FD002, a dataset in which the engines run on six different operating conditions. Today we’ll wrap up the series and develop an LSTM for dataset FD004, in which the engines can develop two different faults in addition to running on multiple operating conditions. It’s the biggest challenge within NASA’s turbofan dataset yet, but we can use a lot of the building blocks from last time. Let’s get started!

Loading data
First, we’ll import the required libraries and set the random seeds, such that results will be reproducible. I’ve taken more precautions to create reproducible results, which you can read about here.
We’ll load the data next and inspect the first few rows.
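The loading step is straightforward; a minimal sketch could look as follows. It assumes the raw CMAPSS text files (train_FD004.txt, test_FD004.txt and RUL_FD004.txt) sit in the working directory, and the column names follow the dataset documentation.

```python
import numpy as np
import pandas as pd

# column names as described in the CMAPSS documentation
index_names = ['unit_nr', 'time_cycles']
setting_names = ['setting_1', 'setting_2', 'setting_3']
sensor_names = ['s_{}'.format(i) for i in range(1, 22)]
col_names = index_names + setting_names + sensor_names

# the raw files are space separated and have no header row
train = pd.read_csv('train_FD004.txt', sep=r'\s+', header=None, names=col_names)
test = pd.read_csv('test_FD004.txt', sep=r'\s+', header=None, names=col_names)
y_test = pd.read_csv('RUL_FD004.txt', sep=r'\s+', header=None, names=['RemainingUsefulLife'])

train.head()
```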

Looking good. As usual, we compute the Remaining Useful Life (RUL) of the training set.
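A sketch of that computation, using the column names from the loading step above: every engine in the training set runs to failure, so its RUL at any cycle is simply the difference between its last recorded cycle and the current cycle.

```python
def add_remaining_useful_life(df):
    # the last observed cycle of each engine marks its failure point
    max_cycle = df.groupby('unit_nr')['time_cycles'].transform('max')
    result = df.copy()
    result['RUL'] = max_cycle - result['time_cycles']
    return result

train = add_remaining_useful_life(train)
```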

As in the last post, these engines are running on multiple operating conditions. Plotting won’t do us much good without some pre-processing. So, let’s create the baseline model before continuing our analysis.
Baseline model
For the baseline model we’ll train a linear regression on the available sensors and engine settings. We’ll set the upper limit of the computed RUL to 125, as it better reflects what we know about the engines’ RUL [1].
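Roughly, the baseline could look like the sketch below; the evaluate helper and the exact preprocessing may differ slightly from my notebook, but the idea is the same: clip the training RUL at 125, fit a linear regression on settings plus sensors, and evaluate on the last available cycle of each test engine.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(y_true, y_hat, label='test'):
    rmse = np.sqrt(mean_squared_error(y_true, y_hat))
    r2 = r2_score(y_true, y_hat)
    print('{} set RMSE:{}, R2:{}'.format(label, rmse, r2))

feature_cols = setting_names + sensor_names
y_train_clipped = train['RUL'].clip(upper=125)

lm = LinearRegression()
lm.fit(train[feature_cols], y_train_clipped)

# the test set is evaluated on the last available cycle of each engine
X_test_last = test.groupby('unit_nr').last().reset_index()

evaluate(y_train_clipped, lm.predict(train[feature_cols]), 'train')
evaluate(y_test['RemainingUsefulLife'], lm.predict(X_test_last[feature_cols]), 'test')
```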
# returns
# train set RMSE:21.437942286411495, R2:0.7220738975061722
# test set RMSE:34.59373591137396, R2:0.5974472412018376
Our baseline model has a training RMSE of 21.43 and a test RMSE of 34.59, which will be the score to beat. Let’s move on to plotting to select our features.
Plotting
To make sense of the sensor values, we’ll apply the same operating condition-based standardization as last time. This form of standardization essentially applies a standard scaler to each group of datapoints running on the same operating condition, centralizing their means and ‘stitching’ the signal together. It works for this dataset because the different operating conditions shift the mean of the signal but don’t seem to affect the signal or the breakdown behavior in any other way.
We’ll first concatenate the values of the settings to get a single variable which indicates the operating condition. Then, we’ll iterate over the groups of operating conditions to fit the standard scaler on the train set and transform both the train and test set. We can now look at some plots.
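A sketch of that procedure; the op_cond concatenation and the condition_scaler helper below are illustrative names, and the rounding of the settings is an assumption to group near-identical values.

```python
from sklearn.preprocessing import StandardScaler

# concatenate the (rounded) settings into a single operating-condition label
def add_operating_condition(df):
    df = df.copy()
    df['op_cond'] = (df['setting_1'].round().astype(str) + '_' +
                     df['setting_2'].round(decimals=2).astype(str) + '_' +
                     df['setting_3'].astype(str))
    return df

train = add_operating_condition(train)
test = add_operating_condition(test)

# fit one scaler per operating condition on the train set, apply it to both sets
def condition_scaler(df_train, df_test, sensor_names):
    scaler = StandardScaler()
    for condition in df_train['op_cond'].unique():
        scaler.fit(df_train.loc[df_train['op_cond'] == condition, sensor_names])
        df_train.loc[df_train['op_cond'] == condition, sensor_names] = scaler.transform(
            df_train.loc[df_train['op_cond'] == condition, sensor_names])
        df_test.loc[df_test['op_cond'] == condition, sensor_names] = scaler.transform(
            df_test.loc[df_test['op_cond'] == condition, sensor_names])
    return df_train, df_test

train, test = condition_scaler(train, test, sensor_names)
```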

Like last time, sensors 1, 5, 16 and 19 look similar but don’t seem very useful.

Sensors 2, 3, 4, 11 and 17 show a similar upward trend and should be included for further model development.

Sensors 6, 10 and 16 don’t reveal much of a trend, so we’ll leave those out.

Sensors 7, 12, 15, 20 and 21 clearly show the distinction between the two faults being developed.

Sensors 8, 9, 13 and 14 show similar patterns, but with the addition of a second fault mode these signals might do more harm than good for model performance, as they don’t seem to distinguish between the faults very well. Trying a model with and without these sensors will have to show whether they should be included.

Sensor 18 doesn’t seem to hold any information after condition-based standardization. Time to start preparing for our LSTM.
LSTM
Explaining how an LSTM works is a bit out of scope for this blogpost; you can find excellent resources on the internet if you wish to read up on the technique [2]. I will focus mostly on how to apply the algorithm. I’ve mainly chosen the LSTM for its capability to work with sequences, which is a useful way of arranging your data when working with timeseries. I’ll explain how to create sequences further down below.
Data preparation
We can re-use some of the functions from last time, like exponential smoothing and train-validation split.
Exponential smoothing is a simple yet powerful smoothing algorithm. See the simplified formula below [3].
X̃_t = α · X_t + (1 − α) · X̃_{t−1}
Here X̃_t is the filtered value of X_t and α is the strength of the filter. α can take a value between 0 and 1. When α = 0.8, the filtered datapoint consists of 80% of the value at X_t and 20% of the (already filtered) value X̃_{t−1}. Values of α closer to 0 result in a more prominent smoothing effect.
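In pandas this maps directly onto ewm(alpha=...) with adjust=False. The sketch below applies it per engine so the smoothing never crosses engine boundaries; the default alpha is just a placeholder.

```python
def exponential_smoothing(df, sensors, alpha=0.3):
    df = df.copy()
    # ewm(adjust=False) implements X~_t = alpha * X_t + (1 - alpha) * X~_{t-1}
    df[sensors] = df.groupby('unit_nr', group_keys=False)[sensors].apply(
        lambda group: group.ewm(alpha=alpha, adjust=False).mean())
    return df

train = exponential_smoothing(train, sensor_names, alpha=0.3)
test = exponential_smoothing(test, sensor_names, alpha=0.3)
```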
It’s important for model validation that records of a single engine don’t get divided between train and validation sets. The LSTM may be able to predict correctly on the validation set by interpolation, which provides a false sense of the prediction error. When the model is fed truly unseen data it will perform much worse as it can no longer predict the RUL by interpolation. The train_val_group_split splits the data in such a way that all records of a single engine are assigned to either the training or validation set.
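A sketch of how such a helper can be implemented with sklearn’s GroupShuffleSplit, using unit_nr as the grouping variable; the exact signature in my notebook may differ.

```python
from sklearn.model_selection import GroupShuffleSplit

def train_val_group_split(X, y, groups, test_size=0.2, random_state=7):
    # every unit_nr ends up entirely in either the train or the validation set
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=random_state)
    train_idx, val_idx = next(splitter.split(X, y, groups=groups))
    return X.iloc[train_idx], y.iloc[train_idx], X.iloc[val_idx], y.iloc[val_idx]
```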
Next, we’ll discuss generating the sequences used by LSTM models.
Sequences
Ideally, you’d use something like TimeseriesGenerator from tensorflow to create your sequences, but there are a few reasons why I opt to use custom code.
First, we need to account for unit_nrs. Sequences should only have data of a single unit_nr to prevent mixing records where failure is imminent with records of the next engine in the train set which is still running fine.
Second, timeseries models usually use the timepoints up to and including Xt to predict Yt+1, whereas I want to predict Yt. Predicting Yt instead of Yt+1 achieves two things:
(I) First and foremost, it’s in line with all previous data framings we’ve used throughout this short series and
(II) it allows us to feed a tiny bit more data to the algorithm. Where you’d naturally use your second-to-last record to predict the last target, we can now use the last record to predict the last target (see figure 1 below). Although it’s only 1 additional record per engine, it can make a difference for very small subsets. For example, some engines in the test set only consist of a couple of records.

I have been inspired by code on Microsoft Azure’s GitHub page to create the sequences [4], but have made some considerable changes. The code has been updated to implement the sequence generation on the right (see figure 1) and to pad sequences of the test set (which I’ll explain further down below).
The example dataframe from figure 1 is used to showcase the effects of the sequence generating code.

The function gen_train_data was designed to receive a dataframe containing records of a single engine. When creating sequences of length=4, it is able to return 2 arrays (or sequences): one from index 0 up to and including index 3, and the other from index 1 up to and including index 4. The values of the returned sequences should make it easy to check them against the example dataframe (figure 1) and understand what is happening.
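A sketch of gen_train_data along those lines, adapted from the Azure example [4]; my actual implementation may differ in details.

```python
def gen_train_data(df, sequence_length, columns):
    """Generate sequences for one engine; df should contain a single unit_nr."""
    data = df[columns].values
    num_elements = data.shape[0]
    # with 5 rows and sequence_length=4 this yields data[0:4] and data[1:5]
    for start, stop in zip(range(0, num_elements - (sequence_length - 1)),
                           range(sequence_length, num_elements + 1)):
        yield data[start:stop, :]
```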
Next, we’ll write a wrapper function to generate these sequences for multiple engines.
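A matching wrapper could look like the sketch below (the name and signature are illustrative).

```python
def gen_data_wrapper(df, sequence_length, columns, unit_nrs=None):
    if unit_nrs is None:
        unit_nrs = df['unit_nr'].unique()

    all_sequences = []
    for unit_nr in unit_nrs:
        unit_df = df[df['unit_nr'] == unit_nr]
        all_sequences.extend(gen_train_data(unit_df, sequence_length, columns))
    # shape: (total number of sequences, sequence_length, number of columns)
    return np.stack(all_sequences).astype(np.float32)
```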

We’ll do something similar for the labels to predict.
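For the labels we only need the target value at the last timestep of each sequence; a sketch, assuming the label is passed as a list of column names:

```python
def gen_labels(df, sequence_length, label):
    """Return the label at the final timestep of every sequence of one engine.

    label is a list, e.g. ['RUL'], so a 2D array is returned."""
    data = df[label].values
    num_elements = data.shape[0]
    # with 5 rows and sequence_length=4 this returns the labels at index 3 and 4
    return data[sequence_length - 1:num_elements, :]
```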

Notice that the 3rd and 4th index values are returned, matching the sequences above. Again, the wrapper is able to generate the labels for multiple engines.

Generating the test data is a bit trickier. The original code would discard engines from the test set if they had fewer records than the desired sequence length [4], because models generally can’t cope with sequences which vary in length. By introducing padding, we can keep all the engines in the test set. Let’s say we want to create a sequence of length=5, but we only have 2 rows available. We can prefix (or pad) the missing 3 rows with some dummy values so our model will receive sequences of the desired length.

Later on we can program our model to ignore the dummy values, but let’s first implement a padding function for our example dataframe.
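A sketch of the padded test-sequence generation; the mask_value parameter holds the dummy value we’ll later tell the model to ignore.

```python
def gen_test_data(df, sequence_length, columns, mask_value):
    """Return the last sequence of one engine, front-padded with mask_value if needed."""
    if df.shape[0] < sequence_length:
        data_matrix = np.full((sequence_length, len(columns)), mask_value, dtype=np.float32)
        idx = sequence_length - df.shape[0]
        # the available rows fill the bottom of the matrix, dummy values sit on top
        data_matrix[idx:, :] = df[columns].values
    else:
        data_matrix = df[columns].values[-sequence_length:, :].astype(np.float32)

    yield data_matrix
```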

The example dataframe has 5 rows per engine, and in the code above we create a sequence of length=6. As in the padding example, the available rows are returned as a sequence with dummy values prefixed, or padded, to get the desired sequence length. Note that I’ve defined the mask value as a float to be in line with the other values in the dataframe.
We now have all the building blocks to train the first LSTM.
Model training
First, we combine all the preprocessing steps to prepare our data for modelling.
The initial model is defined with a single LSTM layer. The masking layer allows us to pass the padded dummy values without the model interpreting them. Next, we’ll compile the model and save its weights.
The model is recompiled, and its weights reloaded before training to ensure reproducibility [5].
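Roughly, the model definition and the compile/save/reload routine look like the sketch below. The layer size, epochs and batch size are placeholders, remaining_sensors holds the sensors selected during plotting, and train_array/label_array (plus their validation counterparts) come from the sequence wrappers above.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Masking, LSTM, Dense

mask_value = -99.0
sequence_length = 30                    # placeholder value
n_features = len(remaining_sensors)     # sensors selected in the plotting section

model = Sequential()
# the Masking layer makes the LSTM skip the padded (dummy) timesteps
model.add(Masking(mask_value=mask_value, input_shape=(sequence_length, n_features)))
model.add(LSTM(32, activation='tanh'))
model.add(Dense(1))

model.compile(loss='mean_squared_error', optimizer='adam')
model.save_weights('initial_lstm_weights.h5')   # store the freshly initialized weights

# recompile and reload the initial weights before every training run
model.compile(loss='mean_squared_error', optimizer='adam')
model.load_weights('initial_lstm_weights.h5')
history = model.fit(train_array, label_array,
                    validation_data=(val_array, val_label_array),
                    epochs=15, batch_size=128)
```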

Notice the training times are about 5 times higher compared to the MLP from last time. At this stage I would normally start looking into cloud computing (especially for hyperparameter tuning further down below), but for this series we’ll keep everything on a commodity laptop.

Looking at the train and validation loss everything seems to be fine. Let’s evaluate model performance.
# returns:
# train set RMSE:16.55081558227539, R2:0.8388066627963793
# test set RMSE:29.043230109223934, R2:0.7162618665206494
With an RMSE of 29.043 the current LSTM is already a 16.04% improvement over the baseline model.
There are two more things to check before hyperparameter tuning:
- (I) model performance without sensors 8, 9, 13 and 14 and
- (II) validation loss behavior when running more epochs
To test the model performance without sensors 8, 9, 13 and 14 we have to update our remaining_sensors variable.
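Assuming the sensor columns are named s_1 through s_21, the update boils down to:

```python
# keep the trending and fault-distinguishing sensors, drop sensors 8, 9, 13 and 14
remaining_sensors = ['s_2', 's_3', 's_4', 's_7', 's_11', 's_12',
                     's_15', 's_17', 's_20', 's_21']
```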
Because we remove a few sensors from the model inputs, we also have to re-instantiate our model to accommodate the change in data shape. Re-instantiating our model means a new set of random weights will be initialized, making it more difficult to judge whether changes in model performance are due to the change in features or to the random weights. No other changes have to be made, so let’s re-run everything and check the results to see if we can draw any conclusions.


# returns:
# train set RMSE:17.098350524902344, R2:0.8279650412665183
# test set RMSE:29.36311002353286, R2:0.709977307051393
Both training and validation loss have increased a bit and the test RMSE has also increased by 0.3. It may be difficult to fully attribute this change to the removal of the sensors rather than to the re-initialized model weights. To be sure, we’ll include both sets of sensors in the hyperparameter tuning approach.
Like last time, let’s train once more with a considerably increased number of epochs to see how the validation loss behaves.

# returns:
# train set RMSE:15.545417785644531, R2:0.857795579414809
# test set RMSE:29.32099593032191, R2:0.7108086415870063
The validation loss seems quite stable; however, the model starts to overfit slightly after 15 epochs. We can play around with different epochs but shouldn’t push it too far.
Hyperparameter tuning
We’ll use a similar hyperparameter tuning setup as last time.
Parameters to tune are:
- alpha, filter strength
- sequence_length
- epochs
- number of layers
- nodes per layer
- dropout
- optimizer (I choose to not tune this parameter)
- learning rate (I choose to not tune this parameter)
- activation function
- batch size
- included sensors
Let’s define the possible values for the parameters.
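The exact grid isn’t reproduced here; the dict below merely illustrates the structure and the kind of values to consider.

```python
# illustrative candidate values, not the exact grid from the tuning run
hyperparameters = {
    'alpha': [0.01, 0.05, 0.1, 0.3, 0.5],
    'sequence_length': [10, 20, 30, 40],
    'epochs': [10, 15, 20],
    'nodes_per_layer': [[32], [64], [128], [64, 32], [128, 64]],  # also defines the number of layers
    'dropout': [0.0, 0.1, 0.2],
    'activation': ['tanh', 'sigmoid'],
    'batch_size': [32, 64, 128],
    'sensors': ['all', 'without_8_9_13_14'],
}
```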
With over 100k unique hyperparameter combinations, testing all of them would take ages. I’m opting for a random gridsearch as it’s much less time consuming with only a minor setback in overall performance [6].
Next, we’ll define the functions to prepare our data and create the model.
Finally, we define the number of iterations and the main code for running the hyperparameter tuning.
Note that for the GroupShuffleSplit the number of splits has been set to 3, meaning we’ll train and cross-validate each unique combination of hyperparameters 3 times. The mean and standard deviation of the validation loss are saved alongside the hyperparameters for that specific iteration.
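Putting it together, the main loop could look like the sketch below. Here prep_data and build_model stand in for the data-preparation and model-creation functions mentioned above; their real names and signatures may differ.

```python
import random
from sklearn.model_selection import GroupShuffleSplit

unit_nrs = train['unit_nr'].unique()
iterations = 60
results = []

for i in range(iterations):
    # randomly sample one value per hyperparameter
    params = {key: random.choice(values) for key, values in hyperparameters.items()}

    gss = GroupShuffleSplit(n_splits=3, test_size=0.2, random_state=42)
    val_losses = []
    for train_idx, val_idx in gss.split(unit_nrs, groups=unit_nrs):
        # prep_data and build_model are placeholders for the helpers defined above
        train_array, label_array, val_array, val_label_array = prep_data(
            train, unit_nrs[train_idx], unit_nrs[val_idx], params)
        model = build_model(params)
        history = model.fit(train_array, label_array,
                            validation_data=(val_array, val_label_array),
                            epochs=params['epochs'], batch_size=params['batch_size'],
                            verbose=0)
        val_losses.append(history.history['val_loss'][-1])

    # store the mean and standard deviation of the validation loss per combination
    results.append({**params,
                    'val_loss_mean': np.mean(val_losses),
                    'val_loss_std': np.std(val_losses)})

results_df = pd.DataFrame(results).sort_values('val_loss_mean')
```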
I’ve run the random gridsearch multiple times with different variations (e.g. with early stopping and different ranges for the parameters). The last tuning job was halted after 55 iterations as my laptop slowed to a crawl; however, the best result seemed to be rather consistent, with a validation loss slightly above 200 MSE. Let’s have a look at the result.


As you can see, iterations without sensors 8, 9, 13 and 14 didn’t make it to the top 5, giving stronger evidence that it is better to keep those sensors included. Let’s retrain our model with the best performing hyperparameters and check the result.
The Result

# returns:
# train set RMSE:12.35975170135498, R2:0.9112171726749143
# test set RMSE:25.35340838205415, R2:0.7837776516770107
The final test RMSE is 25.353, which is a 26.71% improvement over our baseline model. When compared to the literature (see overviews in [3] and [8]), results seem to be on par with those of 2017/2018 state-of-the-art approaches, which I think is quite neat given the solution isn’t overly complex. Running more iterations and also tuning the optimizer and learning rate could probably push this a little further.
Reflections & wrap-up
Looking back at this short series, I had a lot of fun analysing the datasets and developing the models. It was really interesting for me to see how the change in problem framing, by clipping the linearly computed RUL to an upper limit, arguably resulted in the largest improvement in model accuracy, showcasing the importance of framing your problem correctly. In addition, it was a personal win to learn and apply survival analysis to a predictive maintenance dataset, as I had not encountered a good and clear example of that technique for predictive maintenance.
I hope this series gave you a good introduction to (some) predictive maintenance methods. The write-up, examples and explanations would have definitely helped me when starting out. If you want to improve further, I’d suggest reading a few papers on more complex preprocessing and neural network architectures, as well as changing the error measure from RMSE to one which penalizes late predictions (see [3] for an example of a different error measure). Another interesting approach would be to calculate/predict an asset health index [8–10]. The idea is to have a single, abstract KPI informing on the overall health of the equipment; you could break this down further by specifying each sensor’s contribution to the overall score.
Last but not least, incorporating domain knowledge greatly improves your ability to come up with a fitting solution. Try to understand the mechanics at play and talk to experts to get pointers you may have overlooked or couldn’t understand based on the data alone.
You can find the full code on my github page here. I would like to give a big shout-out to Maikel Grobbe, for having numerous discussions with me and reviewing all my articles. In addition, I’d like to thank you for reading and as always, if you have any questions or remarks, please leave them in the comments below!
References
[1] F. O. Heimes, "Recurrent neural networks for remaining useful life estimation," 2008 International Conference on Prognostics and Health Management, Denver, CO, 2008, pp. 1–6, doi: 10.1109/PHM.2008.4711422.
[2] http://colah.github.io/posts/2015-08-Understanding-LSTMs/
[3] Duarte Pasa, G., Paixão de Medeiros, I., & Yoneyama, T. (2019). Operating Condition-Invariant Neural Network-based Prognostics Methods applied on Turbofan Aircraft Engines. Annual Conference of the PHM Society, 11(1). https://doi.org/10.36001/phmconf.2019.v11i1.786
[4] https://github.com/Azure/lstms_for_predictive_maintenance
[5] Primer on developing reproducible Neural Networks in Jupyter Notebook
[6] Zheng, Alice. Evaluating Machine Learning Models. O’Reilly Media, Inc., 2015.
[7] J. Li, X. Li and D. He, "A Directed Acyclic Graph Network Combined With CNN and LSTM for Remaining Useful Life Prediction," IEEE Access, vol. 7, pp. 75464–75475, 2019, doi: 10.1109/ACCESS.2019.2919566.
[8] https://www.researchgate.net/profile/Jianjun_Shi/publication/260662503_A_Data-Level_Fusion_Model_for_Developing_Composite_Health_Indices_for_Degradation_Modeling_and_Prognostic_Analysis/links/553e47d80cf20184050e16ea.pdf
[9] http://www.pubmanitoba.ca/v1/exhibits/mh_gra_2015/coalition-10-3.pdf
[10] https://www.wapa.gov/About/the-source/Documents/AMtoolkitTSSymposium0818.pdf