Exploring NASA’s Turbofan dataset
<disclaimer: I aim to showcase the effect of different methods and choices made during model development. These effects are often shown using the test set, something which is considered (very) bad practice but helps for educational purposes.>
Welcome to another installment of the ‘Exploring NASA’s turbofan dataset’ series. In the last post we looked at survival analysis and wrapped up our analysis of dataset FD001. Although the final RMSE wasn’t as good as that of the earlier models we created, it’s a very interesting technique to keep in mind as it can deal with censored data. Today we’ll dive into the third dataset (FD003), which is characterized by the engines having two possible fault modes.

I have chosen to switch the order in which to post on FD002 and FD003 as, in my opinion, this order gives a more gradual increase in complexity. At first, I had started fitting a Random Forest (RF) on FD002, but results weren’t all that good. FD002 looks like it really requires more complex preprocessing and modelling to deal with the various operating conditions. On FD003 however, I think the RF will perform pretty well, as it’s naturally capable of differentiating between the fault modes. Let’s dive in and find out!
Exploratory Data Analysis (EDA)
We can repeat a lot of the steps from our EDA on FD001. First, let’s read in the data.
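A minimal sketch of how the data can be read, assuming the raw CMAPSS text files (train_FD003.txt, test_FD003.txt and RUL_FD003.txt) sit next to the notebook and using the column naming convention from the earlier posts in this series:

```python
import pandas as pd

# column names: unit number, time cycle, three operational settings and 21 sensors
index_names = ['unit_nr', 'time_cycles']
setting_names = ['setting_1', 'setting_2', 'setting_3']
sensor_names = ['s_{}'.format(i) for i in range(1, 22)]
col_names = index_names + setting_names + sensor_names

# the raw files are space separated and have no header
train = pd.read_csv('train_FD003.txt', sep=r'\s+', header=None, names=col_names)
test = pd.read_csv('test_FD003.txt', sep=r'\s+', header=None, names=col_names)
y_test = pd.read_csv('RUL_FD003.txt', sep=r'\s+', header=None, names=['RUL'])

print(train.shape)
```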
# returns
# (24720, 26)

Looks good, next we will inspect some descriptive statistics.
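Continuing from the snippet above, a quick way to get those descriptives; splitting the index/setting columns from the sensor columns is just for readability:

```python
# descriptives of the unit, cycle and setting columns
print(train[index_names + setting_names].describe())

# descriptives of the sensor columns, transposed so each sensor is one row
print(train[sensor_names].describe().transpose())
```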

Our dataset consists of 24.7k rows and 26 columns. The engine that fails first does so after 145 time_cycles, while the engine that lasts longest fails after 525 time_cycles. Next, we’ll inspect the sensor descriptives.

Judging by the standard deviation of (almost) zero, sensors 1, 5, 16, 18 and 19 hold no valuable information.
Before we start plotting our data, let’s compute the Remaining Useful Life (RUL) of the train set.
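Since the training engines are run to failure, the RUL of each row can be computed as the number of cycles left until that engine’s last recorded cycle. A minimal sketch:

```python
# RUL = cycles remaining until the engine's final recorded cycle
max_cycle = train.groupby('unit_nr')['time_cycles'].transform('max')
train['RUL'] = max_cycle - train['time_cycles']
```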

Now that that’s added, we’re going to inspect the distribution to get an even better understanding than we could from inspecting the descriptive table.
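One way to inspect the distribution is to plot a histogram of each engine’s total lifetime, i.e. its maximum RUL; the bin count below is an arbitrary choice:

```python
import matplotlib.pyplot as plt

# histogram of each engine's total lifetime (its maximum RUL)
train.groupby('unit_nr')['RUL'].max().plot(kind='hist', bins=15, figsize=(13, 5))
plt.xlabel('max RUL per engine')
plt.show()
```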

The RUL is clearly right skewed and doesn’t have much of a tail on the left side at all. The skewed distribution can have a big impact on model performance. After EDA is complete, we’ll clip the computed RUL to an upper limit of 125 for model training, as it gives a better representation of our knowledge of RUL for the train set [1, 2].
Plotting signals
Next, we’ll inspect the sensor values to see if we can visually distinguish between the different fault modes and to identify sensors to abandon for model development.
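A sketch of the kind of plotting loop behind the figures below; plotting every 10th engine and reversing the x-axis (so failure sits at RUL = 0 on the right) are choices I’m assuming to keep the figures readable:

```python
import matplotlib.pyplot as plt

def plot_signal(df, sensor_name):
    """Plot a sensor's traces against RUL for a subset of engines."""
    plt.figure(figsize=(13, 5))
    for unit_nr in df['unit_nr'].unique():
        if unit_nr % 10 == 0:  # every 10th engine keeps the plot readable
            subset = df[df['unit_nr'] == unit_nr]
            plt.plot(subset['RUL'], subset[sensor_name])
    plt.xlim(250, 0)  # reversed x-axis: engines run towards failure at RUL = 0
    plt.xlabel('Remaining Useful Life')
    plt.ylabel(sensor_name)
    plt.show()

for sensor_name in sensor_names:
    plot_signal(train, sensor_name)
```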

Sensors 1, 5, 16, 18 and 19 all look similar. We can reconfirm their exclusion as they don’t seem to hold any information.

Sensors 2, 3, 4, 8, 11, 13 and 17 show a similar upward trend and should be included during model development.

Sensor 6 is a bit of an odd one but gets the benefit of the doubt.

Sensors 7, 12, 15, 20 and 21 clearly show the two fault modes and should definitely be included in the model.

Sensors 9 and 14 show a similar trend but don’t differentiate between fault modes very well. Testing their effect on model performance will have to point out if they should be included or not.

Lastly, sensor 10 is also an odd one, but it gets the benefit of the doubt as there seems to be somewhat of an upward trend. With our EDA complete, it’s time to create our baseline model.
Baseline model
Like we did for FD001, I like to start with one of the simplest models possible, a regression model with no additional preprocessing.
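A sketch of what such a baseline could look like. I’m assuming a plain LinearRegression trained on the sensors kept after EDA, with the clipped RUL as target, and the test set evaluated on the last cycle of each engine (which is where the true RUL values in RUL_FD003.txt apply):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(y_true, y_hat, label='test'):
    """Print RMSE and R2 in the format used throughout this post."""
    rmse = np.sqrt(mean_squared_error(y_true, y_hat))
    print('{} set RMSE:{}, R2:{}'.format(label, rmse, r2_score(y_true, y_hat)))

# assumption: drop the sensors without information, train on the clipped RUL
drop_sensors = ['s_1', 's_5', 's_16', 's_18', 's_19']
remaining_sensors = [s for s in sensor_names if s not in drop_sensors]

X_train = train[remaining_sensors]
y_train = train['RUL'].clip(upper=125)

# the test set is evaluated on the last recorded cycle of each engine
X_test = test.groupby('unit_nr').last()[remaining_sensors]

lm = LinearRegression()
lm.fit(X_train, y_train)

evaluate(y_train, lm.predict(X_train), 'train')
evaluate(y_test['RUL'], lm.predict(X_test), 'test')
```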
# returns
# train set RMSE:19.33013490506061, R2:0.7736110804777467
# test set RMSE:22.31934448440365, R2:0.7092939799790338
The test RMSE for the baseline model is 22.319, which will be our score to beat. Next is our first try at the Random Forest Regressor.
Random Forest
One of the key strengths of the Random Forest (RF) over a single decision tree is its ability to generate varied trees. Let me explain. When creating a single decision tree, the algorithm creates a decision node based on the single feature (out of all available features) that best splits your dataset. For the next node it re-examines all available features to create the following best split. If you were to fit a decision tree a second time under these conditions, it would produce the exact same tree. An RF, however, only considers a subset of all features when creating a split. This forces the algorithm into making different trees, as the feature for creating the best split may not be available, potentially coming up with combinations of splits which perform better than the single best splits of a regular decision tree.
Although the description above is how I learned RF works, it’s important to check if the tool you use also implements it in the same way.
![Part of RF documentation from scikit-learn v0.22.2 [3]](https://towardsdatascience.com/wp-content/uploads/2020/11/1bfGjgpQyrcMFysS20VuKTg.png)
![Part of RF documentation from scikit-learn v0.22.2 [3]](https://towardsdatascience.com/wp-content/uploads/2020/11/1Us9Cc2-NqUc8kktW8G8CZw.png)
Examining the documentation of sklearn’s RandomForestRegressor shows it considers all features by default, essentially creating the same tree over and over again (differences due to bootstrapping aside). Therefore, we specify max_features as the square root of the number of available features. In addition, we set the random_state so trees are always generated in the same way. Otherwise the random tree generation would affect model results, making it difficult to judge whether a model performs better because we changed some features or purely due to randomness.
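A sketch of this first Random Forest, reusing the features and evaluate function from the baseline; apart from max_features and random_state, the defaults (100 trees, unlimited depth) are kept:

```python
from sklearn.ensemble import RandomForestRegressor

# consider a random subset of sqrt(n_features) features per split,
# fix random_state so tree construction is reproducible
rf = RandomForestRegressor(max_features='sqrt', random_state=42)
rf.fit(X_train, y_train)

evaluate(y_train, rf.predict(X_train), 'train')
evaluate(y_test['RUL'], rf.predict(X_test), 'test')
```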
# returns
# train set RMSE:5.9199939580022525, R2:0.9787661901585051
# test set RMSE:21.05308450085165, R2:0.7413439613827657
The RF already performs a little better than our baseline model. However, judging by the difference in RMSE and R2 between the train and test sets, the model seems quite overfit. Let’s inspect some tree characteristics to verify my suspicion.
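These characteristics can be read straight from the underlying tree object of one of the forest’s estimators, for example the first one:

```python
# inspect the first tree of the forest
tree = rf.estimators_[0].tree_

print(tree.max_depth)  # depth of the longest path in this tree
tree.n_node_samples    # number of training samples ending up in each node
```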
# returns
# 33
# array([15616, 11694, 7793, ..., 1, 1, 4], dtype=int64)
This tree’s longest path consists of 33 nodes, more than double the number of features we put in. When looking at the n_node_samples, we can see the final leaves of the tree each contain very few samples. The tree has become so specific that it created splitting criteria until most samples could be distinguished, which is really bad for generalization (hence the overfitting on the training set). We can try to fix this by setting the max_depth and min_samples_leaf of the RF.
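The values for max_depth and min_samples_leaf below are the ones that show up in the parameter overview later in this post; treat them as the result of some manual experimentation rather than anything definitive:

```python
rf = RandomForestRegressor(n_estimators=100, max_features='sqrt',
                           max_depth=8, min_samples_leaf=50,
                           random_state=42)
rf.fit(X_train, y_train)

evaluate(y_train, rf.predict(X_train), 'train')
evaluate(y_test['RUL'], rf.predict(X_test), 'test')
```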
# returns
# train set RMSE:15.706704198492831, R2:0.8505294865338602
# test set RMSE:20.994958823842456, R2:0.7427702419664686
Playing around with those settings for a bit reduced overfitting while gaining a slight performance increase. Fitting this same model without s_6 and s_10 performed worse, so those sensors are kept in. We’ll use this model as a base for further improvements.
Visualize RF
We can visualize one of our trees in an attempt to identify points of improvement [4,5].
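One way to do this is to export a single tree to graphviz dot format and render it on the command line, roughly following [4, 5]; the output file name and render settings below are arbitrary:

```python
from sklearn.tree import export_graphviz

# export the first tree of the (tweaked) forest to graphviz dot format
export_graphviz(rf.estimators_[0],
                out_file='tree.dot',
                feature_names=remaining_sensors,
                filled=True,
                rounded=True)

# render with graphviz from the command line, e.g.:
# dot -Tpng tree.dot -o tree.png -Gdpi=300
```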

There seem to be some nodes which result in very inaccurate predictions (mse > 500). Remember sensors 9 and 14, which don’t differentiate between the fault modes very well? They show up as splitting criteria in this part of the tree quite a bit, but the results remain lackluster. Let’s try to fit an RF without those two sensors and check its performance.
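A sketch of that refit without sensors 9 and 14, keeping the other settings identical:

```python
# drop s_9 and s_14 on top of the uninformative sensors
X_train_reduced = X_train.drop(['s_9', 's_14'], axis=1)
X_test_reduced = X_test.drop(['s_9', 's_14'], axis=1)

rf_reduced = RandomForestRegressor(n_estimators=100, max_features='sqrt',
                                   max_depth=8, min_samples_leaf=50,
                                   random_state=42)
rf_reduced.fit(X_train_reduced, y_train)

evaluate(y_train, rf_reduced.predict(X_train_reduced), 'train')
evaluate(y_test['RUL'], rf_reduced.predict(X_test_reduced), 'test')
```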
Rerunning the crudely tweaked RF returns:
# train set RMSE:17.598192835079978, R2:0.8123616776758054
# test set RMSE:22.186214762363356, R2:0.7127516253047333
Unfortunately, performance becomes quite a bit worse, so we’ll keep sensors 9 and 14 included.
A word on feature engineering
There aren’t a lot of possibilities for feature engineering for this particular combination of dataset and algorithm. Random Forests naturally excel at learning complex data patterns and are invariant to feature scaling and other monotonic transformations [6]. Since all features are numeric already, there isn’t much more we can do.
I’ve tried smoothing the data with a simple moving average. In theory this should make it easier for the algorithm to place its splits correctly and to distinguish between the fault modes, as noise is removed from the signal. Unfortunately, performance didn’t increase. The full code for smoothing the sensor signals can be found in the notebook (link at the bottom of the post); a rough sketch is shown below. After that, we’ll delve into hyperparameter tuning.
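This is roughly what such smoothing could look like; the window size is an arbitrary choice, not necessarily the one used in the notebook:

```python
def smooth_sensors(df, sensors, window=10):
    """Apply a simple moving average to the sensor signals, per engine."""
    df = df.copy()
    df[sensors] = (df.groupby('unit_nr')[sensors]
                     .rolling(window=window, min_periods=1)
                     .mean()
                     .reset_index(level=0, drop=True))
    return df

train_smoothed = smooth_sensors(train, remaining_sensors)
```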
Hyperparameters
What parameters can we tune?
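Calling get_params() on the tweaked forest lists everything that can be set:

```python
# list all tuneable hyperparameters of the tweaked random forest
rf.get_params()
```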
# returns
{'bootstrap': True,
'ccp_alpha': 0.0,
'criterion': 'mse',
'max_depth': 8,
'max_features': 'sqrt',
'max_leaf_nodes': None,
'max_samples': None,
'min_impurity_decrease': 0.0,
'min_impurity_split': None,
'min_samples_leaf': 50,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 100,
'n_jobs': None,
'oob_score': False,
'random_state': 42,
'verbose': 0,
'warm_start': False}
For a full description of all parameters, please refer to the official documentation [3]. The biggest challenge in fitting Random Forests is overfitting. The parameters max_depth, min_samples_leaf, ccp_alpha and min_impurity_decrease help reduce overfitting and generate overall better performing models. Therefore, I’ve picked those for model tuning.
Max_depth and min_samples_leaf should be quite self-explanatory, but ccp_alpha and min_impurity_decrease need a bit more explanation.
ccp alpha
Cost complexity pruning alpha is a parameter used for pruning trees. Pruning is the removal of nodes after fitting, so essentially ccp_alpha is an alternative to using min_samples_leaf and max_depth to prevent overfitting.
The effective alphas of the nodes can be retrieved from a fitted tree. Nodes with a low effective alpha add complexity to the tree while barely reducing its error; by pruning the nodes whose effective alpha falls below the chosen ccp_alpha, the tree is simplified and its overall complexity reduced [7].
To get an indication of what range of ccp_alpha values to use for hyperparameter tuning, it’s best to visualize the effective alpha versus the total impurity of the leaves. Note that the analysis below is based on a single tree of the RF.
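A sketch of that visualization, based on the cost complexity pruning example in the sklearn docs [7]; note that cost_complexity_pruning_path refits a clone of the selected tree on the full training set:

```python
import matplotlib.pyplot as plt

# pruning path of a single tree of the forest
single_tree = rf.estimators_[0]
path = single_tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# drop the last value, which corresponds to pruning the tree down to its root
fig, ax = plt.subplots(figsize=(13, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle='steps-post')
ax.set_xlabel('effective alpha')
ax.set_ylabel('total impurity of leaves')
plt.show()
```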

The cost complexity of the tree really increases when the effective alpha drops from ~70 to a bit more than 20, but for lower values of alpha the effect on leaf impurity is difficult to determine. Let’s zoom in a little.

When effective alpha drops from 2 to 0 (and cost complexity reaches its maximum) the leaf impurity seems to decrease by about 50 points, which amounts to ~7 training RMSE. Given the extreme overfitting of our first RF, this seems a suitable range for hyperparameter tuning.
min impurity decrease
Min impurity decrease is a threshold on the reduction in error (impurity) a split must achieve: a node is only split if doing so decreases the impurity by at least this value. The impurity decrease is a weighted value, calculated as follows:
![Formula for calculating the impurity decrease [3]](https://towardsdatascience.com/wp-content/uploads/2020/11/1sWhgXtw_GGO-54npvSSLjg.png)
Impurity decreases are normally calculated for a single decision tree. So, let’s first extract the required data of a single tree.
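A sketch of that extraction; the exact columns the author kept aren’t shown, so the selection below (child ids, impurity and sample counts, which are the ingredients of the formula) is an assumption:

```python
# node data of a single tree, collected into a dataframe
tree = rf.estimators_[0].tree_

tree_df = pd.DataFrame({
    'left_child_id': tree.children_left,
    'right_child_id': tree.children_right,
    'impurity': tree.impurity,
    'n_node_samples': tree.n_node_samples,
    'weighted_n_node_samples': tree.weighted_n_node_samples,
})

print('shape', tree_df.shape)
tree_df.head()
```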
# returns shape (227, 5)

When looking at the first few rows you should notice that some child ids have the value -1. This indicates the node in question is a leaf, so no further splits were made below it.
With all the data in a dataframe we can calculate the min_impurity_decrease using the formula above.
It’s best to visualize the result to get an idea of suitable values for min_impurity_decrease.
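A sketch of both steps: applying the formula to the extracted node data (using the weighted sample counts, as the sklearn formula does) and plotting a histogram of the result; the binning is arbitrary:

```python
import numpy as np
import matplotlib.pyplot as plt

# the total (weighted) number of samples N sits in the root node (node 0)
N = tree_df.loc[0, 'weighted_n_node_samples']

def impurity_decrease(row):
    """Weighted impurity decrease of a node's split, per the formula above."""
    left_id, right_id = int(row['left_child_id']), int(row['right_child_id'])
    if left_id == -1:  # leaf node: no split, hence no impurity decrease
        return np.nan
    left, right = tree_df.loc[left_id], tree_df.loc[right_id]
    N_t = row['weighted_n_node_samples']
    return (N_t / N) * (row['impurity']
                        - (right['weighted_n_node_samples'] / N_t) * right['impurity']
                        - (left['weighted_n_node_samples'] / N_t) * left['impurity'])

tree_df['impurity_decrease'] = tree_df.apply(impurity_decrease, axis=1)

tree_df['impurity_decrease'].plot(kind='hist', bins=50, figsize=(13, 5))
plt.xlabel('impurity decrease')
plt.show()
```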

The impurity decrease is very right skewed, which can be explained by the first few nodes of the tree contributing a lot to decreasing the error. Checking the descriptives shows that 25% of the values for impurity decrease lie below 14.59, which seems a suitable upper bound for the min_impurity_decrease parameter.

Random search
After finishing exploring the parameters to tune, we can now set the appropriate ranges to evaluate.
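The ranges below are placeholders to show the mechanics; the author’s exact grid isn’t listed in the post, and the combination count printed below the snippet comes from that grid rather than from these placeholder ranges:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# placeholder ranges for the four selected hyperparameters
param_grid = {
    'max_depth': list(range(3, 25)),
    'min_samples_leaf': list(range(1, 100, 2)),
    'ccp_alpha': list(np.linspace(0, 2, 41)),
    'min_impurity_decrease': list(np.linspace(0, 14.5, 30)),
}

# total number of combinations in the grid
print(len(ParameterGrid(param_grid)))
```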
# returns
1571724
Note, if we wanted to test all possible combinations, we’d have to fit over 1.5 million models. A tedious task. Luckily, random search comes to the rescue: randomly sampling just 60 combinations already gives a 95% probability of finding a combination within the top 5% of the search space [8].
Increasing the number of iterations further increases the probability of finding an even better solution. In addition to using random search, I prefer to keep the number of trees low to speed up training times. This combination allows for a relatively fast search.
I’ve executed the search a couple of times and chose the best performing settings. There’s one more thing to discuss before showing the code: We need to create validation sets to validate the chosen hyperparameters.
Validation
Creating a validation set for this data requires taking into account one important factor. Engines which are included in the training set cannot be included in the validation set and vice versa.
Normally you would create a random split in your data where 80% belongs to the training set and 20% to the validation set. However, if we split randomly without taking the unit numbers into account, we might end up with part of the data of a single engine in both the train and validation sets. The model could then learn to interpolate between time cycles of engines it has already seen and make very accurate predictions on the validation set. On truly new data, however, model performance would suffer.
To prevent this form of ‘data leakage’ we must make sure all records of a single engine are assigned to either the training or validation set. To achieve this form of data splitting we’ll use GroupKFold.

Since training times are relatively low, I’ve set the number of iterations to 300. Next, the RandomizedSearchCV is instantiated with a bare RF, the parameters to sample from for hyperparameter tuning, and the GroupKFold, where groups are based on unit_nr. The random search takes slightly less than 15 minutes, after which we can convert the results to a dataframe for further inspection.
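A sketch of that search. The number of GroupKFold splits, the number of trees in the search forest and the preset max_features are assumptions; the negative RMSE scoring is consistent with the negative mean_test_score reported further down:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, RandomizedSearchCV

# group-aware CV: all records of an engine end up in the same fold
gkf = GroupKFold(n_splits=3)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(n_estimators=50, max_features='sqrt', random_state=42),
    param_distributions=param_grid,
    n_iter=300,
    scoring='neg_root_mean_squared_error',
    cv=gkf,
    n_jobs=-1,
)
search.fit(X_train, y_train, groups=train['unit_nr'])

# convert the results to a dataframe for further inspection
results = pd.DataFrame(search.cv_results_).sort_values('mean_test_score', ascending=False)
results.head(10)
```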

Inspecting the results helps to obtain an understanding of which hyperparameters perform well and could potentially be used to refine your search space (we’ll leave it as is for now).
Unfortunately, results aren’t reproducible, meaning restarting the kernel and running the notebook again doesn’t produce the same results. Something I will try to tackle in the next analysis. Having executed the search a couple of times, the best performing set of parameters I’ve found was:
{'min_samples_leaf': 11, 'min_impurity_decrease': 0.0, 'max_depth': 15, 'ccp_alpha': 0.125}, with a mean_test_score of -16.577.
Final model
Using those parameters, we can train our final model.
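Training the final model with the best parameters found; keeping the other settings (number of trees, max_features, random_state) the same as before is an assumption on my part:

```python
rf_final = RandomForestRegressor(n_estimators=100, max_features='sqrt',
                                 max_depth=15, min_samples_leaf=11,
                                 ccp_alpha=0.125, min_impurity_decrease=0.0,
                                 random_state=42)
rf_final.fit(X_train, y_train)

evaluate(y_train, rf_final.predict(X_train), 'train')
evaluate(y_test['RUL'], rf_final.predict(X_test), 'test')
```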
# returns:
# train set RMSE:13.95446880579081, R2:0.8820190156933622
# test set RMSE:20.61288923394374, R2:0.7520472702746352
And there you have it: a tweaked Random Forest without feature engineering (only some feature selection), but still very respectable results. The test RMSE of 20.612 is a 7.65% improvement over our baseline model. It may not seem like much, but in terms of RMSE this RF beats the time series model we fitted earlier on FD001 (which had an RMSE of 20.852), even though FD003 is a more complex dataset!
For the complete notebook you can check out my github page here. I would like to thank Maikel Grobbe for his inputs and reviewing my article. Next time, we’ll delve into the even more complex FD002 and neural networks using Tensorflow. Have any questions or remarks? Let me know in the comments below!
References
[1] The importance of problem framing for supervised predictive maintenance solutions
[2] F. O. Heimes, "Recurrent neural networks for remaining useful life estimation," 2008 International Conference on Prognostics and Health Management, Denver, CO, 2008, pp. 1–6, doi: 10.1109/PHM.2008.4711422.
[3] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
[4] https://towardsdatascience.com/how-to-visualize-a-decision-tree-from-a-random-forest-in-python-using-scikit-learn-38ad2d75f21c
[5] https://stackoverflow.com/questions/14784405/how-to-set-the-output-size-in-graphviz-for-the-dot-format
[6] https://en.wikipedia.org/wiki/Random_forest
[7] https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html
[8] A. Zheng, Evaluating Machine Learning Models, O’Reilly Media, Inc., 2015.