
Exploring NASA’s turbofan dataset
Although released over a decade ago, NASA’s turbofan engine degradation simulation dataset (CMAPSS) remains popular and relevant today. Over 90 new research papers have been published in 2020 so far [1]. These papers present and benchmark novel algorithms to predict Remaining Useful Life (RUL) on the turbofan datasets.
When I first started learning about predictive maintenance, I stumbled upon a few blog posts using the turbofan degradation dataset. Each covered Exploratory Data Analysis and a simple model to predict the RUL, but I felt two things were lacking:
- I never got a complete overview of how to apply different suitable techniques to the same problem
- The blog posts would only focus on the first dataset, leaving me guessing how the more complex challenges could be solved.
A few years later this seemed like a fun project for me to pick up. In a series of posts, I plan to showcase and explain multiple analysis techniques, while also offering a solution for the more complex datasets.
I’ve created an index below which I’ll update with links to new posts along the way:
- FD001 – Exploratory data analysis and baseline model (this article)
- FD001 – Updated assumption of RUL & Support Vector Regression
- FD001 – Time series analysis: distributed lag models
- FD001 – Survival analysis for predictive maintenance
- FD003 – Random forest (I’ve changed the order, read the article to find out why)
- Reproducible results primer for NNs in Jupyter Notebook
- FD002 – Lagged MLP & condition-based normalization
- FD004 – LSTM & wrap-up
The turbofan dataset features four datasets of increasing complexity (see table I) [2, 3]. The engines operate normally in the beginning but develop a fault over time. For the training sets, the engines are run to failure, while in the test sets the time series end ‘sometime’ before failure. The goal is to predict the Remaining Useful Life (RUL) of each turbofan engine.

Table I: overview of the four datasets
- FD001: 1 operating condition, 1 fault mode (HPC degradation)
- FD002: 6 operating conditions, 1 fault mode (HPC degradation)
- FD003: 1 operating condition, 2 fault modes (HPC and Fan degradation)
- FD004: 6 operating conditions, 2 fault modes (HPC and Fan degradation)
Datasets include simulations of multiple turbofan engines over time, each row contains the following information:
- Engine unit number
- Time, in cycles
- Three operational settings
- 21 sensor readings
What I find really cool about this dataset is that you can’t use any domain knowledge, as you don’t know what a sensor has been measuring. So, results are purely based on applying the correct techniques.
In today’s post we’ll focus on exploring the first dataset (FD001) in which all engines develop the same fault and have only one operating condition. In addition, we’ll create a baseline Linear Regression model so we can compare our modeling efforts of future posts.
Exploratory Data Analysis
Let’s get started by importing the required libraries, reading the data and inspecting the first few rows. Note that a few columns seem to have little to no deviation in their values. We’ll explore these further down below.
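As a sketch of that setup: the column names below are my own labels, since the raw files ship without a header, and I’m assuming the standard FD001 file names from the data repository sit in the working directory.

```python
import pandas as pd

# the raw files have no header row, so we define the column names ourselves
index_names = ['unit_nr', 'time_cycles']
setting_names = ['setting_1', 'setting_2', 'setting_3']
sensor_names = ['s_{}'.format(i) for i in range(1, 22)]
col_names = index_names + setting_names + sensor_names

def load_data(path):
    # the files are space-separated text without a header
    return pd.read_csv(path, sep=r'\s+', header=None, names=col_names)

# assuming the FD001 files sit in the working directory:
# train = load_data('train_FD001.txt')
# test = load_data('test_FD001.txt')
# y_test = pd.read_csv('RUL_FD001.txt', sep=r'\s+', header=None, names=['RUL'])
# train.head()
```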

Next, FD001 should contain data of one hundred engines; let’s inspect the unit number to verify this is the case. I chose to use pandas’ describe function so we also get an understanding of the distribution. While we’re at it, let’s also inspect the time cycles to see what we can learn about the number of cycles the engines ran on average before breaking down.
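A minimal sketch of those describe calls, shown on a tiny stand-in dataframe so the example runs on its own (the real train dataframe holds 100 engines):

```python
import pandas as pd

# tiny stand-in for the real train dataframe: two engines, run to failure
train = pd.DataFrame({
    'unit_nr':     [1, 1, 1, 2, 2],
    'time_cycles': [1, 2, 3, 1, 2],
})

# count, mean, std and quantiles of both columns in a single call
print(train[['unit_nr', 'time_cycles']].describe())

# max time_cycles per engine = the cycle at which that engine broke down
print(train.groupby('unit_nr')['time_cycles'].max().describe())
```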

When we inspect the descriptive statistics of unit_nr, we can see the dataset has a total of 20631 rows, and unit numbers start at 1 and end at 100 as expected. What’s interesting is that the mean and quantiles don’t align neatly with the descriptive statistics of a vector from 1–100. This is because each unit has a different max time_cycles and thus a different number of rows. When inspecting the max time_cycles, you can see the engine which failed earliest did so after 128 cycles, whereas the engine which operated longest broke down after 362 cycles. The average engine breaks down between 199 and 206 cycles; however, the standard deviation of 46 cycles is rather large. We’ll visualize this further down below to get an even better understanding.
The dataset description also indicates the turbofans run at a single operating condition. Let’s check the settings for verification.

Looking at the standard deviations of settings 1 and 2, they aren’t completely stable. The fluctuations are so small however, that no other operating conditions can be identified.
Finally, we’ll inspect the descriptive statistics of the sensor data, looking for indicators of signal fluctuation (or the absence thereof).

By looking at the standard deviation, it’s clear sensors 1, 10, 18 and 19 do not fluctuate at all; these can be safely discarded as they hold no useful information. Inspecting the quantiles indicates sensors 5, 6 and 16 have little fluctuation and require further inspection. Sensors 9 and 14 have the highest fluctuation; however, this does not mean the other sensors can’t hold valuable information.
Computing RUL
Before we start plotting our data to continue our EDA, we’ll compute a target variable for Remaining Useful Life (RUL). The target variable will serve two purposes:
- It will serve as our X-axis while plotting sensor signals, allowing us to easily interpret changes in the sensor signals as the engines near breakdown.
- It will serve as target variable for our supervised machine learning models.
Without further information about the true RUL of the engines in the training set, we’ll have to come up with an estimate of our own. We’ll assume the RUL decreases linearly over time and has a value of 0 at the last time cycle of the engine. This assumption implies the RUL would be 10 at 10 cycles before breakdown, 50 at 50 cycles before breakdown, etc.
Mathematically we can use max_time_cycle - time_cycle to compute our desired RUL. Since we want to take the max_time_cycle of each engine into account, we’ll group the dataframe by unit_nr before computing max_time_cycle. The max_time_cycle is then merged back into the dataframe, allowing easy calculation of RUL by subtracting the columns. Afterwards we drop max_time_cycle as it’s no longer needed and inspect the first few rows to verify our RUL calculation.
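The steps above can be sketched as follows, again on a small stand-in dataframe; the same lines apply unchanged to the real training data:

```python
import pandas as pd

# stand-in training data: engine 1 fails after 3 cycles, engine 2 after 2
train = pd.DataFrame({
    'unit_nr':     [1, 1, 1, 2, 2],
    'time_cycles': [1, 2, 3, 1, 2],
})

# max time_cycles of each engine
max_cycle = train.groupby('unit_nr')['time_cycles'].max().reset_index()
max_cycle.columns = ['unit_nr', 'max_time_cycle']

# merge it back in, compute RUL, then drop the helper column
train = train.merge(max_cycle, on='unit_nr', how='left')
train['RUL'] = train['max_time_cycle'] - train['time_cycles']
train = train.drop('max_time_cycle', axis=1)
print(train)
# engine 1 gets RUL 2, 1, 0; engine 2 gets RUL 1, 0
```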

Plotting
Plotting is always a good idea to develop a better understanding of your dataset. Let’s start by plotting the histogram of max RUL to understand its distribution.
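A possible sketch of that histogram, using a handful of stand-in lifetimes; 128 and 362 are the real minimum and maximum from the descriptive statistics, the values in between are made up for illustration:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# stand-in max-RUL (lifetime) value per engine
df_max_rul = pd.Series([128, 180, 199, 206, 250, 362], name='max_RUL')

df_max_rul.hist(bins=15, figsize=(15, 7))
plt.xlabel('max_RUL')
plt.ylabel('frequency')
plt.savefig('rul_histogram.png')
plt.close()
```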

The histogram reconfirms most engines break down around 200 cycles. Furthermore, the distribution is right skewed, with few engines lasting over 300 cycles.
Below I’ll show the code used to plot the signal of each sensor. Due to the large number of engines, it’s not feasible to plot every engine for every sensor; the graphs would no longer be interpretable with so many lines in one plot. Therefore, I chose to plot only the engines whose unit_nr is divisible by 10. We reverse the X-axis so RUL decreases along it, with a RUL of zero indicating engine failure. Due to the large number of sensors, I’ll discuss a few graphs which are representative of the whole set. Remember, based on our descriptive statistics, we should definitely inspect the graphs of sensors 5, 6 and 16.
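A sketch of such a plotting function, with synthetic stand-in data so the example is self-contained; the number of units, their lifetimes and the upward drift of s_2 are made up for illustration:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# synthetic stand-in for the train dataframe: 30 engines whose sensor s_2
# drifts upward slightly as RUL decreases
rng = np.random.default_rng(0)
rows = []
for unit_nr in range(1, 31):
    life = int(rng.integers(120, 200))
    for rul in range(life, -1, -1):
        rows.append({'unit_nr': unit_nr, 'RUL': rul,
                     's_2': 642 + 0.01 * (life - rul) + rng.normal(0, 0.05)})
train = pd.DataFrame(rows)

def plot_sensor(sensor_name):
    plt.figure(figsize=(13, 5))
    for i in train['unit_nr'].unique():
        if i % 10 == 0:  # plot every 10th engine to keep the graph readable
            subset = train.loc[train['unit_nr'] == i]
            plt.plot(subset['RUL'], subset[sensor_name])
    plt.xlim(250, 0)  # reverse the X-axis: RUL decreases towards failure at 0
    plt.xticks(np.arange(0, 275, 25))
    plt.ylabel(sensor_name)
    plt.xlabel('Remaining Useful Life')
    plt.savefig(sensor_name + '.png')
    plt.close()

plot_sensor('s_2')
```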

The graphs of sensors 1, 10, 18 and 19 look similar; the flat line indicates these sensors hold no useful information, which reconfirms our conclusion from the descriptive statistics. Sensors 5 and 16 also show a flat line; these can be added to the list of sensors to exclude.

Sensor 2 shows a rising trend, a similar pattern can be seen for sensors 3, 4, 8, 11, 13, 15 and 17.

The readings of sensor 6 peak downwards at times, but there doesn’t seem to be a clear relation to the decreasing RUL.

Sensor 7 shows a declining trend, which can also be seen in sensors 12, 20 and 21.

Sensor 9 has a similar pattern as sensor 14.
Based on our Exploratory Data Analysis we can determine sensors 1, 5, 6, 10, 16, 18 and 19 hold no information related to RUL as the sensor values remain constant throughout time. Let’s kick-off our model development with a baseline Linear Regression model. The model will use the remaining sensors as predictors.
Baseline Linear Regression
First, we’ll define a small function to evaluate our models. I chose to include Root Mean Squared Error (RMSE), as it gives an indication of how many time cycles the predictions are off on average, and Explained Variance (or R² score), to indicate what proportion of our dependent variable can be explained by the independent variables we use.
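A minimal version of such a function might look like this, assuming scikit-learn’s metrics; the name evaluate is my own choice:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(y_true, y_hat, label='test'):
    # RMSE: roughly how many time cycles the predictions are off on average
    rmse = np.sqrt(mean_squared_error(y_true, y_hat))
    # R2: proportion of variance in the target explained by the predictors
    variance = r2_score(y_true, y_hat)
    print('{} set RMSE:{}, R2:{}'.format(label, rmse, variance))
    return rmse, variance

# sanity check: a perfect prediction yields RMSE 0 and R2 1
evaluate([10.0, 20.0, 30.0], [10.0, 20.0, 30.0], 'demo')
```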
We’ll drop unit_nr, time_cycles, the settings and the sensors which hold no information. The RUL column of the training set is stored in its own variable. For our test set we drop the same columns. In addition, we are only interested in the last time cycle of each engine in the test set, as we only have true RUL values for those records.
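A sketch of this preparation step; the helper names prep_train and prep_test are my own, and drop_sensors lists the sensors we flagged during the EDA:

```python
import pandas as pd

# sensors flagged during EDA as holding no information
drop_sensors = ['s_1', 's_5', 's_6', 's_10', 's_16', 's_18', 's_19']
setting_names = ['setting_1', 'setting_2', 'setting_3']
drop_labels = ['unit_nr', 'time_cycles'] + setting_names + drop_sensors

def prep_train(train):
    # everything except the dropped columns becomes a predictor
    X_train = train.drop(drop_labels + ['RUL'], axis=1)
    y_train = train['RUL']  # target in its own variable
    return X_train, y_train

def prep_test(test):
    # keep only the last time cycle of each engine, then drop the same columns
    last = test.groupby('unit_nr').last().reset_index()
    return last.drop(drop_labels, axis=1)
```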
Setting up for linear regression is quite straightforward. We instantiate the model by calling its constructor and assigning it to the ‘lm’ variable. Next, we fit the model by passing our ‘X_train’ and ‘y_train’. Finally, we predict on both the train and test set to get the full picture of how our model behaves on the data presented to it.
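The flow itself is only a few lines. Here is a self-contained sketch on toy numbers; the metrics quoted below come from running the same steps on the full dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# toy stand-in data: the target decreases linearly with the single feature
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([80.0, 60.0, 40.0, 20.0])

lm = LinearRegression()   # instantiate the model
lm.fit(X_train, y_train)  # fit on the training data

# predict on both train and test to get the full picture
y_hat_train = lm.predict(X_train)
X_test = np.array([[2.5]])
y_hat_test = lm.predict(X_test)
```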
# returns
# train set RMSE:44.66819159545453, R2:0.5794486527796716
# test set RMSE:31.952633027741815, R2:0.40877368076574083
Note, the RMSE is lower on the test set, which is counter-intuitive, as a model commonly performs better on the data it has seen during training.
A possible explanation lies in the computed RUL of the training set ranging well into the 300s. Looking at the trend in the graph below, the higher values of the linearly computed RUL do not seem to correlate well with the sensor signal. Since the RUL predictions of the test set are closer to failure, and the correlation between lower target RUL and the sensor signal is clearer, it may be easier for the model to make accurate predictions on the test set. The large difference between train and test RMSE can be seen as a flaw of our RUL assumption and is something we’ll try to improve in future posts. For now, we have concluded our EDA and baseline model.

In today’s post we explored the first dataset of NASA’s turbofan degradation simulation dataset and created a baseline model with a test RMSE of 31.95. I would like to thank Maikel Grobbe and Wisse Smit for their inputs and reviewing my article. In the next post we’ll look at how to improve the computed RUL to make predictions more accurate. In addition, we’ll develop a Support Vector Regression to push performance even further.
If you have any questions or remarks, please leave them in the comments below. For the complete notebook you can check out my github page here.
References:
[1] Papers published on NASA’s CMAPSS data in 2020 so far: Google Scholar search, accessed on 2020–08–08
[2] A. Saxena, K. Goebel, D. Simon, and N. Eklund, “Damage Propagation Modeling for Aircraft Engine Run-to-Failure Simulation”, in Proceedings of the 1st International Conference on Prognostics and Health Management (PHM08), Denver, CO, Oct 2008.
[3] NASA’s official data repository