Predicting vehicle fuel efficiency

Comparing XGBoost — using GridSearchCV — and PyTorch neural networks to fit a regression model and predict mileage from car characteristics

Ranganath Venkataraman
Towards Data Science


Photo by Mika Baumeister on Unsplash

My previous posts focused on the application of machine learning and neural networks in datasets from the oil industry. These efforts gave me the confidence to successfully implement XGBoost and artificial neural networks to predict a major safety hazard at the oil refinery where I work. A major success that I’m very proud of!

In order to exercise some versatility, I wanted to use PyTorch and GridSearchCV in this application and first wanted to practice on a simpler and smaller dataset — namely the Auto MPG dataset from the UCI repository.

As before, I’m laying out a step-wise approach to keep the application systematic:

Step 1 = understand the goal and the big picture before diving into modeling.

Step 2 = load the data and get high-level insight into it.

Step 3 = further visualization to explore the data. Goal: get an early inkling of the most influential features.

Step 4 = plan for modeling. The key here is to get the data ready for training and testing models.

Step 5 = execute modeling and evaluate performance. Complete sensitivity analysis as desired.

Step 6 = draw conclusions from the work to benefit the end user.

Snippets of code are below, with the full code in my GitHub repo.

Step 1: our goal is to predict mileage for cars based on various characteristics. Through the modeling effort, we can understand what makes a car efficient and how to achieve a target mileage.

Step 2: reading the input data into Python and using the .describe() and .head() functions to get key statistics and a first look at the data. I also want to name the columns, since column names aren’t included in the raw data.
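As a rough sketch of that loading step (the file name auto-mpg.data and the use of whitespace-separated fields are assumptions about how the raw UCI file is stored locally):

import numpy as np   # used later when imputing missing values (np.nan)
import pandas as pd

# Column names per the UCI Auto MPG documentation; the raw file has no header row.
cols = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
        'acceleration', 'model year', 'origin', 'car name']

# '?' marks missing horsepower values in the raw file; fields are whitespace-separated.
Data = pd.read_csv('auto-mpg.data', names=cols, na_values='?', sep=r'\s+')

print(Data.describe())
print(Data.head())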

Figure 1: first look at data, with some high level insights

High-level takeaways:

  • The dataset has variables covering widely different numeric ranges; if we use algorithms like kNN, this suggests the need for scaling.
  • We have a mix of continuous and discrete variables, one of which is non-numeric. This suggests the need for some kind of encoding.
  • There are 398 entries with an average mileage of 23 mpg, 3 possible origins, and 5 possible model years.

A separate check — not shown here — shows that the dataset has 6 empty values for the ‘horsepower’ feature.
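With the loading sketch above, where the ‘?’ placeholders were read in as NaN, that check is a one-liner:

# Count missing values per column; only 'horsepower' should show any.
print(Data.isna().sum())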

Step 3a) plotting correlations as a Seaborn heatmap
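A minimal sketch of how a heatmap like Figure 2 can be generated (the figure size and color map are my choices):

import seaborn as sns
import matplotlib.pyplot as plt

# Pearson correlations between the numeric columns, rendered as an annotated heatmap.
corr = Data.select_dtypes(include='number').corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.show()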

Figure 2: Seaborn heatmap of Pearson correlation coefficients

Conclusions to draw from above?

  • Car weight and displacement have the strongest inverse correlation with mileage. This lines up well with the intuition that a big Hummer isn’t the most efficient user of gasoline.
  • Horsepower and number of cylinders are also strongly inversely correlated with mileage, again lining up with the intuition that a fast sports car needs more gasoline than a sedan.
  • Car origin and model year are categorical numerical variables — let’s visualize these using barplots.

For modeling purposes, I’ll first use all of these features. However, I could drop some of cylinders, displacement, or car weight (given their high mutual correlation) if I wanted a more compact feature set to guard against potential overfitting.

Step 3b) Let’s use barplots to learn more about the categorical numerical features:
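One way to produce the counts shown in Figure 3, assuming the column names from the loading step:

import seaborn as sns
import matplotlib.pyplot as plt

# Frequency of each origin and each model year in the dataset.
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.countplot(x='origin', data=Data, ax=axes[0])
sns.countplot(x='model year', data=Data, ax=axes[1])
plt.show()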

Figure 3: histograms of car origin and car model year

Conclusions to draw from the Figure 3 histograms: cars from origin 1 are far more represented, suggesting that mpg predictions will be most reliable for that origin, whatever origin may stand for, e.g. country of manufacture. We also notice that model years are spread over a 12-year span, with 1970, 1976, and 1982 more represented than others.

Should these numerical categorical variables be encoded?

Figure 4: barplots of car origin and model year vs. mpg

Conclusions from the Figure 4 barplots: there is a marginal increase in mpg with car origin number; the trend is weaker for model year. For now, I won’t encode these variables and am fine with leaving them in as-is.

Step 3c) The analysis above got me curious about the dependent variable, i.e. the target, i.e. what we’re trying to predict. Is it skewed in any direction? Figure 5 below shows right-skewness of the target, suggesting that we might want to apply a log transform before modeling.

Figure 5: histogram of mpg
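If we did decide to correct that skew, the transform itself is short; the y_log name below is mine, and it is kept separate from the un-transformed target actually used in the models that follow:

import numpy as np

print(Data['mpg'].skew())     # a positive value confirms the right-skew seen in Figure 5
y_log = np.log(Data['mpg'])   # candidate log-transformed target, not used further here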

Step 4: in addition to doing some more data preparation and creating train/test sets for modeling, I’ll also set up GridSearchCV.

4a) imputing missing data and creating training and testing sets. Also extracting and one-hot encoding the car make and dropping the car model, since the latter is extraneous detail.

# Extract the car make (first word of 'car name'), one-hot encode it, and drop the raw name column.
Data['car make'] = Data['car name'].apply(lambda x: x.split()[0])
Data.drop(columns=['car name'], inplace=True)
Data = pd.get_dummies(Data, columns=['car make'])
# Creating features x and target y.
x_no_log = Data.drop(columns=['mpg'])
y_no_log = Data['mpg']
# Imputing missing horsepower values with the median.
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='median')
x_no_log['horsepower'] = imp.fit_transform(x_no_log[['horsepower']]).ravel()
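The train/test split itself isn’t in the snippet above; a minimal version consistent with the xtrain/xtest/ytrain/ytest names used in Step 5 could look like this (the 80/20 split and the random seed are my assumptions):

from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for final evaluation; fixed seed for reproducibility.
xtrain, xtest, ytrain, ytest = train_test_split(
    x_no_log, y_no_log, test_size=0.2, random_state=42)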

4b) setting up GridSearchCV, which allows us to loop through hyperparameter combinations to find the optimal one per the selected scoring metric (mean squared error in my case). The GitHub repo has a similar exercise for RandomForest.

xgb_params = {'nthread': [4],
              'learning_rate': [0.03, 0.05, 0.07],
              'max_depth': [5, 6, 7],
              'min_child_weight': [4],
              'subsample': [0.7],
              'colsample_bytree': [0.7],
              'n_estimators': [500, 1000]}
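The grid is swept over a base XGBoost regressor, which the GridSearchCV call in Step 5 refers to as xgbr; a plausible setup is below (the squared-error objective is an assumption):

from xgboost import XGBRegressor

# Base regressor whose hyperparameters GridSearchCV will tune.
xgbr = XGBRegressor(objective='reg:squarederror')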

4c) setting up PyTorch: the code below imports the necessary packages, sets up the neural network with the input dimensions, the number of nodes in the hidden layer, and the output node, and then builds the model.

I’m also declaring mean squared error as the loss metric to optimize, the same metric I’ll use when evaluating XGBoost’s and RandomForest’s performance on the test set.

Finally, I set up the Adam optimization algorithm to drive gradient descent, as seen in Figure 6 below.

Figure 6: excerpt of code with PyTorch
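Figure 6 is an image in the original post; a rough sketch of what that setup might look like is below. The torch_model, loss_func, and optimizer names match how they are used in Step 5, but the single 64-node hidden layer, the ReLU activation, and the learning rate are my assumptions.

import torch
import torch.nn as nn

n_features = xtrain.shape[1]   # input dimension = number of predictor columns
n_hidden = 64                  # assumed hidden-layer width

# Feed-forward network: input layer -> hidden layer (ReLU) -> single regression output.
torch_model = nn.Sequential(
    nn.Linear(n_features, n_hidden),
    nn.ReLU(),
    nn.Linear(n_hidden, 1),
)

loss_func = nn.MSELoss()                                         # mean squared error loss
optimizer = torch.optim.Adam(torch_model.parameters(), lr=0.01)  # Adam optimizer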

5) With all the setup complete, it’s time to actually run the models and evaluate the results.

5a) I’ll first use GridSearchCV and the previously defined hyperparameter grid to find the best-performing XGBoost model based on performance on the training set (a similar exercise for RandomForest is in the GitHub repo).

5b) I’ll then train the PyTorch neural network created earlier.

from sklearn.model_selection import GridSearchCV

gsXGB = GridSearchCV(xgbr, xgb_params, cv=7, scoring='neg_mean_squared_error',
                     refit=True, n_jobs=5, verbose=True)
gsXGB.fit(xtrain, ytrain)
XGB_best = gsXGB.best_estimator_   # best XGBoost model found by the grid search

# Training the PyTorch network.
train_error = []
iters = 600
X_train_t = torch.FloatTensor(xtrain.values)                 # converting numpy arrays to torch tensors
Y_train_t = torch.FloatTensor(ytrain.values).reshape(-1, 1)
for i in range(iters):
    y_hat = torch_model(X_train_t)        # forward pass
    loss = loss_func(y_hat, Y_train_t)    # training-set mean squared error
    train_error.append(loss.item())
    loss.backward()                       # backpropagate gradients
    optimizer.step()                      # update weights
    optimizer.zero_grad()                 # reset gradients for the next iteration

5c) With model training complete, I can now evaluate performance on testing data.

# Evaluating the best-performing XGBoost model on testing data.
from sklearn.metrics import explained_variance_score, mean_absolute_error, mean_squared_error
ypred = XGB_best.predict(xtest)
print(explained_variance_score(ytest, ypred))
print(mean_absolute_error(ytest, ypred))
print(mean_squared_error(ytest, ypred))   # MSE is the default output

# Evaluating the PyTorch model on testing data.
X_test_t = torch.FloatTensor(xtest.values)   # converting numpy array to torch tensor
ypredict = torch_model(X_test_t)
print(mean_squared_error(ytest, ypredict.detach().numpy()))

The mean squared error of the XGBoost model arrived at through hyperparameter tuning is 0.0117. Given a mean of 23.5 mpg in the original dataset, this can be interpreted as an accuracy of roughly 99.9%.

The mean squared error of the PyTorch neural network is 0.107. Using the approach above, this translates to an accuracy of roughly 99.5%.

6) What have we accomplished for our client? We have a model that predicts fuel mileage for a variety of cars; our client can use this to plan for cars that achieve desired levels of fuel efficiency.

We can also inform our client that, per Figure 7 below, weight is the most influential variable in predicting mileage, with acceleration the second most; horsepower, displacement, and acceleration are relatively close to each other in impact.

Figure 7: feature importance in predicting mileage
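For reference, a chart like Figure 7 can be pulled from the tuned model as sketched below; XGB_best and xtrain come from the earlier steps, and the plotting details are my choices:

import pandas as pd
import matplotlib.pyplot as plt

# Pair each training column with its importance score from the tuned XGBoost model.
importances = pd.Series(XGB_best.feature_importances_, index=xtrain.columns)
importances.sort_values().plot(kind='barh', figsize=(8, 10))
plt.xlabel('feature importance')
plt.show()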

With this detail, our client can plan future car production or purchases.

As always, I welcome any feedback.

