
The family of linear models includes ordinary linear regression, Ridge regression, Lasso regression, SGD regression, and so on. The coefficients of linear models are commonly interpreted as the feature importance of the corresponding variables. In general, feature importance refers to how useful a feature is at predicting a target variable, for example, how useful age_of_a_house is at predicting house price.
This article summarizes and explains four often neglected but crucial pitfalls of using coefficients of linear models as feature importance:
- Standardized dataset or not
- Linear models have different opinions
- Curse of highly correlated features
- Stability check with cross-validation
Linear regression models take the form

y = w₀ + w₁x₁ + w₂x₂ + … + wₙxₙ

where y denotes the target value being predicted, the wᵢ are the coefficients, and the xᵢ are the variables or features.
For the sake of simplicity, let's use a naive example: predict price_of_a_house based on three features: age_of_a_house, num_of_rooms, and square_feet.

The dataset looks as follows:
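A toy table like the following works for the walkthrough. All values below are made up purely for illustration, and the column names simply mirror the features of our example:

```python
import pandas as pd

# Made-up values, purely for illustration
df = pd.DataFrame({
    "age_of_a_house":   [5, 12, 30, 8, 20, 15],
    "num_of_rooms":     [3, 4, 2, 5, 3, 4],
    "square_feet":      [1500, 2200, 900, 3000, 1800, 2000],
    "price_of_a_house": [300_000, 420_000, 180_000, 550_000, 350_000, 380_000],
})
print(df)
```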

Assume that we feed the dataset to a linear model, train the model, and make predictions. At this point, we have values for the coefficients:
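A minimal scikit-learn sketch of that step, continuing with the toy df above (the printed coefficients depend on the made-up values and will not match the article's numbers exactly):

```python
from sklearn.linear_model import LinearRegression

# Split the toy table into features and target
X = df[["age_of_a_house", "num_of_rooms", "square_feet"]]
y = df["price_of_a_house"]

model = LinearRegression().fit(X, y)
print(model.coef_)  # one coefficient per feature, like the article's [20, 49, 150]
```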

Now, we come to the original question: what is the feature importance of our features age_of_a_house, num_of_rooms, and square_feet? How useful are they in predicting price_of_a_house? Can we say that their importance is equal to the coefficients [20, 49, 150]?
Standardized dataset or not
The answer is: ONLY IF the dataset was standardized before training can the coefficients be used as feature importance.
For example, if we applied a standard scaler to the raw dataset and then fit the model, we can say that the feature importance of age_of_a_house is 20.
The reason is that variables often sit at different scales. num_of_rooms is mostly in the range [1, 10], while square_feet can be in [500, 4000]. In this case, the variables need to be scaled to the same unit of measure; only then do the coefficients of a linear model serve as feature importance.
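A minimal sketch of that with scikit-learn, assuming the X and y defined in the earlier sketch: put a StandardScaler in front of the regressor, so the coefficients are learned on comparably scaled features.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Scale every feature to zero mean and unit variance before fitting
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X, y)  # X, y as defined in the earlier sketch

# These coefficients are now comparable across features
print(pipe.named_steps["linearregression"].coef_)
```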
Linear models have different opinions
Different linear models can have totally different opinions about feature importance. In the above example, an ordinary linear model has coefficients [20, 49, 150]. A ridge regression model could have coefficients [1820, 23, 90] that differ significantly from those of ordinary regression. In practice, I would suggest using an ensemble strategy to combine the views of different models.
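A quick sketch of such a comparison, again using the toy X and y from above (the alpha values are arbitrary and the printed numbers will not match the article's examples; real projects should tune the regularization strength):

```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Fit each model on standardized features and compare their coefficients
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=1.0)):
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X, y)  # X, y from the earlier sketch
    print(type(model).__name__, pipe[-1].coef_)
```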
Curse of highly correlated features
The variables num_of_rooms and square_feet are correlated: you can expect more rooms in a bigger house. Unfortunately, we cannot simply remove one of them. The effects of correlated variables, especially collinear ones, cannot be teased apart easily. Highly correlated variables can also induce unstable coefficient values when the input dataset changes.
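A simple way to spot such pairs is to look at the pairwise correlation of the features, for example with the toy X from above:

```python
# Pairwise Pearson correlation between features;
# values close to 1 or -1 signal strongly correlated pairs
print(X.corr())
```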
Stability check with cross-validation
To check the stability of coefficients, a typical approach is to apply cross-validation and track the coefficient values across folds. If the coefficients change significantly between folds, we should be cautious about using them as feature importance.
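A minimal sketch of that check with scikit-learn's cross_validate, keeping the estimator fitted on each fold (cv=3 only because the toy dataset is tiny; use more folds on real data):

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

cv_results = cross_validate(
    make_pipeline(StandardScaler(), LinearRegression()),
    X, y,                   # X, y from the earlier sketch
    cv=3,                   # tiny toy dataset; use more folds on real data
    return_estimator=True,  # keep the model fitted on each fold
)

# Coefficients of the model fitted on each fold
coefs = np.array([est[-1].coef_ for est in cv_results["estimator"]])
print(coefs.mean(axis=0))  # average coefficient per feature
print(coefs.std(axis=0))   # spread across folds; large values are a warning sign
```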
In this article, we used a simplified house price prediction as an example to explain four often neglected but crucial pitfalls of using coefficients as feature importance in linear models. If there is only one thing you take away, I hope it is this checklist 👍 🦖🥊💎🏅 :
- Standardize your dataset
- Different models could have different ideas about feature importance
- Be cautious of highly correlated variables
- Apply cross-validation to check the stability of coefficients across folds
Sign up for my Udemy course 🦞:
Recommender System With Machine Learning and Statistics
