
From Econometrics to Machine Learning

Why econometrics should be part of your skills

As a data scientist with a master's degree in econometrics, I took some time to understand the subtleties that make machine learning a discipline distinct from econometrics. I would like to walk you through these subtleties, which are not obvious at first sight and which puzzled me throughout my studies.

First of all… what is machine learning? And what is econometrics?


Econometrics is the application of statistical methods to economic data in order to give empirical content to economic relationships. More precisely, it is "the quantitative analysis of actual economic phenomena based on the concurrent development of theory and observation, related by appropriate methods of inference".


Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to perform the task.

Very well. It would seem that both need data, both use statistical models and both make inferences; according to these definitions, machine learning even seems to deal with broader issues than the economy alone. So why does econometrics still exist?! This is the question I asked myself when I discovered machine learning, at about the same time as I began my econometric studies.

As a future econometrician I needed to juggle numbers perfectly, have a solid background in [Statistics](https://en.wikipedia.org/wiki/Statistics), be an expert in [linear algebra](https://en.wikipedia.org/wiki/Linear_algebra) and [mathematical optimization](https://en.wikipedia.org/wiki/Mathematical_optimization), and finally have the computer skills to play with data. These skills are used to understand, demonstrate and apply regression, classification, clustering and time-series algorithms. During that year I studied in depth algorithms such as [Linear Regression](https://en.wikipedia.org/wiki/Linear_regression), [Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression), [K-means](https://en.wikipedia.org/wiki/K-means_clustering), [ARIMA](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average), [VAR](https://en.wikipedia.org/wiki/Vector_autoregression), etc. Wait… these algorithms are also used in machine learning!

From theoretical to empirical efficiency

A fundamental difference between machine learning and econometrics lies in their theoretical basis. Econometrics has a solid foundation in mathematical statistics and probability theory: its algorithms are mathematically rigorous, with demonstrable and attractive properties, and they are mainly evaluated on the robustness of this theoretical base.

With machine learning, mathematics is of course not absent, but it is there to explain the behaviour of an algorithm rather than to demonstrate its reliability and attractive properties. These algorithms are mainly evaluated on their empirical effectiveness. A very revealing example is XGBoost, which owes its success to its dominance in several machine learning competitions rather than to mathematical proofs of its properties.

From exactness to approximation

Another difference is that econometrics has a single, exact solution: given a specified model and a dataset, a parametric regression's parameters are computed with an algebraic formula. When certain assumptions hold, the [best linear unbiased estimator](https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem) (BLUE) of the coefficients is the ordinary least squares (OLS) estimator. Here "best" means having the lowest variance of the estimate, compared to other unbiased linear estimators.
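For example, for a simple linear model the OLS coefficients come directly from the closed-form formula β̂ = (XᵀX)⁻¹Xᵀy. Here is a toy numpy sketch on simulated data (the data and numbers are purely illustrative):

import numpy as np

# Toy data: y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 + 3 * x + rng.normal(scale=0.5, size=100)

# OLS closed form: beta_hat = (X'X)^-1 X'y, computed exactly in one step
X = np.column_stack([np.ones_like(x), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # close to the true values [2, 3]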

Most machine learning algorithms, by contrast, are far too complex to be described by a single mathematical formula. Their solutions are determined algorithmically by an iterative procedure, the training phase, whose goal is to find the solution that best fits the data; the solution found by a machine learning algorithm is therefore approximate, and only most likely optimal.
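By contrast, here is a toy sketch of the same linear model fitted iteratively by gradient descent, the kind of training loop machine learning relies on (it reuses the X and y from the previous sketch):

# Start from an arbitrary point and improve it iteratively: the training phase
beta = np.zeros(2)
learning_rate = 0.1                               # a meta-parameter, chosen by hand here
for _ in range(500):
    gradient = 2 * X.T @ (X @ beta - y) / len(y)  # gradient of the mean squared error
    beta -= learning_rate * gradient
print(beta)   # approximately equal to beta_hat, but only approximately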

From parametric to non-parametric models

Econometric models (i.e. parametric, most of the time) are based on economic theory. Traditional statistical inference tools (such as the maximum likelihood method) are then used to estimate the values of a parameter vector θ in a parametric model m_θ. Asymptotic theory then plays an important role (Taylor expansions, the law of large numbers, the central limit theorem, etc.).
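As a tiny illustration, here is a sketch of maximum-likelihood estimation of the parameter vector θ = (μ, σ) of a normal distribution; the use of scipy here is an illustrative assumption, not something the rest of the article relies on:

import numpy as np
from scipy.optimize import minimize

# Toy sample assumed to follow N(mu, sigma^2)
rng = np.random.default_rng(1)
sample = rng.normal(loc=5.0, scale=2.0, size=1000)

def neg_log_likelihood(theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)          # parameterize log(sigma) to keep sigma positive
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (sample - mu) ** 2 / sigma**2)

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
print(result.x[0], np.exp(result.x[1]))   # close to the true values 5 and 2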

In machine learning, on the other hand, models are often non-parametric and built almost exclusively from the data (no underlying distributional assumptions are made), and the meta-parameters (tree depth, penalty parameter, etc.) are optimized by cross-validation, grid search or any other hyper-parameter optimization algorithm.
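For instance, the depth of a decision tree can be tuned by a cross-validated grid search; a minimal sklearn sketch, where the model and the grid are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Try several depths and keep the one with the best cross-validated score
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 3, 5, 10]},
                      cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)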

From theoretical to empirical validation

You will have understood it by now: the pattern is the same as previously. Econometrics relies on rigorous mathematical tests to validate a model; we commonly talk about its [goodness of fit](https://en.wikipedia.org/wiki/Goodness_of_fit). It is evaluated through hypothesis testing, checks of the normality of residuals and comparisons of sample distributions. We also talk about the [R²](https://en.wikipedia.org/wiki/Coefficient_of_determination), which is the proportion of the variance in the dependent variable that is predictable from the independent variable(s), the [AIC|BIC](https://en.wikipedia.org/wiki/Akaike_information_criterion), which evaluate the quality of each model relative to the other models, and variable-level evaluation through the [p-value](https://en.wikipedia.org/wiki/P-value).

The evaluation of a machine learning model depends on its predictions: the underlying idea is that if the model predicts well, then it has successfully learned the hidden patterns in the data. To ensure that the model has not overfitted, the dataset is split into a training set and a test set, then a [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) step verifies the generalization power of the model and that there is no bias in the data split. Finally, we use KPIs that measure the gap with reality, such as the [RMSE](https://en.wikipedia.org/wiki/Root-mean-square_deviation), the [MAE](https://en.wikipedia.org/wiki/Mean_absolute_error) or the [Accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision).
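In sklearn terms, this validation pipeline might look like the following sketch; the dataset and the regressor are illustrative choices:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_diabetes(return_X_y=True)

# Hold out a test set to detect overfitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Cross-validation checks the generalization power of the model
print("CV score:", cross_val_score(model, X_train, y_train, cv=5).mean())

# KPIs measuring the gap with reality on unseen data
pred = model.predict(X_test)
print("MAE :", mean_absolute_error(y_test, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))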

From theoretical convergence to purpose divergence

Both econometrics and machine learning try to define a function of a set of predictor variables that models a predicted variable:

  • in econometrics, the model is written y = β0 + β1·x1 + … + βp·xp + ε, where the ε are realizations of i.i.d. random variables of law N(0, σ²), also called residuals;
  • in machine learning, the model is simply written y = f(x).

On paper, at this stage, the two seem to converge, but it is in their aims that they diverge. The purpose of machine learning is, in most cases, to predict y, while the purpose of econometrics is to estimate the β of each predictor.

The main purpose of econometrics is not to predict but to quantify an economic phenomenon

From theory to practice!

To see these differences in practice, let's start with a classic and widely used econometric model, linear regression. We will compare the results of our modeling using the [sklearn](https://scikit-learn.org/stable/index.html) library, which is mainly aimed at machine learning, and the [statsmodels](https://www.statsmodels.org/stable/index.html) library, which is more econometrically oriented.

# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm
from sklearn import linear_model

# load the iris dataset shipped with seaborn
iris = sns.load_dataset("iris")

Let's compare both implementations.

dummies = pd.get_dummies(iris["species"], drop_first=False)
iris = pd.concat([iris, dummies], axis=1)
iris.head()

Since species is a categorical variable, we need to convert it to a format a computer can handle, so we turn to a one-hot encoding. Let's begin with machine learning.
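Here is a minimal sketch of what the sklearn fit could look like, assuming petal_width is the predicted variable and the other numeric columns plus the species dummies are the predictors (that feature choice is an assumption):

# Assumed setup: predict petal_width from the other measurements and the species dummies
features = ["sepal_length", "sepal_width", "petal_length",
            "setosa", "versicolor", "virginica"]
X = iris[features].astype(float)
y = iris["petal_width"]

model = linear_model.LinearRegression()
model.fit(X, y)

print(model.intercept_)                    # beta0, the intercept
print(dict(zip(features, model.coef_)))    # one coefficient per predictor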

We can extract the model coefficients and the intercept β0 through the model object. Let's give statsmodels a try.
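A sketch of the equivalent statsmodels fit, under the same assumed feature set:

# statsmodels requires the intercept column to be added explicitly
X_sm = sm.add_constant(X)
ols_model = sm.OLS(y, X_sm).fit()
print(ols_model.summary())   # R², AIC, BIC, coefficients, p-values and warnings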

Statsmodels gives us a lot of information compared to sklearn: we get a very good R², the AIC and BIC we talked about previously, the coefficient of each variable, and warnings. Let's try to predict:
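To compare the two fits on prediction, something along these lines; the choice of in-sample predictions and of sklearn.metrics is illustrative:

from sklearn.metrics import mean_absolute_error, mean_squared_error

pred_sklearn = model.predict(X)
pred_sm = ols_model.predict(X_sm)

print("sklearn     MAE :", mean_absolute_error(y, pred_sklearn))
print("sklearn     RMSE:", np.sqrt(mean_squared_error(y, pred_sklearn)))
print("statsmodels MAE :", mean_absolute_error(y, pred_sm))
print("statsmodels RMSE:", np.sqrt(mean_squared_error(y, pred_sm)))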

We get the same R² and very good MAE and RMSE… but we notice that the coefficients are not equal between the two models. Statsmodels warns us that our model may be [multicollinear](https://en.wikipedia.org/wiki/Multicollinearity)! This refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related, meaning there is redundant information in our dataset. That information comes from the species variable: we must drop one category, because obviously if an iris is neither setosa nor virginica, it is versicolor.

This means that, although our model has a strong R² and therefore strong predictive power, its coefficients are biased and uninterpretable.
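One way to make this multicollinearity visible is the variance inflation factor from statsmodels; this diagnostic is an extra illustration, not part of the walkthrough above:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# A very large (or infinite) VIF signals a redundant predictor
X_vif = sm.add_constant(iris[features].astype(float))
for i, name in enumerate(X_vif.columns):
    if name != "const":
        print(name, variance_inflation_factor(X_vif.values, i))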

This information was not reported by sklearn. Let's correct it by passing drop_first=True.
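A sketch of the corrected encoding and of the statsmodels refit, still with the assumed target and predictors:

# Re-encode the species with one category dropped: setosa becomes the reference
iris = sns.load_dataset("iris")
dummies = pd.get_dummies(iris["species"], drop_first=True)   # keeps versicolor and virginica
iris = pd.concat([iris, dummies], axis=1)

features = ["sepal_length", "sepal_width", "petal_length", "versicolor", "virginica"]
X_sm = sm.add_constant(iris[features].astype(float))
ols_model = sm.OLS(iris["petal_width"], X_sm).fit()
print(ols_model.summary())   # the multicollinearity warning should be gone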

Statsmodels has removed its warning, and we now have unbiased coefficients. Moreover, the skewness is close to 0 and the kurtosis is close to its normal value, which suggests that our residuals are approximately normal; the Jarque-Bera probability supports this, so the model is well specified. Let's rerun our sklearn model:
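And the matching sklearn refit, as a sketch:

X = iris[features].astype(float)
y = iris["petal_width"]

model = linear_model.LinearRegression()
model.fit(X, y)

print(model.intercept_)
print(dict(zip(features, model.coef_)))   # should now match the statsmodels coefficients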

Finally we get the same results, so let's interpret them. All other things being equal, a 1 cm increase in petal length increases petal width by 0.24 cm. For the categorical variables we always interpret relative to the dropped category: all things being equal, the species virginica has petals 1.04 cm wider than the reference species setosa. All p-values are significant at the 5% threshold, so our coefficients can be considered robust and unbiased. We have seen the analysis of a linear regression model; the same reasoning also transposes to classification. Logistic regression offers a very interesting [odds ratio](https://en.wikipedia.org/wiki/Odds_ratio) reading in model analysis, and I will discuss reading odds ratios in a future article.


To conclude

The probabilistic foundations of econometrics are undoubtedly its strength, offering not only interpretable models but also a quantification of uncertainty. Nevertheless, the predictive performance of machine learning models is valuable: it can highlight a poor specification of an econometric model, and some of these algorithms are better suited to unstructured data. Econometrics is intended to be rigorous, and it remains a very relevant tool for analysing economic factors; if your manager asks you to quantify an effect, it could be the right choice, while also giving you statistical and mathematical legitimacy.

