Comparison of three algorithms’ accuracy at predicting 2014 life expectancy

Will Multivariate Linear Regression, Random Forest, or XGBoost show the best results in terms of R², MSE and RMSE?

Yaremko Nazar
Towards Data Science


Link to GitHub repository: https://github.com/nazaryaremko/Data-Science-Tutorial-Project

With the continuous development of statistical models and their rapid computerization, data scientists have gained access to more and more tools for extracting valuable information from publicly available datasets. What once required hundreds of lines of code and hours spent working out how to implement a model can now be done with a few lines that call powerful statistical libraries. A question many undergraduate data scientists face when analyzing public datasets is what tools to use: when there are so many available, how can we make sure a proper algorithm is chosen for a concrete task?

In this article, I will implement three different models to solve a regression problem: Multivariate Regression, Random Forest, and XGBoost. Specifically, I will use the public Life Expectancy (WHO) dataset from Kaggle, which contains data from 193 countries (Rajarshi, 2018). The dependent variable we will try to predict is the average life expectancy for each country. There are 18 predictors, including variables like adult mortality rates (both sexes, per 1000 population) and the number of infant deaths per 1000 population.

Exploratory Data Analysis

Before implementing the models, I performed a simple exploratory data analysis to show what kind of dataset I am working with. I started by uploading the dataset and filtering out all the rows that were not from 2014. I chose to work specifically with data from 2014 because I wanted the most recent data in the dataset, and the rows from 2015 had a lot of missing values.

import io
import pandas as pd

# Load the uploaded CSV (e.g. via Google Colab's files.upload()) and keep only the 2014 rows
life_expectancy = pd.read_csv(io.BytesIO(uploaded['Life Expectancy Data.csv']))
life_expectancy = life_expectancy[life_expectancy.Year == 2014].iloc[:, 0:]

Next, I ran the following commands to show the shape and head of the resulting data frame and to count the NaN values in each column:

life_expectancy.shape
life_expectancy.head(10)
print("\nCount total NaN at each column in a DataFrame:\n\n", life_expectancy.isnull().sum())

Pre-processing the Data

Based on the EDA results, we can see that around half of the columns have missing values. One common and simple way to deal with missing values is to substitute them with the median or mean of the respective column. One issue with this approach is that it heavily decreases the variance of the attributes, which is likely to result in poor predictions (Kubara, 2019). Hence, I resorted to another approach: Iterative Imputation. Iterative imputation uses the other features in the dataset to estimate the missing values of a given feature. The process is performed sequentially, so newly imputed values can be used when predicting other features, and iteratively, so that repeated passes improve the accuracy of the estimates (Brownlee, 2020).

To implement this method, I used the IterativeImputer class from the sklearn library. The resulting dataset was then used for model fitting and prediction.
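
The imputation step might look roughly like the following sketch; the parameters here are my own assumptions, and note that IterativeImputer is still experimental in sklearn and must be enabled explicitly:

import numpy as np
# enable_iterative_imputer must be imported before IterativeImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Impute only the numeric columns of the 2014 data frame
numeric_cols = life_expectancy.select_dtypes(include=np.number).columns
imputer = IterativeImputer(max_iter=10, random_state=0)
life_expectancy[numeric_cols] = imputer.fit_transform(life_expectancy[numeric_cols])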

Multivariate Regression

Regression is one of the most common methods used in data science. It is used to mathematically quantify the effect of an independent variable (or, in Multivariate Regression, multiple variables) on a dependent variable (Multivariate regression, n.d.). In our case, we have 18 possible predictor variables, and our goal is to find a combination of them that produces a model with the highest accuracy. Before choosing the predictor variables, I split the dataset into training and testing sets and fit a linear model on the training set. I then predicted the dependent variable from the values of the independent variables in the test set. Finally, I calculated MSE, RMSE, and R² values to evaluate the performance of the model:

[Table of MSE, RMSE, and R² values for the full regression model. Image by Author.]
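
For reference, here is a minimal sketch of the split-fit-evaluate procedure described above; the column names and the split proportion are assumptions based on the Kaggle dataset:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Separate the 18 predictors from the target column
X = life_expectancy.drop(columns=['Country', 'Year', 'Status', 'Life expectancy'])
y = life_expectancy['Life expectancy']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

linreg = LinearRegression().fit(X_train, y_train)
y_pred = linreg.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print('MSE: ', mse)
print('RMSE:', np.sqrt(mse))
print('R2:  ', r2_score(y_test, y_pred))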

To pick the best combination of predictor variables, I decided to use the leaps library in R, which performs an exhaustive search for the best combination of independent variables for predicting a dependent variable in linear regression. I imported the dataset into R and ran the following commands:

library(leaps)

# Exhaustive best-subset search over all 18 candidate predictors
regsubsets.out <- regsubsets(
  Life.expectancy ~ Adult.Mortality + infant.deaths + Alcohol +
    percentage.expenditure + Hepatitis.B + Measles + BMI +
    under.five.deaths + Polio + Total.expenditure + Diphtheria +
    HIV.AIDS + GDP + Population + thinness..1.19.years +
    thinness.5.9.years + Income.composition.of.resources + Schooling,
  data = data,
  nbest = 1,
  nvmax = NULL,
  force.in = NULL, force.out = NULL,
  method = "exhaustive"
)
plot(regsubsets.out, scale = "adjr2", main = "Adjusted R^2")

The output of this code is a table of predictor combinations together with the adjusted R² value of the model fit on each combination.

Figure 1. Adjusted R² values for different combinations of predictor variables in a multivariate regression model. Each row represents a separate multivariate model with its own adjusted R² value: black boxes mark variables included in the model, and white boxes mark variables left out. Image by Author.

Two aspects of this table that we need to take into account are the number of predictors and the adjusted R² value. The goal is to pick a model that is parsimonious but still highly accurate. A few combinations of variables offer both a high adjusted R² and a small number of variables:

  • Model #1: Adult Mortality, HIV/AIDS, Income composition of resources (ICR);
  • Model #2: Adult Mortality, HIV/AIDS, ICR, Total Expenditure;
  • Model #3: Adult Mortality, HIV/AIDS, ICR, Total Expenditure, Hepatitis B;

To test the performance of these models, I repeated the same steps as described earlier and calculated MSE, RMSE, and R² for each:

[Table of MSE, RMSE, and R² values for models #1–#3. Image by Author.]

As we can see, model #3 has the best performance according to the chosen metrics.
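
Evaluating one of these candidates (model #3, for example) is then just a matter of refitting on the chosen columns, as in this minimal sketch reusing the split from earlier (column names are assumptions based on the Kaggle dataset):

# Predictor subset suggested by the leaps search (model #3)
model3_cols = ['Adult Mortality', 'HIV/AIDS', 'Income composition of resources',
               'Total expenditure', 'Hepatitis B']

model3 = LinearRegression().fit(X_train[model3_cols], y_train)
pred3 = model3.predict(X_test[model3_cols])

mse3 = mean_squared_error(y_test, pred3)
print('MSE: ', mse3)
print('RMSE:', np.sqrt(mse3))
print('R2:  ', r2_score(y_test, pred3))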

Figure 2. Actual vs. Predicted values of Life Expectancy in different countries. Predicted values were calculated by applying a Multivariate Regression model to the test part of the dataset. Image by Author.

Random Forest

Random forest is a supervised learning algorithm that uses multiple decision trees for classification and regression tasks (Yiu, 2021). It is among the most accurate off-the-shelf learning algorithms available. An important aspect of this algorithm is that all of its decision trees are built independently of one another and can run in parallel; the mean prediction (in regression tasks) or the mode of the classes (in classification) across all trees is returned. This reduces overfitting and variance and improves the accuracy of the results (A complete guide to the random forest, 2021).

To implement the Random Forest model, I used the RandomForestRegressor class from the sklearn library. First, I fit the model using all of the available predictor variables and calculated the same metrics as before.
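
A minimal sketch of this step, reusing the train/test split from the regression section (the hyperparameters shown are illustrative defaults, not necessarily the ones behind the reported numbers):

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Fit a forest of 100 trees on all 18 predictors
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)

mse_rf = mean_squared_error(y_test, rf_pred)
print('MSE: ', mse_rf)
print('RMSE:', np.sqrt(mse_rf))
print('R2:  ', r2_score(y_test, rf_pred))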

Next, I used the SelectFromModel function, which identifies the variables with the highest importance in a fitted model. I applied it only to the model trained on the training set, to avoid leaking information from the test set. The selected variables were Adult Mortality, HIV/AIDS, ICR, and Schooling. I then fit the same Random Forest model again using only these four variables, which gave the following results:

Figure 3. Actual vs. Predicted values of Life Expectancy in different countries. Predicted values were calculated by applying a Random Forest model to the test part of the dataset. Image by Author.

As we can see, the results are slightly better, and the model is reduced to only four variables.
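
The feature-selection step described above might look roughly like this sketch (SelectFromModel comes from sklearn; by default it keeps features whose importance exceeds the mean importance):

from sklearn.feature_selection import SelectFromModel

# Select important features using the forest already fit on the training set
selector = SelectFromModel(rf, prefit=True)
selected_cols = X_train.columns[selector.get_support()]
print(selected_cols)

# Refit the forest using only the selected features
rf_small = RandomForestRegressor(n_estimators=100, random_state=42)
rf_small.fit(X_train[selected_cols], y_train)
rf_small_pred = rf_small.predict(X_test[selected_cols])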

XGBoost

XGBoost, or extreme gradient boosting, is a machine learning algorithm that has recently gained popularity due to its high efficiency at common tasks like regression and classification (Brownlee, 2021). Like Random Forest, XGBoost uses decision trees to build its model; however, there is a key difference: XGBoost builds them sequentially, fitting one tree and then using its residuals to fit the next, and so on. Overall, XGBoost is a gradient boosting algorithm (it uses gradient descent to minimize the loss function) with decision trees as its predictors (Morde, 2019). It works very well with structured, tabular data, which is why it can be expected to produce good results on the WHO dataset.

To apply the XGBoost algorithm to our regression problem, I used the XGBRegressor class from the xgboost library. A great feature of this library is that it reports feature importances after fitting the model.
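
A minimal sketch of fitting the regressor and inspecting the importances (the hyperparameters are illustrative):

from xgboost import XGBRegressor

# Fit a gradient-boosted tree ensemble on all predictors
xgb_model = XGBRegressor(objective='reg:squarederror', random_state=42)
xgb_model.fit(X_train, y_train)

# One importance value per feature; the values sum to 1
print(xgb_model.feature_importances_)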

Figure 4. Importances of the different independent variables in the XGBoost model. The importances sum to 1, and the variables with the highest importance have the greatest influence on the model's predictions. Image by Author.

As we can see, features 11 and 16 have the highest importance, followed by features 17, 0, and 15. sklearn's SelectFromModel function can also be used here: it selects the combination of predictors whose importance exceeds a provided threshold. To check how the model performs for different threshold values, I sorted the values of the model's feature_importances_ attribute and fitted 18 models, starting with all variables and at each step discarding the variable with the lowest importance. Based on the model metrics, the best-performing model had the following 15 predictor variables:

[List of the 15 selected predictor variables. Image by Author.]

and produced the following model metrics:

[Table of MSE, RMSE, and R² values for the best-performing XGBoost model. Image by Author.]
Figure 5. Actual vs. Predicted values of Life Expectancy in different countries. Predicted values were calculated by applying an XGBoost model to the test part of the dataset. Image by Author.

As we can see, the XGBoost algorithm produced the best results, with HIV/AIDS and ICR as its most important predictors.
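
For completeness, the threshold sweep described above might look like the following sketch, patterned on the common SelectFromModel recipe (the variable names are mine):

import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import mean_squared_error, r2_score

# Try each importance value as a threshold: from all 18 features down to 1
for threshold in np.sort(xgb_model.feature_importances_):
    selector = SelectFromModel(xgb_model, threshold=threshold, prefit=True)
    X_train_sel = selector.transform(X_train)
    X_test_sel = selector.transform(X_test)
    candidate = XGBRegressor(objective='reg:squarederror', random_state=42)
    candidate.fit(X_train_sel, y_train)
    pred = candidate.predict(X_test_sel)
    mse = mean_squared_error(y_test, pred)
    print('n=%d, threshold=%.4f, RMSE=%.3f, R2=%.3f'
          % (X_train_sel.shape[1], threshold, np.sqrt(mse), r2_score(y_test, pred)))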

Result validation using time series data

Based on the values of MSE, RMSE, and R² for the three models, XGBoost achieved the highest accuracy in predicting Life Expectancy. Furthermore, as shown in Figure 4, two variables have significantly higher importance values than the others: HIV/AIDS and ICR. These two variables also came up as the most important ones in the Multivariate Regression and Random Forest models.

In this dataset, HIV/AIDS is defined as ‘Deaths per 1 000 live births HIV/AIDS (0–4 years)’ and ICR is defined as ‘Human Development Index in terms of income composition of resources (index ranging from 0 to 1)’. Although the importance of the first variable in predicting Life Expectancy seems fairly obvious, it is interesting that, out of the other mortality-rate variables such as ‘under-five deaths’ and ‘Adult Mortality’, it was specifically the HIV/AIDS-related mortality rate that had the highest impact on the dependent variable.

To see whether the importance of these variables is reflected in their temporal trends, I used the ‘HIV/AIDS’, ‘Income composition of resources’, and ‘Life expectancy’ data from 2000 to 2015 and plotted a few simple visualizations.
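
The scatterplots in Figure 6 can be produced along these lines (the multi-year data frame df and the column names are assumptions based on the Kaggle dataset):

import matplotlib.pyplot as plt

# df holds all years (2000-2015) of the Kaggle dataset
for year in (2002, 2015):
    snapshot = df[df.Year == year]
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    for ax, col in zip(axes, ['HIV/AIDS', 'Income composition of resources']):
        ax.scatter(snapshot[col], snapshot['Life expectancy'], alpha=0.6)
        ax.set_xlabel(col)
        ax.set_ylabel('Life expectancy')
        ax.set_title('%s, %d' % (col, year))
    plt.show()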

Figure 6. Scatterplots of Life Expectancy vs. ICR and HIV/AIDS in 2002 and 2015. Life expectancy has a moderate negative correlation with HIV/AIDS and a moderate-to-strong positive correlation with ICR. Image by Author.

As we can see, in both 2002 and 2015, Life Expectancy has a moderate negative correlation with HIV/AIDS and a moderate-to-strong positive correlation with ICR.

Figure 7. Time series of Life Expectancy and ICR data from 2000 to 2015. Each variable is shown on its own scale. Image by Author.

In the time series, Life Expectancy and ICR appear to grow at very similar rates, a trend common to many countries.

Figure 8. Time series of Life Expectancy and HIV/AIDS data from 2000 to 2015. Each variable is shown on its own scale. Image by Author.

For Life Expectancy and HIV/AIDS, as expected, many countries show an inverse trend.

Conclusion

In this project, I implemented three algorithms to perform a regression task. Specifically, I used Multivariate Regression, Random Forest, and XGBoost to predict 2014 Life Expectancy in different countries. Of the three algorithms, XGBoost performed best according to the values of MSE, RMSE, and R². Two independent variables, ICR and HIV/AIDS, showed the highest importance in predicting Life Expectancy. Further exploration of data from 2000 to 2015 confirmed a strong relationship between these independent variables and the dependent variable. Each year, Life Expectancy had a moderate negative correlation with HIV/AIDS and a moderate-to-strong positive correlation with ICR. Additionally, in the time series plots, as expected, an increase in Life Expectancy corresponded to an increase in ICR and a decrease in HIV/AIDS. These results point towards the role ICR and HIV/AIDS might play in influencing the average life expectancy of a country. Not all countries shared the same trends, however, and multiple outliers were found; a closer analysis of each individual case would be required to draw solid conclusions.

References

Brownlee, J. (2020, August 18). Iterative imputation for missing values in machine learning. Machine Learning Mastery. Retrieved November 3, 2021, from https://machinelearningmastery.com/iterative-imputation-for-missing-values-in-machine-learning/

Brownlee, J. (2021, February 16). A gentle introduction to XGBoost for applied machine learning. Machine Learning Mastery. Retrieved November 5, 2021, from https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/.

A complete guide to the random forest algorithm. Built In. (2021, July 22). Retrieved November 5, 2021, from https://builtin.com/data-science/random-forest-algorithm.

Kubara, K. (2019, June 24). Why using a mean for missing data is a bad idea. alternative imputation algorithms. Medium. Retrieved November 3, 2021, from https://towardsdatascience.com/why-using-a-mean-for-missing-data-is-a-bad-idea-alternative-imputation-algorithms-837c731c1008.

Morde, V. (2019, April 8). XGBoost algorithm: Long may she reign! Medium. Retrieved November 5, 2021, from https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d.

Multivariate regression. Brilliant Math & Science Wiki. (n.d.). Retrieved November 5, 2021, from https://brilliant.org/wiki/multivariate-regression/

Rajarshi, K. (2018, February 10). Life expectancy (WHO). Kaggle. Retrieved November 3, 2021, from https://www.kaggle.com/kumarajarshi/life-expectancy-who.

Yiu, T. (2021, September 29). Understanding random forest. Medium. Retrieved November 5, 2021, from https://towardsdatascience.com/understanding-random-forest-58381e0602d2.
