
Predicting happiness using Random Forest

Exploring what influences happiness levels in people using machine learning.

Photo by Mike Petrucci on Unsplash

Why try to predict happiness? If we can predict it, we can identify which components to work on in order to increase our own happiness, and governments can do the same for national happiness. I found Random Forest (RF) to be one of the simplest and most efficient models to work with, so let’s get started!

Contents:

  1. The Data
  2. Random Forest Model
  3. Data Cleaning
  4. Training and Testing
  5. Feature Importances
  6. Modifying number of variables
  7. Evaluating the Model
  8. Insights

The Data:

The data come from the World Values Survey (WVS), a cross-national survey repeated over the years; the questionnaire can be found on the WVS website. In particular, we will be looking at the 2017–2020 data set, which contains more than 290 questions and about 69k responses once rows with a missing happiness level are removed. The size of the data set makes it well suited to machine learning.

Random Forest Model:

To start with, we will use the RF classifier, since we would like the machine to predict the level of happiness as one of four groups (Very happy, Quite happy, Not very happy, Not at all happy). _Side note: an RF regressor is used when the target is a number that can take a range of values, e.g. any value between 0 and 1._
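To make the classifier/regressor distinction concrete, here is a minimal sketch on synthetic data (the arrays below are made up for illustration and are not WVS responses). The classifier returns one of the category codes, while the regressor returns a real number.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic data (an assumption for illustration, not WVS responses)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y_class = rng.integers(1, 5, size=200)   # four categories coded 1..4
y_value = rng.uniform(0, 1, size=200)    # continuous target in [0, 1)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y_class)
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y_value)

class_preds = clf.predict(X[:5])   # each prediction is one of the category codes
value_preds = reg.predict(X[:5])   # each prediction is a real number
print(class_preds)
print(value_preds)
```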

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

Data cleaning: Selecting the data

Let’s start by keeping only the question columns and removing the rows with negative values* in the responses to Q46, which asks about happiness levels.

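var = "Q46"  # the happiness question
# df is assumed to already hold the WVS 2017–2020 responses
df = df[df.columns[32:349]]  # keep only the question columns
df = df[df[var] > 0]  # drop rows with a negative (invalid) happiness code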

*Negative values mean that the respondent said they don’t know, gave no answer, was not asked, or that the response was missing. Keeping them would make classification harder, since they add categories that are not what we are looking for.
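As a small illustration of the filter above, the sketch below applies the same `> 0` condition to a toy frame. The values and the column name `other_question` are hypothetical. Note that only the target column is filtered, so negative codes can still remain in the other columns.

```python
import pandas as pd

# Toy responses (hypothetical values); negative codes stand for
# "don't know", "no answer", "not asked" or "missing"
toy = pd.DataFrame({
    "Q46": [1, 2, -1, 3, -2, 4, 2, -5],
    "other_question": [1, 3, 2, -1, 4, 5, 2, 1],
})

var = "Q46"
cleaned = toy[toy[var] > 0]   # keep rows with a valid happiness answer

print(sorted(toy[var].unique()))      # still includes the negative codes
print(sorted(cleaned[var].unique()))  # only the four happiness levels remain
print(len(toy), "->", len(cleaned))
```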

The data set remaining is shown below:

Further data cleaning:

The next concern is that we would have to deal with missing values in other columns. There are 3 options to consider:

  1. Replace the missing values with 0
  2. Replace the missing values with the mean
  3. Drop the rows with missing values (every row has at least one, so the data set becomes empty).

Since the third option is not viable, we have to check which of options 1 and 2 gives the higher accuracy. In this case, I found that replacing with 0 is slightly more accurate.

df.fillna(0, inplace=True)
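One way to run that comparison is sketched below. The helper `accuracy_after_fill` and the toy data are assumptions made for illustration, and the forest is kept small so that it runs quickly. The mean is computed over the whole toy frame for simplicity.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

def accuracy_after_fill(frame, target, strategy):
    """Fill missing feature values, train a small forest, return test accuracy."""
    features = frame.drop(target, axis=1)
    if strategy == "zero":
        features = features.fillna(0)
    elif strategy == "mean":
        features = features.fillna(features.mean())
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    X_train, X_test, y_train, y_test = train_test_split(
        features.to_numpy(), frame[target].to_numpy(),
        test_size=0.25, random_state=42)
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    model.fit(X_train, y_train)
    return metrics.accuracy_score(y_test, model.predict(X_test))

# Synthetic survey-like data with roughly 20% of the answers missing (hypothetical)
rng = np.random.default_rng(42)
values = rng.integers(1, 5, size=(400, 6)).astype(float)
values[rng.random(values.shape) < 0.2] = np.nan
toy = pd.DataFrame(values, columns=[f"q{i}" for i in range(6)])
toy["target"] = rng.integers(1, 5, size=400)

results = {s: accuracy_after_fill(toy, "target", s) for s in ("zero", "mean")}
print(results)
```

Because the toy target is random, the two numbers here say nothing about happiness; the point is only the shape of the comparison.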

Prepare train labels:

Now we set the ‘label’, the variable we want the machine to predict, and split the data into train and test sets.

# Labels: the happiness answers (Q46)
labels = np.array(df[var])
# Features: every other question
features = df.drop(var, axis=1)
feature_list = list(features.columns)
features = np.array(features)
# Hold out 25% of the rows for testing
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.25, random_state=42)
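As a quick check of what the split does, the sketch below uses made-up arrays (not the WVS data). With `test_size = 0.25`, three quarters of the rows go to training, which matches roughly 52k training rows out of about 69k, and each feature row stays paired with its own label.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: row i holds [2i, 2i + 1] and its label is i % 4 + 1
features = np.arange(2000).reshape(1000, 2)
labels = np.arange(1000) % 4 + 1

train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.25, random_state=42)

print(train_features.shape, test_features.shape)

# Recover each training row's original index and confirm its label still matches
row_index = train_features[:, 0] // 2
print((row_index % 4 + 1 == train_labels).all())
```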

Train and Test the Model:

The process of training and testing is simple. To improve the predictive power and/or model speed, we can simply modify the parameters within the RF classifier.

Increasing accuracy:

n_estimators – number of trees the algorithm builds before majority voting

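max_features – maximum number of features Random Forest considers when looking for the best split at a node

min_samples_leaf – the minimum number of samples that must end up in each leaf; a split is only made if it leaves at least this many samples on both sides.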

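Increasing speed:

n_jobs – number of processors it is allowed to use. If = 1, it uses only one processor. If = -1, it uses all available processors

Reproducibility and validation:

random_state – makes the model’s output replicable, i.e. it always produces the same results given the same hyperparameters and training data

oob_score – whether to score the model on the out-of-bag samples (the rows each tree did not see during training), which works as a built-in alternative to cross-validation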

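# max_features="auto" was removed in recent scikit-learn releases; for classifiers it meant "sqrt"
rf = RandomForestClassifier(n_estimators=1000, oob_score=True, n_jobs=-1, random_state=42, max_features="sqrt", min_samples_leaf=12)
rf.fit(train_features, train_labels)
# Predict on the held-out rows and report the share classified correctly
predictions = rf.predict(test_features)
print(metrics.accuracy_score(test_labels, predictions))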
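Setting `oob_score = True` stores an out-of-bag accuracy estimate in the fitted model's `oob_score_` attribute, which can be compared with the test accuracy. Below is a minimal sketch on synthetic data generated with `make_classification` (an assumption for illustration), using fewer trees so that it runs quickly and `max_features="sqrt"` for the square-root rule.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Synthetic four-class problem (not the WVS data)
X, y = make_classification(n_samples=600, n_features=12, n_informative=5,
                           n_classes=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

rf = RandomForestClassifier(n_estimators=100, oob_score=True, n_jobs=-1,
                            random_state=42, max_features="sqrt",
                            min_samples_leaf=12)
rf.fit(X_train, y_train)

# Accuracy estimated from rows each tree did not see during training
oob_accuracy = rf.oob_score_
# Accuracy on the held-out test rows
test_accuracy = metrics.accuracy_score(y_test, rf.predict(X_test))
print(oob_accuracy, test_accuracy)
```

If the two numbers are close, the out-of-bag estimate is a cheap sanity check that needs no separate validation split.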

Training on ~52k rows and >290 columns takes 1.3 minutes, and testing takes 1 second. The accuracy was 63.70%; had we filled the missing values with the mean instead, it would have been 63.55%. What matters more, though, is finding out what influences the machine’s prediction, since those are the variables we want to look at. We certainly cannot expect everyone to answer 290+ questions, or try to work on all 290+ aspects to improve happiness (that would cost a lot). So we’ll look at the feature importances.

Feature Importances:

If you recall, feature_list contains the columns of all variables except Q46. The goal is to understand which variables influence the prediction.

# Pair each feature name with its importance, rounded to 2 decimals
importances = list(rf.feature_importances_)
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort from most to least important, then print
feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))
x_values = list(range(len(importances)))
# Make a bar chart
plt.bar(x_values, importances, orientation='vertical', color='r', edgecolor='k', linewidth=1.2)
# Tick labels for x axis
plt.xticks(x_values, feature_list, rotation='vertical')
# Axis labels and title
plt.ylabel('Importance'); plt.xlabel('Variable'); plt.title('Variable Importances')
plt.show()

Feature importances sum to 1. Certain variables have a greater influence over the prediction than others, and almost every variable has some influence, albeit an extremely small one because there are so many variables. The next step is to keep improving the model so that we can better understand happiness.
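Because the importances sum to 1, a cumulative sum shows how much of the total the top few variables carry. The sketch below uses synthetic data and hypothetical feature names (`feat_1`, `feat_2`, ...), so the numbers it prints are illustrative only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names (not WVS questions)
X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           n_classes=4, random_state=0)
feature_list = [f"feat_{i}" for i in range(1, 31)]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank the features and accumulate their share of the total importance
ranked = pd.Series(rf.feature_importances_, index=feature_list)
ranked = ranked.sort_values(ascending=False)
cumulative = ranked.cumsum()

print(ranked.head(10))
print("Sum of all importances:", round(ranked.sum(), 6))
print("Share carried by the top 10:", round(cumulative.iloc[9], 3))
```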

Modifying the number of variables:

Let’s take the top 20 features and set up a new model using just these 20 variables (plus var itself, as the label). We’ll repeat the data cleaning and use the same RF model. I got an accuracy of 64.47%; had we replaced missing values with the mean, it would have been 64.41%. What is surprising is that with fewer variables the model becomes slightly more accurate (from 63.70% to 64.47%). This is likely because the other variables were adding noise to the model.
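The steps above could look roughly like the sketch below. It runs on synthetic data with hypothetical feature names, and `fit_and_score` is a helper introduced here for illustration, so its accuracies are not comparable with the figures quoted in this article.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

def fit_and_score(frame, target, columns):
    """Train a forest on the given columns and return (model, test accuracy)."""
    X_train, X_test, y_train, y_test = train_test_split(
        frame[columns].to_numpy(), frame[target].to_numpy(),
        test_size=0.25, random_state=42)
    model = RandomForestClassifier(n_estimators=100, random_state=42,
                                   min_samples_leaf=12)
    model.fit(X_train, y_train)
    accuracy = metrics.accuracy_score(y_test, model.predict(X_test))
    return model, accuracy

# Synthetic data with hypothetical feature names (not WVS questions)
X, y = make_classification(n_samples=800, n_features=60, n_informative=8,
                           n_classes=4, random_state=1)
names = [f"feat_{i}" for i in range(60)]
toy = pd.DataFrame(X, columns=names)
toy["target"] = y

# 1. Fit on every feature and rank the features by importance
full_model, full_accuracy = fit_and_score(toy, "target", names)
ranked = sorted(zip(names, full_model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)

# 2. Keep the 20 highest-ranked features and fit the same model again
top_features = [name for name, _ in ranked[:20]]
small_model, small_accuracy = fit_and_score(toy, "target", top_features)

print("All features:", round(full_accuracy, 4))
print("Top 20 only: ", round(small_accuracy, 4))
```

Since the same `random_state` is used for both splits, the two models are evaluated on the same test rows.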

Let’s look at the Feature Importances again:

This time, it is easier to tell which variables were more important. You may refer to the questionnaire found on the WVS website for more detailed information. I will give a summary of the topics that the questions covered.

Evaluating the model:

Let’s look at the graph of actual vs predicted values for the first 200 test values. For greater visibility of the whole test set, let’s also do a simple count for the difference in values of predicted and actual (predicted minus actual).

The differences are slightly more often negative than positive, meaning the model predicts a lower value than the actual one a little more often than a higher one, but the errors are otherwise fairly balanced!
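The count of differences (predicted minus actual) can be sketched as follows. The two arrays are hypothetical stand-ins for `predictions` and `test_labels`; a difference of 0 means the prediction was correct.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real test labels and predictions
test_labels = np.array([1, 2, 2, 3, 1, 2, 4, 2, 3, 2])
predictions = np.array([2, 2, 1, 2, 1, 2, 3, 2, 3, 3])

# Predicted minus actual, then count how often each difference occurs
differences = predictions - test_labels
counts = pd.Series(differences).value_counts().sort_index()
share_correct = (differences == 0).mean()

print(counts)
print("Share predicted exactly:", share_correct)
```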

Insights:

What I have done is examine which of the >290 questions in the WVS are most relevant to happiness levels. This means that we can focus specifically on these aspects when examining happiness.

Looking at the questionnaire, we also notice that Q261 and Q262 capture the same thing (age and year born), so we could remove one of them to make room for another feature. Q266, Q267 and Q268 (country of birth of the respondent and of the parents) look like repeats, but are not exactly the same thing, since immigration or cross-cultural marriage may occur. Nonetheless, we could consider removing two of them, since such cases are rare.
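Redundant pairs like age and year born can also be spotted from the data, for example by looking for very high absolute correlations. The sketch below uses made-up columns and assumes a single survey year, so the two columns are perfectly correlated; in the real data, collected over several years, the correlation would be very high but not exact. The 0.95 threshold is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Made-up columns; the names and values are hypothetical
rng = np.random.default_rng(0)
birth_year = rng.integers(1930, 2002, size=300)
toy = pd.DataFrame({
    "year_born": birth_year,
    "age": 2018 - birth_year,                       # assumes one survey year
    "income_scale": rng.integers(1, 11, size=300),  # unrelated column
})

# Absolute correlations, keeping each pair once (upper triangle only)
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

redundant_pairs = [(row, col)
                   for row in upper.index
                   for col in upper.columns
                   if upper.loc[row, col] > 0.95]
print(redundant_pairs)
```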

The general topics are:

Individual level: Life satisfaction, health, finances, freedom, age, safety, religion, marriage, and family.

National level: Country, perception of corruption, democracy/political influence, and national pride.

In particular, health, finances and age were the top features deemed important by the machine. In this sense, the individual-level factors have a greater influence on one’s happiness level than the national-level factors.

However, I noticed that the WVS did not have data on sleep hours, which was a key element that was observed in my earlier post. Nonetheless, it is still very much useful as we can consider those aspects for further analysis! I’ll be back with more insights into the correlation between those aspects and happiness, to determine how we can improve our happiness levels. Until then, remember to stay happy!

