Let’s Find Donors For Charity With Machine Learning Models

An application of Supervised Learning Algorithms

Andrei G
Towards Data Science


Welcome to my second Medium post about data science. I will write here about a project I’ve done using machine learning algorithms. I will explain what I did without relying heavily on technical language, but I will show snippets of my code. Code matters :)

The project is a hypothetical case study where I had to identify potential donors to a charity that offers funding to people willing to study machine learning in Silicon Valley. This charity, named CharityML, found that every donor was making more than $50,000 annually. My task was to use machine learning algorithms to help this charity identify potential donors in the entire region of California.

For the purposes of this project, I used supervised machine learning. Supervised learning is where you have input variables (X) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output. I will describe the steps I took to get from a messy data set to a good working model.

Data Processing

It’s always good to start with exploring the data. Basically, look for the total number of observations, the total number of features, missing values, which features should be encoded, and so on. And most importantly, look for the characteristics specific to the problem you are solving. In this case I was interested in how many people make more or less than $50,000 annually.

There are 13 features.
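A minimal sketch of that first look, assuming the census data lives in a file called census.csv and the label column is named 'income' (both names are assumptions here, not taken from the post):

import pandas as pd

# Load the census data (file and column names are assumptions)
data = pd.read_csv('census.csv')

# Total number of records
n_records = len(data)

# How many individuals make more than / at most $50,000 annually
n_greater_50k = (data['income'] == '>50K').sum()
n_at_most_50k = (data['income'] == '<=50K').sum()
greater_percent = 100 * n_greater_50k / n_records

print(f"Total number of records: {n_records}")
print(f"Individuals making more than $50,000: {n_greater_50k}")
print(f"Individuals making at most $50,000: {n_at_most_50k}")
print(f"Percentage making more than $50,000: {greater_percent:.2f}%")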

Transforming Skewed Continuous Features

A data set may sometimes contain at least one feature whose values tend to lie near a single number but also include a non-trivial number of values that are vastly larger or smaller than that number. Algorithms can be sensitive to such distributions of values and can underperform if the range is not properly normalized. In our data set, two features fit this description: 'capital-gain' and 'capital-loss'.
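One common way to handle this kind of skew is a logarithmic transformation, which pulls the extreme values in. A sketch, continuing with the DataFrame from above:

import numpy as np

skewed = ['capital-gain', 'capital-loss']

# Log-transform the skewed features; the "+ 1" avoids taking the log of zero
data[skewed] = data[skewed].apply(lambda x: np.log(x + 1))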

Normalizing Numerical Features

It is often good practice to perform some type of scaling on numerical features. Applying a scaling to the data does not change the shape of each feature’s distribution (such as 'capital-gain' or 'capital-loss' above); however, normalization ensures that each feature is treated equally when applying supervised learners. After scaling, the numerical features will range between 0 and 1.
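A sketch of the scaling step with scikit-learn’s MinMaxScaler; the exact numerical column names are assumptions based on the standard census data set:

from sklearn.preprocessing import MinMaxScaler

# Numerical columns in the census data (names assumed)
numerical = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']

# Separate the features from the label and scale the numerical columns to [0, 1]
features = data.drop('income', axis=1)
scaler = MinMaxScaler()
features[numerical] = scaler.fit_transform(features[numerical])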

Encoding the Data

Typically, learning algorithms expect inputs to be numeric, which requires that non-numeric features (called categorical variables) be converted. One popular way to convert categorical variables is by using the one-hot encoding scheme. One-hot encoding creates a “dummy” variable for each possible category of each non-numeric feature.
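One way to do this is with pandas’ get_dummies. A sketch, which also encodes the income label as 0/1 since the learners need a numeric target:

import pandas as pd

# One-hot encode the categorical features: one dummy column per category
features_final = pd.get_dummies(features)

# Encode the label as 1 for ">50K" and 0 for "<=50K"
income = (data['income'] == '>50K').astype(int)

print(f"{features_final.shape[1]} total features after one-hot encoding")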

Shuffle and Split Data

Now all categorical variables have been converted into numerical features, and all numerical features have been normalized. As always, we will now split the data (both features and their labels) into training and test sets. 80% of the data will be used for training and 20% for testing.
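A sketch of the split with scikit-learn’s train_test_split, which shuffles the data by default:

from sklearn.model_selection import train_test_split

# Shuffle and split: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    features_final, income, test_size=0.2, random_state=42)

print(f"Training set has {X_train.shape[0]} samples")
print(f"Testing set has {X_test.shape[0]} samples")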

Model Metrics

Before I get into the different models I tested for this project, I will elaborate on how we judge the quality of a model. What do we look at when deciding which model to choose? There is always going to be a trade-off between how well we want the model to predict and how computationally expensive it is, or how long it takes to train.

The three main metrics of model performance are: accuracy, precision and recall.

Accuracy measures how often the classifier makes the correct prediction. It’s the ratio of the number of correct predictions to the total number of predictions (the number of test data points).

Precision tells us what proportion of the messages we classified as spam actually were spam. It is the ratio of true positives (messages classified as spam that actually are spam) to all positives (all messages classified as spam, irrespective of whether that classification was correct).

Recall (sensitivity) tells us what proportion of the messages that actually were spam we classified as spam. It is the ratio of true positives (messages classified as spam that actually are spam) to all the messages that actually were spam.

The different measures matter to varying degrees in different problems, so we should know which metric matters more to us. For instance, identifying someone who does not make more than $50,000 as someone who does would be detrimental to CharityML, since they are looking for individuals willing to donate. Therefore, a model’s ability to precisely predict those that make more than $50,000 is more important than the model’s ability to recall those individuals.

We can use F-beta score as a metric that considers both precision and recall:
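Written out, the F-beta score combines the two:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$$

A beta smaller than 1 puts more weight on precision; a beta of 0.5 fits the reasoning above and is consistent with the baseline numbers reported further down.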

Naive Baseline Predictor

If we chose a model that always predicted an individual made more than $50,000, what would that model’s accuracy and F-score be on this data set? We want to know what a model without any training would look like.
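A minimal sketch of that baseline, reusing the 0/1 income label from earlier:

import numpy as np
from sklearn.metrics import accuracy_score, fbeta_score

# The naive model predicts "makes more than $50,000" for every individual
naive_predictions = np.ones(len(income), dtype=int)

accuracy = accuracy_score(income, naive_predictions)
fscore = fbeta_score(income, naive_predictions, beta=0.5)

print(f"Naive predictor: accuracy {accuracy:.4f}, F-score {fscore:.4f}")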

Supervised Learning Models

This is not an exhaustive list

Out of these options, I tried and tested three: Support Vector Machines (SVM), AdaBoost, and Random Forest.
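A rough sketch of how that comparison could look with default hyperparameters (not the author’s exact training code), scored with accuracy and the F-score with beta = 0.5:

from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, fbeta_score

models = [SVC(random_state=42),
          AdaBoostClassifier(random_state=42),
          RandomForestClassifier(random_state=42)]

for model in models:
    # Train on the 80% split, evaluate on the held-out 20%
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    acc = accuracy_score(y_test, predictions)
    f05 = fbeta_score(y_test, predictions, beta=0.5)
    print(f"{type(model).__name__}: accuracy {acc:.4f}, F-score {f05:.4f}")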

The AdaBoost model is best suited to our problem. The F-score on the testing set is the highest for AdaBoost. Besides that, there is not a big gap between the training and testing F-scores, unlike with Random Forest. This matters because we do not want our model to overfit the training set and return an inflated F-score. The accuracy and F-score are highest for AdaBoost at all training set sizes. The training and testing times are very low, which means the model is computationally fast. The iterative nature of the model also handles a high number of attributes well, as in our case. Hence it is a good choice.

How AdaBoost Works

AdaBoost, short for adaptive boosting, is an ensemble algorithm. It uses iterative training to produce an accurate model. It starts with a weak learner, which gives an initial classification of the data. Specifically, the classification is done with decision stumps, meaning the data is separated with just a single split. The learner is called ‘weak’ because on its own it does not classify the data very well. In the following iterations, however, the model makes the learners focus on the misclassified points.

To be more precise, in the first step the weak learner separates the data with all points weighted equally. If there are misclassified points, they are assigned higher weights, so in the second iteration the next weak learner tries to capture most of those previous errors. The essence is that the model focuses on the errors by weighing them more. This iterative process continues for as many rounds as we specify, and with every iteration the model captures the data better and better. The weak learners are then combined, each assigned a weight according to its performance, and predictions are made by taking the weighted vote of the weak classifiers. The final learner is a strong learner built from the weak ones.
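In scikit-learn terms, the decision stump is simply a depth-one tree used as the base estimator. A small illustration, not the exact configuration used in the project:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

# Each weak learner is a decision stump: a tree of depth 1, i.e. a single split
stump = DecisionTreeClassifier(max_depth=1)

# 50 boosting rounds; each round puts more weight on the points the previous
# stumps misclassified (older scikit-learn versions name this parameter base_estimator)
model = AdaBoostClassifier(estimator=stump, n_estimators=50,
                           learning_rate=1.0, random_state=42)
model.fit(X_train, y_train)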

Model Tuning

We can improve the chosen model further by using Grid Search. The idea is to try different values for some hyperparameters, like the number of estimators or the learning rate, in order to achieve better performance metrics.
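A minimal sketch of such a search with scikit-learn’s GridSearchCV; the parameter grid below is illustrative, not the exact one used in the project:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import fbeta_score, make_scorer

# Score candidates with the same F-beta (beta = 0.5) metric as before
scorer = make_scorer(fbeta_score, beta=0.5)

param_grid = {'n_estimators': [50, 100, 200],
              'learning_rate': [0.5, 1.0, 1.5]}

grid = GridSearchCV(AdaBoostClassifier(random_state=42),
                    param_grid, scoring=scorer)
grid.fit(X_train, y_train)

best_model = grid.best_estimator_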

Hope you’re not tired of these yet

We can see that the optimized model performed better than the unoptimized model. The accuracy score increased from 0.8576 to 0.8651, and the F-score increased from 0.7246 to 0.7396. Remember that the Naive Predictor gave us an accuracy score of 0.2478 and F-score of 0.2917, which is not surprising because the naive model doesn’t do any training on the data.

Feature Importance

Out of the 13 features in this data set, I was curious to see which ones have the highest predictive power. And what would happen if we used only those in our model?
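One way to answer this is the feature_importances_ attribute of the trained AdaBoost model. A sketch, reusing the tuned model from the previous snippet:

import numpy as np

# Rank the features by the importance the trained AdaBoost model assigns them
importances = best_model.feature_importances_
top_five = np.argsort(importances)[::-1][:5]

for idx in top_five:
    print(f"{X_train.columns[idx]}: {importances[idx]:.4f}")

# Keep only the five most important features for a reduced model
X_train_reduced = X_train[X_train.columns[top_five]]
X_test_reduced = X_test[X_test.columns[top_five]]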

In the reduced model both the accuracy and F-score decreased. Fewer features in this case made the model generalize a bit worse compared to the full model. However, the reduction in scores is not large. In return, training time is faster because the model uses fewer features. Hence, if training time were a factor, this trade-off would make sense, because we would not lose much in terms of performance.

Summary

So what did we do? We got a data set and set a target to classify people who make more than $50,000 annually. We cleaned the data, normalized it, and converted the necessary variables into numerical features so that we could use them in our models. We shuffled and split the data into training and testing sets. We set a naive baseline predictor and built three other models. We chose AdaBoost as the best model. We tuned it further and made it slightly better. We also tried the model with only the five most important features, but it performed slightly worse.

Final Words

This was it :) If you made it this far, I want to say a big thank you. I hope this was a good and clear application of supervised machine learning. It is a powerful tool in data science, something I am currently studying and want to master. Feel free to comment below if you have questions, and you can always have a look at this project, along with many others, on my GitHub.

Follow me on LinkedIn: https://www.linkedin.com/in/andreigalanchuk/

God bless you all!
