Kaggle Competition — Finding Donors for a Charity with an AUC of 0.94

John Chen (Yueh-Han)
Towards Data Science
11 min read · Sep 6, 2021


Used Python to evaluate the performance of Random Forest, Gradient Boosting, and XGBoost using AUC, and built a final model to predict potential donors for a charity.

Photo by Markus Winkler on Unsplash

Project Overview

This project employs three supervised algorithms, Random Forest, Gradient Boosting, and XGBoost, to model individuals’ income using the 1994 U.S. Census data. I will then choose the best candidate algorithm from preliminary results and further optimize it to best model the data. My goal is to construct a model that accurately predicts whether an individual makes more than 50,000 dollars. This sort of task can arise in a non-profit setting, where organizations survive on donations. Understanding an individual’s income can help a non-profit judge how large a donation to request, or whether it should reach out at all. While it can be difficult to determine an individual’s income bracket directly from public sources, we can infer it from other publicly available features. The Kaggle competition link is here.

Dataset Overview

The dataset for this project originates from the UCI Machine Learning Repository. Ron Kohavi and Barry Becker donated the dataset after publishing the article “Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid.” You can find the article by Ron Kohavi online.

Problem Statement

A charity wants to find out who is likely to donate. As a data scientist, I can use this dataset and supervised machine learning algorithms to predict potential donors for the charity to reach out to.

Process

(Check full code here)

Step 1. Assessing

Step 2. Preprocessing

Step 3. Calculating the Performance of a Naive Predictor

Step 4. Selecting 3 Appropriate Model Candidates

Step 5. Creating a Training and Predicting Pipeline

Step 6. Initial Model Evaluation and Picking the Best Model

Step 7. Model Tuning

Step 8. Preprocessing the testing data from Kaggle

Competition Result

Note: This article is not meant to explain every line of code, but rather the most important parts of the project. Therefore, you may find some parts that only describe the results. If you are interested in the code itself, please check here.

Now, let’s get started!

Step 1. Assessing

After running some basic assessing functions, I found that this dataset has 14 columns, 45,222 rows, and no missing values.
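A minimal sketch of this kind of assessment, assuming the training data is loaded from a CSV file (the file name here is an assumption):

```python
import pandas as pd

# Load the training data (the file name is an assumption)
data = pd.read_csv('census.csv')

print(data.shape)           # (45222, 14): rows and columns
data.info()                 # column names, dtypes, and non-null counts
print(data.isnull().sum())  # missing values per column -- all zero here
```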

The last column, income, is the target variable. Photo by author

Feature explanation:

  • age: continuous.
  • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  • education-num: continuous.
  • marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  • race: Black, White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other.
  • sex: Female, Male.
  • capital-gain: continuous.
  • capital-loss: continuous.
  • hours-per-week: continuous.
  • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

See Distribution

Photo by author

As we can see, the distributions of age, capital-gain, and capital-loss are skewed to the right, so later I will transform them using a logarithmic transformation.

Assessment report:

  • Age, Capital-loss, Capital-gain are skewed to the right
  • All numeric features should be normalized
  • The target variable should be mapped into 1 and 0
  • All categorical features should be one-hot encoded

Step 2. Preprocessing

Preprocessing 1: Age, Capital-loss, and Capital-gain are skewed to the right.

After conducting a logarithmic transformation, they became:

Photo by author

We can see that they have become more centralized: age looks much better now, while capital-gain and capital-loss are slightly better.
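A minimal sketch of this transformation, assuming the data lives in the DataFrame data from the assessing step:

```python
import numpy as np

# Columns flagged as right-skewed during assessing
skewed = ['age', 'capital-gain', 'capital-loss']

# log(x + 1) keeps the zeros in capital-gain and capital-loss well-defined
data[skewed] = data[skewed].apply(lambda x: np.log(x + 1))
```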

Preprocessing 2: All numeric features should be normalized

After using MinMaxScaler from Sklearn to normalize all the numeric features, every value falls between 0 and 1.

Photo by author
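A minimal sketch of this normalization; the list of numeric columns follows from the feature explanation above:

```python
from sklearn.preprocessing import MinMaxScaler

# Numeric features to scale into the [0, 1] range
numerical = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']

scaler = MinMaxScaler()
data[numerical] = scaler.fit_transform(data[numerical])
```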

Preprocessing 3: The target variable should be mapped into 1 and 0

For the target variable, income, I used the .map function to map ‘>50K’ to 1 and ‘<=50K’ to 0. The result looks like this:

All the values become either 1 or 0. Photo by author
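A minimal sketch of the mapping:

```python
# Encode the target: '>50K' -> 1, '<=50K' -> 0
data['income'] = data['income'].map({'>50K': 1, '<=50K': 0})
```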

Preprocessing 4: All categorical features should be one-hot encoded.

In the last data preprocessing step, I used pd.get_dummies() to one-hot encode the categorical variables.

For example, the “workclass” variable originally has several categories, including Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, and Never-worked. Here is what “workclass” looks like after being one-hot encoded.

One-hot encoding splits one column into multiple columns, one per category. If a row’s original value is state_gov, as in row 1, then the state_gov column gets a 1 in that cell. Photo by author.
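A minimal sketch of this step; the variable names features and target are assumptions:

```python
# One-hot encode every categorical column; numeric columns pass through unchanged
features = pd.get_dummies(data.drop('income', axis=1))
target = data['income']

print(features.shape)  # many more columns now, one per category level
```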

Step 3. Calculating the Performance of a Naive Predictor

Generating a naive predictor shows what a baseline model without any intelligence would look like. When you present your report to managers, they can compare it against your models to see whether the models really add value.

In the real world, ideally, your base model would be either the results of a previous model or could be based on a research paper upon which you are looking to improve. When there is no benchmark model set, getting a result better than random choice is where you could start.

Here, I set the naive predictor to always predict ‘1’ (i.e., the individual makes more than 50K). The model will have no True Negatives (TN) or False Negatives (FN), as it never makes a negative (‘0’) prediction. Therefore, our accuracy becomes the same as our precision (True Positives / (True Positives + False Positives)): every ‘1’ prediction that should have been ‘0’ becomes a False Positive, so the denominator is the total number of records.

In this competition, submissions are judged by AUC, the area under the ROC curve. Therefore, I will calculate both AUC and accuracy to validate models throughout this project.

For the naive predictor’s accuracy: since we have 11001 True Positives out of 44445 total predictions, we get an accuracy of about 0.2475.

AUC is calculated from the predicted probabilities: as a decision threshold sweeps over the scores, each threshold yields a True Positive Rate and a False Positive Rate, and these (FPR, TPR) points are plotted as the ROC curve in a 2D chart. The AUC is the area under that curve.

In our case, the naive predictor assigns the same score to every record, so the ROC curve has only two distinct points. When the threshold is below that score, every record is classified as positive: the True Positive Rate is 1, because every positive is correctly classified, and the False Positive Rate is also 1, because every negative is incorrectly classified. When the threshold is above that score, nothing is classified as positive, which gives (False Positive Rate, True Positive Rate) = (0, 0).

Plotting (0, 0) and (1, 1) in the 2D chart and connecting them gives a triangle whose area is 1 × 1 × 1/2 = 1/2.

Photo by author

Summary of a Naive Predictor:

Accuracy = 0.2475

Area Under Curve = 0.5
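These two numbers can be reproduced with scikit-learn; a minimal sketch, assuming target is the 0/1 income column from Step 2:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# The naive predictor labels every record as 1 (income > 50K)
naive_pred = np.ones(len(target))

print(accuracy_score(target, naive_pred))  # the share of positive records (~0.25)
print(roc_auc_score(target, naive_pred))   # 0.5: a constant score carries no ranking information
```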

Later, we can compare these numbers with the AUC and accuracy of the models we employ to see whether those models really add value.

Step 4. Selecting 3 Appropriate Model Candidates

I chose three supervised learning models: Random Forest, Gradient Boosting, and XGBoost. For each model, I will answer the following questions:

What are the strengths of the model?

What are the weaknesses of the model?

Random Forest

  1. What are the strengths of the model; when does it perform well?
  • Random Forest is considered a highly accurate and robust method because of the number of decision trees participating in the process.
  • It is less prone to overfitting because it averages the predictions of many trees, which reduces variance.
  • The algorithm can be used in both classification and regression problems.
  • Resource: (https://www.datacamp.com/community/tutorials/random-forests-classifier-python)

2. What are the weaknesses of the model; when does it perform poorly?

  • Random Forest is slow in generating predictions because it has multiple decision trees. Whenever it makes a prediction, all the trees in the forest have to predict for the same given input and then vote on it. This whole process is time-consuming.
  • The model is difficult to interpret compared to a decision tree, where you can easily decide by following the path in the tree.
  • Resource: (https://www.datacamp.com/community/tutorials/random-forests-classifier-python)

Gradient Boosting Trees

  1. What are the strengths of the model; when does it perform well?
  • Like Random Forest, it is an ensemble of decision trees, but the trees are built sequentially, with each new tree correcting the errors of the previous ones, which often yields higher accuracy.
  • It is flexible: it supports different loss functions and handles a mixture of numeric and (encoded) categorical features well.

2. What are the weaknesses of the model; when does it perform poorly?

  • Training is sequential, so it is slower to train than Random Forest and harder to parallelize.
  • It is sensitive to its hyper-parameters (number of trees, learning rate, tree depth) and can overfit if they are not tuned carefully.

XGBoost

  1. What are the strengths of the model; when does it perform well?
  • Speed and performance: Originally written in C++, it is comparatively faster than other ensemble classifiers.
  • Works well on large datasets: Because the core XGBoost algorithm is parallelizable, it can harness the power of multi-core computers. It is also parallelizable onto GPUs and across networks of computers, making it feasible to train on huge datasets.
  • It can be used in selecting important features.
  • Less feature engineering required (No need for normalizing data, can also handle missing values well)
  • Often outperforms other algorithms: It has shown strong performance on various machine learning benchmark datasets.
  • Wide variety of tuning parameters: XGBoost internally has parameters for cross-validation, regularization, user-defined objective functions, missing values, tree parameters, Sklearn compatible API, etc.
  • Outliers have minimal impact.
  • It performs well when data has a mixture of numerical and categorical features or just numeric features.

2. What are the weaknesses of the model; when does it perform poorly?

  • Overfitting is possible if parameters are not tuned properly.
  • Harder to tune, as there are many hyper-parameters.
  • When it’s not suitable to use XGBoost: image recognition, computer vision, and natural language processing and understanding problems.

3. What makes this model a good candidate for the problem, given what you know about the data?

  • XGBoost is a frequent winner of Kaggle competitions.
  • It supports both classification and regression problems.

Step 5. Creating a Training and Predicting Pipeline

This pipeline will do five things: Training, Predicting, Documenting the Training/Predicting Time, Calculating Accuracy, and Calculating AUC.

Here is a minimal sketch of that pipeline; the splits X_train, y_train, X_test, and y_test come from a standard train_test_split of the preprocessed features and target:
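```python
from time import time
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# Split the preprocessed data (features/target from Step 2)
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42)

def train_predict(learner, X_train, y_train, X_test, y_test):
    """Train a model, predict on the test set, and record time, accuracy, and AUC."""
    results = {}

    # 1. Training, and documenting the training time
    start = time()
    learner.fit(X_train, y_train)
    results['train_time'] = time() - start

    # 2. Predicting, and documenting the prediction time
    start = time()
    predictions = learner.predict(X_test)
    probabilities = learner.predict_proba(X_test)[:, 1]
    results['pred_time'] = time() - start

    # 3. Calculating accuracy and AUC on the test set
    results['accuracy'] = accuracy_score(y_test, predictions)
    results['auc'] = roc_auc_score(y_test, probabilities)

    return results

# Run the pipeline for the three candidate models
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

for model in [RandomForestClassifier(), GradientBoostingClassifier(), XGBClassifier()]:
    print(type(model).__name__, train_predict(model, X_train, y_train, X_test, y_test))
```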

Step 6. Initial Model Evaluation and Picking the Best Model

Photo by author

XGBoost is the best model

  1. We can see that Random Forest overfits a bit, as it has a high training score but a relatively low testing score. In contrast, XGBoost performs the best: it gets the highest testing score, and its training and testing scores are nearly the same, which means it is neither underfitting nor overfitting.
  2. Although XGBoost has the longest training time, it takes only 12 seconds, which is acceptable for this dataset.
  3. XGBoost performs far better than the naive predictor (accuracy 0.2475, AUC 0.5): even unoptimized, it already reaches an accuracy of 0.85 and an AUC of 0.8.

Describing XGBoost in Layman’s Terms

Imagine asking a group of high school students to solve a set of college-level math questions. Each student has some math knowledge, but none of them is incredibly talented at, or terrible at, math. The students take turns solving the problems, and the teacher scores each student after they finish. The teacher also tells the next student which questions the previous student got wrong, so that they can be more mindful of those questions. This process repeats until every student has answered the questions. In the end, the teacher puts more trust in the answers of students who scored high, less trust in the answers of students who scored low, and combines them to answer the questions herself. XGBoost is the teacher. This is how XGBoost is trained.
The next time similar college-level questions are given to the teacher, she uses this accumulated knowledge to answer them; this is how XGBoost predicts.

Step 7. Model Tuning

Photo by author

Here, there are 2⁵ = 32 possible sets of parameters. If I used grid search, it would run 32 fits to find the best one, and since each fit uses either 300 or 400 estimators, running all 32 combinations would take very long.

Therefore, I used randomized search to find a good set of hyper-parameters: it randomly samples hyper-parameter sets to test, so it isn’t guaranteed to find the best one, but it runs much faster. Here are the results:

Photo by author

We can see that both accuracy and AUC increased a bit.
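For reference, a randomized search over such a grid might look like the sketch below. Only the 300/400 values for n_estimators are stated above; the other parameters and values are illustrative assumptions:

```python
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Five parameters with two options each: 2**5 = 32 possible combinations.
# Apart from n_estimators, these values are illustrative assumptions.
param_dist = {
    'n_estimators': [300, 400],
    'max_depth': [3, 5],
    'learning_rate': [0.05, 0.1],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}

search = RandomizedSearchCV(
    XGBClassifier(),
    param_distributions=param_dist,
    n_iter=10,           # sample only 10 of the 32 combinations
    scoring='roc_auc',   # the competition metric
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```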

Step 8. Preprocessing the testing data from Kaggle

To correctly preprocess the testing set, we have to follow the same procedure we used on the training set: log-transforming age, capital-loss, and capital-gain, normalizing all numeric features, and one-hot encoding all categorical features.

But before all that, we first have to check whether this testing set has missing values.

Photo by author

As we can see, almost every column has missing values, so I consider three situations:

  • If a numeric column is skewed, I fill in its missing values with the median of the same column from the training set.
  • If a numeric column is approximately normally distributed, I fill in its missing values with the mean of the same column from the training set.
  • For categorical columns, I fill in missing values with the mode, i.e., the category that appears most often in the same column of the training set.

After filling in the missing values, I conducted the same preprocessing steps as for the training set. A sketch of the imputation logic is below.
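A minimal sketch of this imputation, assuming the Kaggle test set is loaded as test_data and the original training data as train_data (both names are assumptions), with skewed being the list of right-skewed columns from Step 2:

```python
# Fill missing values in the test set using statistics from the training set
for col in test_data.columns:
    if test_data[col].isnull().any():
        if col in skewed:
            # Right-skewed numeric column: use the training median
            fill_value = train_data[col].median()
        elif test_data[col].dtype != 'object':
            # Roughly normally distributed numeric column: use the training mean
            fill_value = train_data[col].mean()
        else:
            # Categorical column: use the training mode (most frequent category)
            fill_value = train_data[col].mode()[0]
        test_data[col] = test_data[col].fillna(fill_value)
```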

After preprocessing the data, I fed it into the optimized XGBoost model, and here is the competition result.

Kaggle Competition Result

Out of 212 teams, I was ranked 30th!

Photo by author

Thank you for reading to the end! If you are interested in the full code of this project, please check out my GitHub. I also love feedback: if any part is unclear or could be done better, please reach out to me. Here is my LinkedIn.
