Predicting EEOC Discrimination Investigations
The results are bleak.
The primary objective of this post is to share the results of a data science project concerned with predicting the outcomes of the U.S. Equal Employment Opportunity Commission’s (EEOC) investigations into employment discrimination claims. This is a highly imbalanced (99:1), binary classification problem. As such, the goal is to build a model that can best predict the target minority class: the EEOC finding merit for employment discrimination. Long story short, it is difficult to reliably predict which investigations the EEOC will find to have merit. This is likely due to poor data quality.
In the last couple of years, several journalists have covered the state of discrimination claims and the EEOC’s ability to investigate them. The goal of this research is to see whether investigation outcomes can be predicted, in an attempt to help the under-budgeted EEOC prioritize investigations. It should be noted, however, that data quality has a long way to go before a meaningful model can be created. Currently, the data is biased by decades of under-budgeting and by an agency created with no real ability to protect most people who experience employment discrimination. Therefore, the data used does not provide full information, and the baselines likely underestimate both the true extent of discrimination claims in the U.S. and the full ability of the EEOC to determine discrimination from investigations.
Data comes from the Center for Public Integrity and includes all discrimination claims for the 2010 fiscal year.
I. Wrangle Data
Data wrangling had four parts: creating the target variable, engineering features, imputing categorical values, and dropping variables that were either redundant or prone to leakage.
The target variable (decision) was simplified, based on ‘Closure Type’, to include two possible outcomes: discrimination found and no discrimination found. Technically, however, a claim can be closed without any determination being made. For example, if the EEOC takes, or will take, longer than 180 days to complete an investigation, the complainant can request a notice of the right to sue (NRTS). In doing so, the EEOC automatically closes its investigation, and no determination on discrimination is made.
Some additional features were engineered to try to improve the predictive power of the models. These include: ‘age’ at the time of filing the claim, the ‘NAICS code’ broadened to the industry level, ‘investigation duration,’ and whether the complainant received ‘monetary benefits.’
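As a rough sketch, this feature engineering might look like the following in pandas. The column names (filing_date, birth_date, closure_date, naics_code, benefit_amount) are assumptions for illustration, not the dataset’s actual schema:

```python
import pandas as pd

def engineer_features(df):
    """Derive the engineered features described above (column names assumed)."""
    out = df.copy()
    # Age at the time of filing, in whole years
    out["age"] = (out["filing_date"] - out["birth_date"]).dt.days // 365
    # Broaden the 6-digit NAICS code to its 2-digit industry sector
    out["industry"] = out["naics_code"].astype(str).str[:2]
    # Investigation duration in days, from filing to closure
    out["investigation_duration"] = (out["closure_date"] - out["filing_date"]).dt.days
    # Binary flag: did the complainant receive any monetary benefit?
    out["monetary_benefits"] = (out["benefit_amount"].fillna(0) > 0).astype(int)
    return out
```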
II. Split Data
The model is based on time-series data for the 2010 fiscal year. As such, the data is split proportionally to preserve chronological order. The training set is the first 60% of rows; the validation and test sets are the next 20% each.
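A chronological 60/20/20 split can be sketched as follows, assuming a filing_date column (a hypothetical name) carries the time order:

```python
import pandas as pd

def chronological_split(df, date_col="filing_date"):
    """Split 60/20/20 in time order so later claims never leak into training."""
    df = df.sort_values(date_col).reset_index(drop=True)
    n = len(df)
    train = df.iloc[: int(n * 0.60)]
    val = df.iloc[int(n * 0.60): int(n * 0.80)]
    test = df.iloc[int(n * 0.80):]
    return train, val, test
```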
III. Establish Baseline
The baseline in a severely imbalanced binary classification problem is established by the prevalence of the minority class. For this dataset, the prevalence of investigations that find merit for discrimination is 0.0127. This score serves as the reference against which precision-recall area-under-curve (PR AUC) scores are measured in the evaluation phase.
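The no-skill baseline for PR AUC is just the positive-class prevalence, since a classifier that ranks claims at random achieves an average precision equal to the fraction of positives:

```python
import numpy as np

def prevalence_baseline(y):
    """PR AUC of a no-skill classifier equals the positive-class prevalence."""
    y = np.asarray(y)
    return y.mean()
```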
IV. Building the Models
SimpleImputer and StandardScaler were applied to the numerical features, and OrdinalEncoder and BinaryEncoder were applied to the ordinal and nominal categorical features, respectively.
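A minimal sketch of this preprocessing with scikit-learn’s ColumnTransformer follows. Note that BinaryEncoder lives in the category_encoders package; OneHotEncoder stands in for it here to keep the sketch dependency-free:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

def make_preprocessor(num_cols, ord_cols, nom_cols):
    """Impute + scale numerics; encode ordinal and nominal categoricals."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    return ColumnTransformer([
        ("num", numeric, num_cols),
        ("ord", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), ord_cols),
        # The project used category_encoders.BinaryEncoder for nominal features;
        # OneHotEncoder is a stdlib-scikit-learn stand-in
        ("nom", OneHotEncoder(handle_unknown="ignore"), nom_cols),
    ])
```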
The base model for this dataset was a logistic regression, and the alternative model was a random forest. To evaluate their performance I will use a macro-averaged f1_score (which weights both classes evenly) and the PR AUC score.
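Both metrics are available in scikit-learn; average_precision_score is a common stand-in for PR AUC. A small helper, as a sketch:

```python
from sklearn.metrics import average_precision_score, f1_score

def evaluate(y_true, y_pred, y_score):
    """Macro F1 weights both classes evenly; average precision approximates PR AUC."""
    return {
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
        "pr_auc": average_precision_score(y_true, y_score),
    }
```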
Linear Model: Logistic Regression
Out of the box, a logistic regression model returned a precision-recall AUC score of 0.033, slightly better than the baseline. However, the f1_score is below 0.50, which most likely indicates that the model is not predicting any of the claims as having discrimination.
Bagging Model: Random Forest
With no tuning, a random forest model returned a PR AUC score of 0.092, significantly better than the logistic regression model and baseline. The f1_score is also greater than 0.5, which likely means the model is classifying some claims as having cause for discrimination. Another good sign.
Confusion Matrices
The next step is to apply the better-fitting model, the random forest, to our test data to see how well it performs. We can use confusion matrices to accomplish this. To illustrate the results of the project, I’m including a confusion matrix for both the validation and test sets.
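As a sketch, a confusion matrix can be computed with scikit-learn; the labels below are illustrative toy values, not the project’s data:

```python
from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predictions: [[TN, FP], [FN, TP]]
y_true = [0, 0, 0, 0, 1, 1]  # 1 = discrimination found
y_pred = [0, 0, 0, 1, 0, 1]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
```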
We can see from these matrices that these models would have no practical use in predicting discrimination, especially when the number of positive claims is particularly low (the test set includes only 146 claims where discrimination was identified). On the validation set, we see great precision but a severe loss in recall.
In a practical setting where we hold government agencies accountable, a predictive model would serve the best interests of the workforce if it has high recall. That means it detects nearly all instances where employment discrimination has occurred. Usually, this also means it will get some predictions wrong (i.e., a claim that actually has no discrimination is labeled as having discrimination). A model with high precision, by contrast, may not capture all of the true discrimination cases, but when it does predict that a claim has merit, it is almost always right. Realistically, there is almost always a tradeoff between these two measures.
Permutation Importances
We can also consider the permutation importance of the features used in the models, which likewise exposes the models’ weaknesses.
Ideally, we want features to have positive importance as that is a sign they have predictive power. In this case, all of our features have positive importance but their magnitudes are fairly small (close to zero).
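Permutation importance shuffles one feature at a time and measures the drop in score; a near-zero drop means the model barely relies on that feature. A minimal sketch with scikit-learn, using synthetic data in place of the EEOC features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in for the claims data
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each feature n_repeats times and record the mean score drop
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
```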
V. Results + Recommendations
The results shine a light on some of the flaws in the quality of data collected on discrimination claims and the ambiguous nature of EEOC investigations. It is hard to believe that so many of the features would have such little predictive power, but this may stem from the absence of other important variables, most notably around investigation procedures. It’s unclear whether those procedures are standardized or what the evidentiary threshold is for making a decision on a claim. We also don’t know how the EEOC rates the strength of evidence provided by employees and their employers.
The Center for Public Integrity has a larger dataset covering discrimination claims from 2011–2017. It may be worthwhile to attempt this project again with more data. If a model with real predictive power can be established, we should question whether high precision was prioritized over high recall. Otherwise, it will be difficult to know the true effectiveness of the EEOC’s investigations.